NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Proxy failed to launch after Helm install #127

Closed superleo closed 1 year ago

superleo commented 1 year ago

Env: Kubernetes 1.23.1 on Ubuntu 20.04, x86_64, Helm 3

Problem:

Here is the launch error: daemon.go:142 FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64:

Logs:


kubectl logs ais-proxy-2
aisnode proxy container startup at Tue May  9 07:31:52 UTC 2023
'/var/ais_config/ais.json' -> '/etc/ais/ais.json'
'/var/ais_config/ais_local.json' -> '/etc/ais/ais_local.json'
'/var/statsd_config/statsd.json' -> '/opt/statsd/statsd.conf'
No cached .ais.smap
9 May 07:31:52 - [11] reading config file: /opt/statsd/statsd.conf
9 May 07:31:52 - server is up INFO
aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=proxy -alsologtostderr=true -stderrthreshold=1  -allow_shared_no_disks=false
E 07:31:54.234343 daemon.go:142 FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64: unexpected character: �, error found in #10 byte of ...|rol":   "",
      "p|..., bigger context ...|         "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""
  }
}|...
FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64: unexpected character: �, error found in #10 byte of ...|rol":   "",
      "p|..., bigger context ...|         "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""
  }
}|...

The ais_local.json file:

{
  "confdir": "/etc/ais",
  "log_dir": "/var/log/ais",
  "host_net": {
      "hostname":                 "${AIS_PUB_HOSTNAME}",
      "hostname_intra_control":   "${AIS_INTRA_HOSTNAME}",
      "hostname_intra_data":      "${AIS_DATA_HOSTNAME}",
      "port":                 "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""
  }
}
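
The `readUint64` error above suggests that the loader expects the port fields to hold unsigned integers, so an empty string cannot be parsed. A minimal pre-flight check, assuming only the file layout shown here, might look like the sketch below. The function name `check_local_config` and its rules are hypothetical and are not part of AIStore:

```python
import json
import re

# Matches shell-style placeholders such as ${AIS_PUB_HOSTNAME} that were
# never substituted before the file was written.
PLACEHOLDER = re.compile(r"\$\{[^}]+\}")


def check_local_config(text):
    """Return a list of problems found in the host_net section.

    Hypothetical helper: the rules below are inferred from the error log
    in this issue, not taken from the AIStore source.
    """
    problems = []
    host_net = json.loads(text).get("host_net", {})
    for key, value in host_net.items():
        if isinstance(value, str) and PLACEHOLDER.search(value):
            problems.append(f"{key}: unexpanded placeholder {value!r}")
        elif key.startswith("port") and not str(value).isdigit():
            problems.append(f"{key}: not a port number: {value!r}")
    return problems


# The host_net section exactly as posted in this issue.
SAMPLE = """
{
  "confdir": "/etc/ais",
  "log_dir": "/var/log/ais",
  "host_net": {
      "hostname":                 "${AIS_PUB_HOSTNAME}",
      "hostname_intra_control":   "${AIS_INTRA_HOSTNAME}",
      "hostname_intra_data":      "${AIS_DATA_HOSTNAME}",
      "port":                 "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""
  }
}
"""

for problem in check_local_config(SAMPLE):
    print(problem)
```

Run against the file above, the check flags the three unexpanded hostnames and the two empty port fields, and accepts `"port": "51080"`.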

And here is the log after hardcoding the hostnames:


'/var/ais_config/ais.json' -> '/etc/ais/ais.json'
'/var/ais_config/ais_local.json' -> '/etc/ais/ais_local.json'
'/var/statsd_config/statsd.json' -> '/opt/statsd/statsd.conf'
No cached .ais.smap
9 May 09:06:19 - [11] reading config file: /opt/statsd/statsd.conf
9 May 09:06:19 - server is up INFO
aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=proxy -alsologtostderr=true -stderrthreshold=1  -allow_shared_no_disks=false -ntargets=3
W 09:06:21.127947 config.go:1716 load initial global config "/etc/ais/ais.json"
E 09:06:21.135042 daemon.go:142 FATAL ERROR: failed to load initial global config "/etc/ais/ais.json": cmn.ClusterConfig.ReadObject: found unknown field: compression, error found in #10 byte of ...|mpression": {
      |..., bigger context ...| },
  "backend": {

  },
  "compression": {
          "block_size":   262144,
          "c|...
FATAL ERROR: failed to load initial global config "/etc/ais/ais.json": cmn.ClusterConfig.ReadObject: found unknown field: compression, error found in #10 byte of ...|mpression": {
      |..., bigger context ...| },
  "backend": {

  },
  "compression": {
          "block_size":   262144,
          "c|...
superleo commented 1 year ago

It works if I downgrade the aisnode image from 3.11 to 3.4 (on Docker Hub), but the target pod still does not launch successfully.
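
The `found unknown field: compression` error in the earlier log indicates that the 3.11 config reader rejects keys it does not know, which is consistent with an older image accepting the same file. A rough way to spot such keys before deploying is sketched below. The set of known fields is defined by the aisnode version in use, so the set passed here is a placeholder and `unknown_top_level_fields` is a hypothetical helper:

```python
import json


def unknown_top_level_fields(config_text, known_fields):
    """Return the top-level keys of a JSON config that are not in known_fields.

    Hypothetical helper: the caller supplies known_fields, because the
    real list depends on the aisnode version and is not shown in this issue.
    """
    config = json.loads(config_text)
    return sorted(set(config) - set(known_fields))


# Trimmed-down stand-in for ais.json, using only keys visible in the log.
EXAMPLE_CONFIG = '{"backend": {}, "compression": {"block_size": 262144}}'

# Placeholder set for illustration only, not the real ClusterConfig fields.
EXAMPLE_KNOWN = {"backend"}

print(unknown_top_level_fields(EXAMPLE_CONFIG, EXAMPLE_KNOWN))
```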

alex-aizman commented 1 year ago

Well, you basically answered your own question: the downgrade helps, albeit in a very limited way. Long story short, we currently support backward compatibility at most two versions back. In other words, you could, for instance, upgrade from 3.15 to 3.17. In addition, we try hard not to change persistent formats unless there's a very good reason to do so.

In particular:

aisnode args: -config=/etc/ais/ais.json -local_config=/etc/ais/ais_local.json -role=proxy -alsologtostderr=true -stderrthreshold=1  -allow_shared_no_disks=false
E 07:31:54.234343 daemon.go:142 FATAL ERROR: failed to load plain-text local config "/etc/ais/ais_local.json": cmn.LocalConfig.HostNet: cmn.LocalNetConfig.PortIntraData: PortIntraControl: readUint64: unexpected character: �, error found in #10 byte of ...|rol":   "",
      "p|..., bigger context ...|         "51080",
      "port_intra_control":   "",
      "port_intra_data":      ""

tells us that there's something wrong with port_intra_control. And indeed there is:

$ git log -p deploy/dev/local/aisnode_config.sh
...
commit 6053fe2a0d12b9c25212919ce97e420c344c1e95
Author: Prashanth Dintyala <saiprashanth173@gmail.com>
Date:   Fri Feb 19 10:47:19 2021 -0800

    general: separate global and daemon config

    Signed-off-by: Prashanth Dintyala <saiprashanth173@gmail.com>

-                       "port_intra_control": "${PORT_INTRA_CONTROL:-9080}",
-                       "port_intra_data":    "${PORT_INTRA_DATA:-10080}",
...
...

As you can see, the format was changed more than two years ago. That's a long time from any perspective, including backward compatibility.
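
For illustration, the two-version window mentioned above can be written as a small check. This is a sketch under one reading of that rule, namely that "versions" are minor releases within the same major version; it is not an official compatibility policy, and `within_compat_window` is a hypothetical name:

```python
def parse_version(version):
    """Split a 'major.minor' string such as '3.17' into two integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def within_compat_window(old, new, window=2):
    """Return True if `new` is at most `window` minor releases ahead of `old`.

    Assumption: both versions share the same major number and the window
    counts minor releases. This is an interpretation, not a documented rule.
    """
    old_major, old_minor = parse_version(old)
    new_major, new_minor = parse_version(new)
    return old_major == new_major and 0 <= new_minor - old_minor <= window


print(within_compat_window("3.15", "3.17"))  # the example from this thread
print(within_compat_window("3.4", "3.11"))   # the gap hit in this issue
```

Under this reading, the 3.15 to 3.17 upgrade falls inside the window, while a config written for 3.4 is far outside what a 3.11 image can be expected to load.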

superleo commented 1 year ago

It works now. Thank you, Alex.