HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Quick Start Never Completes using docker on WSL2 #103

Closed chrismooney closed 2 years ago

chrismooney commented 2 years ago

I am running docker on WSL2 (Windows Subsystem for Linux) using Ubuntu as the distro.

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04 LTS"

Docker version 20.10.8, build 3967b7d

When I tried to start HSDS using the quick start instructions, this message kept repeating:

waiting for server startup (status: 503)

and finally failed with:

service failed to start
SN_1 logs:
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
INFO> healthCheck - node_state: INITIALIZING
INFO> register: http://hsds_head:5100/register
INFO> register req: http://hsds_head:5100/register body: {'id': 'sn-f80ac', 'port': 5101, 'node_type': 'sn'}
INFO> http_post('http://hsds_head:5100/register', {'id': 'sn-f80ac', 'port': 5101, 'node_type': 'sn'})
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
REQ> GET: /about [localhost:5101]
WARN: returning 503 - node_state: INITIALIZING
ERROR> Error for http_post(http://hsds_head:5100/register): Cannot connect to host hsds_head:5100 ssl:default [Name or service not known]
ERROR> HEAD node seems to be down.

docker ps showed the following:

CONTAINER ID   IMAGE           COMMAND            CREATED         STATUS         PORTS                                                                NAMES
41ab0e0aaf5e   hdfgroup/hsds   "/entrypoint.sh"   4 minutes ago   Up 4 minutes   5100-5999/tcp, 0.0.0.0:49161->6101/tcp, :::49161->6101/tcp           hsds_dn_1
fd4bdc4a6df0   hdfgroup/hsds   "/entrypoint.sh"   4 minutes ago   Up 4 minutes   5100/tcp, 5102-5999/tcp, 0.0.0.0:5101->5101/tcp, :::5101->5101/tcp   hsds_sn_1
27271d6ac262   hdfgroup/hsds   "/entrypoint.sh"   4 minutes ago   Up 4 minutes   5100-5999/tcp, 0.0.0.0:49160->6900/tcp, :::49160->6900/tcp           hsds_rangeget_1
2d4ef0a42e0e   hdfgroup/hsds   "/entrypoint.sh"   4 minutes ago   Up 4 minutes   5101-5999/tcp, 0.0.0.0:49159->5100/tcp, :::49159->5100/tcp           hsds_head_1

And the head node logs:

hsds entrypoint
node type: head_node
running hsds-headnode
INFO> Head node initializing
INFO> using bucket: hsdstest
INFO> not setting is_dcos
INFO> Starting service on port: 5100
======== Running on http://0.0.0.0:5100 ========
(Press CTRL+C to quit)

Data Node:

INFO> register: http://hsds_head:5100/register
INFO> register req: http://hsds_head:5100/register body: {'id': 'dn-6606a', 'port': 6101, 'node_type': 'dn'}
INFO> http_post('http://hsds_head:5100/register', {'id': 'dn-6606a', 'port': 6101, 'node_type': 'dn'})
INFO> s3sync - clusterstate is not ready, sleeping
INFO> s3sync - clusterstate is not ready, sleeping
INFO> s3sync - clusterstate is not ready, sleeping
INFO> s3sync - clusterstate is not ready, sleeping
ERROR> Error for http_post(http://hsds_head:5100/register): Cannot connect to host hsds_head:5100 ssl:default [Name or service not known]
ERROR> HEAD node seems to be down.

And finally rangeget:

hsds entrypoint
node type: rangeget
running hsds-rangeget
INFO> rangeget_proxy start
INFO> Using data cache size of: 134217728
INFO> Setting data page size to: 4194304
INFO> Setting data cache expire time to: 3600
INFO> run_app on port: 6900
======== Running on http://0.0.0.0:6900 ========
(Press CTRL+C to quit)

Is there something I can do to help diagnose and fix this issue?

jreadey commented 2 years ago

Looks like the SN/DN containers are trying to contact the head node using http://hsds_head:5100, but they should be using http://hsds_head_1:5100. That seems to be due to some drift between the hdfgroup/hsds:latest image on Docker Hub and the compose file found in the repo.

I updated the hdfgroup/hsds:latest image on Docker Hub to point to the latest bits. Could you try doing a "docker rmi hdfgroup/hsds:latest" and then trying ./runall.sh again?

Also, I've updated the README to be a little more clear.
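
For anyone hitting the same "Name or service not known" error, the hostname mismatch described above can be checked directly. The following is only a rough sketch, not part of HSDS: the helper names `register_hosts` and `resolves` are made up for illustration, and the regular expression assumes the log format shown in this thread. It pulls the hostname out of the "register:" log entries and offers a way to test whether a name resolves.

```python
import re
import socket
from urllib.parse import urlparse

# Matches log entries such as "INFO> register: http://hsds_head:5100/register".
# The pattern is an assumption based on the log output quoted in this thread.
REGISTER_RE = re.compile(r"register: (http://\S+)")


def register_hosts(log_text):
    """Return the set of hostnames found in 'register: <url>' log entries."""
    return {urlparse(url).hostname for url in REGISTER_RE.findall(log_text)}


def resolves(hostname):
    """Return True if hostname resolves from wherever this code runs."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


# A log entry copied from the SN output above.
sample_log = "INFO> register: http://hsds_head:5100/register"
hosts = register_hosts(sample_log)
```

Here `hosts` contains the name the node is trying to reach (`hsds_head`). Calling `resolves("hsds_head")` and `resolves("hsds_head_1")` only gives a meaningful answer when run from a container attached to the same compose network (assuming Python is available there), since those names are not visible from the host.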

chrismooney commented 2 years ago

John,

It worked perfectly! Thank you!

no AWS or AZURE env set, using admin/docker/docker-compose.posix.yml
Running docker-compose -f admin/docker/docker-compose.posix.yml up
Creating network "hsds_default" with the default driver
Pulling head (hdfgroup/hsds:)...
latest: Pulling from hdfgroup/hsds
f7ec5a41d630: Pull complete
3ecd8a7176d5: Pull complete
f1400e862d9f: Pull complete
1fc71a5753a5: Pull complete
b4b4f72793ed: Pull complete
617506fc2f63: Pull complete
c96f75bd0e83: Pull complete
Digest: sha256:5157765a0db1d3b60e37a1d90467d1eb5889bd2f307860ba46318479bf3fd9fd
Status: Downloaded newer image for hdfgroup/hsds:latest
Creating hsds_head_1 ... done
Creating hsds_rangeget_1 ... done
Creating hsds_sn_1 ... done
Creating hsds_dn_1 ... done
service ready!
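
The "service ready!" state can also be confirmed independently of the startup script by polling the SN's /about endpoint, which returned 503 while the node was INITIALIZING. This is a minimal sketch under assumptions: the function names, the attempt count, and the delay are illustrative and are not taken from runall.sh; the endpoint http://localhost:5101 comes from the port mapping and logs earlier in this thread.

```python
import time
import urllib.error
import urllib.request


def about_status(endpoint="http://localhost:5101"):
    """Return the HTTP status of GET <endpoint>/about, or None if unreachable."""
    try:
        with urllib.request.urlopen(endpoint + "/about", timeout=5) as rsp:
            return rsp.status
    except urllib.error.HTTPError as e:
        # The server answered with an error status such as 503.
        return e.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: no status available.
        return None


def wait_until_ready(get_status=about_status, attempts=20, delay=5.0):
    """Poll get_status() up to `attempts` times, sleeping `delay` seconds
    after each unsuccessful try. Return True on the first 200, else False."""
    for _ in range(attempts):
        if get_status() == 200:
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready()` with the defaults polls the local SN. Passing a different `get_status` callable makes the loop easy to exercise without a running server.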

jreadey commented 2 years ago

Good to know that HSDS runs on WSL2! (haven't tried it myself)