docker-archive / for-azure

Newly provisioned swarm will not form the logical swarm on creation #55

Closed Korrd closed 6 years ago

Korrd commented 6 years ago

I've tried to create a new swarm using the template from https://docs.docker.com/docker-for-azure/

Expected behavior

A new swarm is created, and ready to use.

Actual behavior

A new swarm got created, but the logical swarm didn't form.

Information

Full output of the diagnostics from "docker-diagnose", run from one of the instances

Docker-diagnose session ID: 1520262540-jdnAcVkg6JSGbpQBqQyJw6WB30tuMT0z

Steps to reproduce the behavior

  1. Go to https://docs.docker.com/docker-for-azure/
  2. Create a swarm (stable channel)
  3. SSH into the swarm
  4. Run docker node ls
  5. You will get the following message: Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

FrenchBen commented 6 years ago

Are any of the nodes managers? There is sometimes a slight delay between node bring-up and leader election.

Korrd commented 6 years ago

Yes, three of the nodes are managers, and three are workers.

At first I thought it might just be slow to form the logical swarm, so I waited two hours, but it still had not formed.

FrenchBen commented 6 years ago

@Korrd I meant, can you ssh into any of the managers and get the nodes? Usually one of them will be the leader and sets itself as such. From there it's a bit of manual work to determine what failed, by looking at the different init logs.

Korrd commented 6 years ago

There is no logical swarm at all. The docker node ls command returns: Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again. This happens on every node that should be a manager.

FrenchBen commented 6 years ago

Could you re-run the diagnostics and make sure that it's uploaded? I'm not seeing anything on our end.

Korrd commented 6 years ago

swarm-manager000002:~$ docker-diagnose
Done requesting diagnostics.
Your diagnostics session ID is 1520295545-iqRaB7ZYYMhszpemqP4YQsYtFxP2Izsv
Please provide this session ID to the maintainer debugging your issue.
swarm-manager000002:~$

FrenchBen commented 6 years ago

Hmm, I'm not seeing anything very relevant. Does the same thing keep happening with new deployments?

Korrd commented 6 years ago

I have deleted the old swarm and re-deployed. This is the result:

swarm-manager000002:~$ docker node ls 
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
swarm-manager000002:~$ docker-diagnose
Done requesting diagnostics.
Your diagnostics session ID is 1520356286-D6biq2uescfSete2eCuMXNVdiISVMmcf
Please provide this session ID to the maintainer debugging your issue.
swarm-manager000002:~$

Korrd commented 6 years ago

Any news regarding this issue?

FrenchBen commented 6 years ago

I'm unable to replicate this issue at all - your logs aren't showing anything very relevant. Is this on a shared account, or are you the admin?

Can you join the Docker Community slack, so it's easier for us to discuss this?

Korrd commented 6 years ago

I'm a global admin. I wonder if my permission set has anything to do with it? I'm looking at the Azure side of things, but so far I haven't seen any errors or anything else pointing at an issue related to my account permissions.

EDeijl commented 6 years ago

I have the same issue, running docker-diagnose gave the following output:

OK hostname=swarm-manager000000 session=1520611240-8lVP6jqAupcJkk4nxGA2R8hRwpvC4nGD
OK hostname=swarm-worker000000 session=1520611240-8lVP6jqAupcJkk4nxGA2R8hRwpvC4nGD
Done requesting diagnostics.
Your diagnostics session ID is 1520611240-8lVP6jqAupcJkk4nxGA2R8hRwpvC4nGD
Please provide this session ID to the maintainer debugging your issue.

dealproc commented 6 years ago

Folks, I have gone through the same thing on a newly provisioned swarm cluster. My advice: once the scripts finish on Azure, step away from your desk for 10-15 minutes before starting work, because additional scripts still have to run to completion and they are not visible to you. I have this explicitly stated in our disaster recovery docs so that I do not forget it during a panic or crisis.

Korrd commented 6 years ago

I did. I waited two hours after provisioning, yet the logical swarm hadn't been created.

marcelvdh commented 6 years ago

We are having the same problem and cannot deploy Docker for Azure on any of our subscriptions anymore. There were no problems a few days ago.

A quick look around in the logs reveals that the script azureleader.py fails to start because it cannot load the module table. Changing it to cosmosdb causes the script to start up correctly.

Please find below the traceback as reported by Python:

Traceback (most recent call last):
 File "/usr/bin/azureleader.py", line 9, in <module>
   from azure.storage.table import TableService, Entity
ImportError: No module named table
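
The traceback above shows the init script importing the table client from a package path that is no longer present in the image. Below is a minimal sketch of a defensive import, assuming (as reported in this thread) that the same TableService and Entity names are exposed under azure.cosmosdb.table. The helper name import_first_available is hypothetical and is not part of azureleader.py.

```python
import importlib

# Candidate module paths, newest first. Both names come from this thread;
# whether either one is installed depends on the image being run.
TABLE_MODULE_CANDIDATES = ("azure.cosmosdb.table", "azure.storage.table")


def import_first_available(candidates):
    """Return the first importable module from candidates, or None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers a missing submodule as well as a missing parent package.
            continue
    return None


table_module = import_first_available(TABLE_MODULE_CANDIDATES)
if table_module is None:
    print("no table client module found; tried:", ", ".join(TABLE_MODULE_CANDIDATES))
else:
    print("using table client from", table_module.__name__)
```

With a fallback like this, a renamed package would surface as one clear message instead of an unhandled ImportError during node initialization.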

djeeg commented 6 years ago

I am getting the same issue while trying to rebuild my corrupted swarm.

Something in 17.12.1 must be breaking the deployment compared to 17.12.0. (I rebuilt my swarm only 10 days ago with 17.12.0 and it worked.)

My rebuild process is fully documented, so no changes on that front.

Is there a way to get access to the previous template? https://download.docker.com/azure/stable/Docker.tmpl

djeeg commented 6 years ago

Found the previous template https://download.docker.com/azure/stable/17.12.0/Docker.tmpl

(Good to know there are versioned templates; https://docs.docker.com/docker-for-azure/archive/ is out of date.)

Deployed 17.12.0 and it still works.

FrenchBen commented 6 years ago

@djeeg They were removed from archive, as it's preferable for users to deploy the latest and move forward, rather than deploy older templates, and open issues for what has been fixed in the latest release.

Korrd commented 6 years ago

I've tried the template mentioned above, and the issue is still there :(

FrenchBen commented 6 years ago

@Korrd can you provide some info around the swarm logs as seen in the different debug?

ztrange commented 6 years ago

Hello,

I have the same issue. I waited more than 15 minutes after it finished deploying, and I tried twice.

swarm-manager000000:~$ docker node ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
swarm-manager000000:~$ docker-diagnose
OK hostname=swarm-manager000000 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-manager000001 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-manager000002 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-worker000000 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-worker000001 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-worker000002 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
Done requesting diagnostics.
Your diagnostics session ID is 1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
Please provide this session ID to the maintainer debugging your issue.

I was looking for more logs in /var/log and in the xxxlogs storage account but could not find anything. The storage account had no files in there. I of course will provide any requested logs as soon as possible in order to make the swarm work.

I have been following the instructions here, mainly for the Principal thing for authorization: https://youtu.be/DQwyIpDcLAk

ztrange commented 6 years ago

I tried doing the deploy from docker cloud and was successful: https://docs.docker.com/docker-cloud/cloud-swarm/create-cloud-swarm-azure/

Is this the same as doing it with the template?

BTW, it says "Cluster Management in Docker Cloud will be discontinued on May 21." Does this mean that I will no longer be able to "Create a new swarm on Microsoft Azure in Docker Cloud"?

FrenchBen commented 6 years ago

@ztrange that's correct - The template will still be valid, but you will no longer get the connectivity via Docker Cloud.

sentinelt commented 6 years ago

18.03.0 is still broken.

As a workaround, execute the following command on each node:

  docker ps -a | grep init-azure | ( read ID OTHER; docker restart $ID; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/azureleader.py )

After this all the nodes in my cluster connect successfully:

> docker node ls
ID                            HOSTNAME              STATUS              AVAILABILITY        MANAGER STATUS      ENGINE VERSION
suqh6pq45p0j6py55mhf7uvkk *   swarm-manager000000   Ready               Active              Leader              18.03.0-ce
p7m2vf2smd4ij1j9aqr2yeswi     swarm-worker000000    Ready               Active                                  18.03.0-ce
s2al7kg2kfd9fkkoxz8ixjcy9     swarm-worker000001    Ready               Active                                  18.03.0-ce
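
The workaround one-liner relies on grep to pick the init-azure row out of the docker ps -a listing, and on read to take the first column of that row as the container ID. A sketch of that selection step in Python, working on captured output, is shown here. The function name and the sample rows are made up for illustration and do not come from a real node.

```python
def find_container_id(ps_output, needle="init-azure"):
    """Return the first column of the first line mentioning needle, else None.

    Mirrors `grep init-azure | ( read ID OTHER; ... )`: the match is a plain
    substring test against the whole row, and only the first match is used.
    """
    for line in ps_output.splitlines():
        if needle in line:
            fields = line.split()
            if fields:
                return fields[0]
    return None


# Hypothetical sample of `docker ps -a` output, not taken from a real node.
SAMPLE = """\
CONTAINER ID   IMAGE                    COMMAND        STATUS          NAMES
1a2b3c4d5e6f   example/agent:1.0        "/agent"       Up 2 hours      agent
0f9e8d7c6b5a   example/init-azure:1.0   "/entry.sh"    Exited (0)      init
"""

print(find_container_id(SAMPLE))
```

On a node, the input could come from subprocess.run(["docker", "ps", "-a"], capture_output=True, text=True).stdout. Returning None when nothing matches makes it harder to run docker restart with an empty ID by accident.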

Here is the patch to fix the problem permanently:

--- /a/usr/bin/azureleader.py
+++ /b/usr/bin/azureleader.py
@@ -6,7 +6,7 @@
 from azure.mgmt.resource import ResourceManagementClient
 from azure.mgmt.storage import StorageManagementClient
 from azure.mgmt.storage.models import StorageAccountCreateParameters
-from azure.storage.table import TableService, Entity
+from azure.cosmosdb.table import TableService, Entity
 from azendpt import AZURE_PLATFORMS, AZURE_DEFAULT_ENV

 PARTITION_NAME = 'tokens'
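
For reference, the same one-line change can be expressed as a text substitution, which is what the sed workaround does inside the running container. This is a sketch, assuming the import line appears exactly as in the diff; rewrite_table_import is a hypothetical helper, not something shipped with the image.

```python
OLD_IMPORT = "from azure.storage.table "
NEW_IMPORT = "from azure.cosmosdb.table "


def rewrite_table_import(source):
    """Replace the old table import prefix with the new one.

    Returns the rewritten text and the number of replacements made. Unlike
    the sed expression, the match is literal (dots are not wildcards).
    Running it on already-patched text changes nothing, so it can be repeated.
    """
    count = source.count(OLD_IMPORT)
    return source.replace(OLD_IMPORT, NEW_IMPORT), count


before = (
    "from azure.mgmt.storage import StorageManagementClient\n"
    "from azure.storage.table import TableService, Entity\n"
)
after, changed = rewrite_table_import(before)
print(changed)
print(after, end="")
```

Reporting the replacement count makes it easy to tell whether a given file still needed the fix.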

Also, as a side note: this is the second release that has this critical problem. Looks like no one cares to run this sh*t at least once to check whether it works before making a new release.

FrenchBen commented 6 years ago

@sentinelt I understand the frustration. Please keep in mind that this deployment is using the different OSS releases underneath it, and is provided to the community as such. We try to catch errors upstream, but some end up slipping through. In this case, Azure updated their libraries for all storage, which also broke parts of our build. See my rant here: https://github.com/Azure/azure-storage-blob-go/issues/35

Thank you for the patch, it's been added to our main repo, and will be part of our next release.

djeeg commented 6 years ago

@sentinelt's script does get an 18.03.0 swarm to form; however, cloudstor appears to have the same storage reference issue.

The plugin starts disabled:

swarm-manager000000:$ docker plugin ls
ID                  NAME                   DESCRIPTION                       ENABLED
0afc5d4f0122        cloudstor:azure        cloud storage plugin for Docker   false

Trying to enable it fails:

swarm-manager000000:$ docker plugin enable 0afc5d4f0122
Error response from daemon: dial unix /run/docker/plugins/0afc5d4f0122/cloudstor.sock: connect: no such file or directory

Check the init logs; note the azure.storage.table reference in sakey.py:

Install cloudstor …
Install storage plugin
Traceback (most recent call last):
  File "/usr/bin/sakey.py", line 9, in <module>
    from azure.storage.table import TableService, Entity
ImportError: No module named table
18.03.0-ce-azure1: Pulling from docker4x/cloudstor
8bb80f59b17d: Download complete
Digest: sha256:84cb62d9fd8904f69d681af000fe82d7555944a566349c651ae7b65dc36900db
Status: Downloaded newer image for docker4x/cloudstor:18.03.0-ce-azure1
Error response from daemon: dial unix /run/docker/plugins/450ec08efc55342/cloudstor.sock: connect: no such file or directory

As yet I have not figured out a way to get cloudstor enabled (i.e. by adapting @sentinelt's script):

docker ps -a | grep init-azure | ( read ID OTHER; docker restart $ID; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/sakey.py )

djeeg commented 6 years ago

Oh, I see now: I need to delete the plugin before re-running the init container. The init script installs it with:

install_cloudstor_plugin()
{
    echo "Install storage plugin"
    SA_KEY=$(sakey.py)
    docker plugin install --alias cloudstor:azure --grant-all-permissions docker4x/cloudstor:$DOCKER_FOR_IAAS_VERSION  \
        CLOUD_PLATFORM=AZURE \
        AZURE_STORAGE_ACCOUNT_KEY="$SA_KEY" \
        AZURE_STORAGE_ACCOUNT="$SWARM_INFO_STORAGE_ACCOUNT" \
        AZURE_STORAGE_ENDPOINT="$STORAGE_ENDPOINT" \
        DEBUG=1
}

(Alternatively, running the script to restart the container and update /usr/bin/sakey.py as soon as the VM boots also works.)

Praggie commented 6 years ago

Is this issue really fixed? I have been facing it since yesterday. The following is the output from docker-diagnose:

Your diagnostics session ID is 1525171459-yvkHDg02EB41cDTFeehHJCP4MLi9TNX2
Please provide this session ID to the maintainer debugging your issue.

djeeg commented 6 years ago

It wasn't fixed on the stable channel, aka 18.03.0-ce-azure1.

It may be fixed in 18.03.0-ce-azure2, but I haven't figured out a way to install it: https://hub.docker.com/r/docker4x/init-azure/tags/

I see there is now 18.04.0-ce-azure1 on the edge channel; it might be fixed there.

In the meantime I'm using this command on new nodes to fix both issues:

docker plugin rm cloudstor:azure || true &&
    docker ps -a | \
    grep init-azure | \
    ( read ID OTHER; docker restart $ID; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/azureleader.py; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/sakey.py ) &&
    docker logs -f $(docker ps -a | grep init-azure | awk '{print $1}')

alexsandro-xpt commented 6 years ago

Thank you @djeeg, you saved my day!

nreynis commented 6 years ago

Despite being closed, the issue is still not fixed on 18.04.0-ce