Closed Korrd closed 6 years ago
Are any of the nodes managers? There's some light delay at times from the node bring up and the leader election.
Yes, three of the nodes are managers, and three are workers.
At first I thought it might be slow to create the logical swarm, so I waited two hours, and it hadn't yet created it.
@Korrd I meant, can you ssh into any of the managers and get the nodes? Usually one of them will be the leader and sets itself as such. From there it's a bit of manual work to determine what failed, by looking at the different init logs.
There is no logical swarm at all. The docker node ls command returns: Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
This happens on all manager-should-be nodes.
Could you re-run the diagnostics and make sure that it's uploaded? I'm not seeing anything on our end.
swarm-manager000002:~$ docker-diagnose
Done requesting diagnostics.
Your diagnostics session ID is 1520295545-iqRaB7ZYYMhszpemqP4YQsYtFxP2Izsv
Please provide this session ID to the maintainer debugging your issue.
swarm-manager000002:~$
Hmm, I'm not seeing anything very relevant. Does the same thing keep happening with new deployments?
I have deleted the old swarm and re-deployed. This is the result:
swarm-manager000002:~$ docker node ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
swarm-manager000002:~$ docker-diagnose
Done requesting diagnostics.
Your diagnostics session ID is 1520356286-D6biq2uescfSete2eCuMXNVdiISVMmcf
Please provide this session ID to the maintainer debugging your issue.
swarm-manager000002:~$
Any news regarding this issue?
I'm unable to replicate this issue at all - your logs aren't showing anything very relevant. Is this on a shared account, or are you the admin?
Can you join the Docker Community slack, so it's easier for us to discuss this?
I'm global admin. I wonder if my permission set has anything to do with it? I'm looking at the azure side of things, but so far I haven't seen errors nor anything pointing at an issue related to my account permissions.
I have the same issue. Running docker-diagnose gave the following output:
OK hostname=swarm-manager000000 session=1520611240-8lVP6jqAupcJkk4nxGA2R8hRwpvC4nGD
OK hostname=swarm-worker000000 session=1520611240-8lVP6jqAupcJkk4nxGA2R8hRwpvC4nGD
Done requesting diagnostics.
Your diagnostics session ID is 1520611240-8lVP6jqAupcJkk4nxGA2R8hRwpvC4nGD
Please provide this session ID to the maintainer debugging your issue.
Folks, I have gone through the same thing on a newly provisioned swarm cluster. My advice is, once the scripts finish on Azure, leave your home office (or place of work) for 10-15 minutes and then begin work, as there are additional scripts that have to run to completion that are not apparent to you. I have this explicitly stated as part of our disaster recovery docs so that during a panic/crisis moment, I do not forget it.
I did. I waited two hours after provisioning, yet the logical swarm hadn't been created.
We are having the same problem and cannot deploy Docker for Azure on any of our subscriptions anymore. There were no problems a few days ago.
A quick look around in the logs reveals that the script azureleader.py fails to start because it cannot load the module table. Changing it to cosmosdb causes the script to start up correctly.
Please find below the traceback as reported by Python:
Traceback (most recent call last):
  File "/usr/bin/azureleader.py", line 9, in <module>
    from azure.storage.table import TableService, Entity
ImportError: No module named table
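The traceback above points at an import path that no longer resolves. One defensive pattern is to try several module paths in order and use the first that loads. This is only a minimal sketch: the helper name `first_importable` is hypothetical and not part of azureleader.py, and the candidate list assumes, as reported in this thread, that the table classes are now found under azure.cosmosdb.table.

```python
import importlib


def first_importable(candidates):
    """Return the first module in `candidates` that can be imported.

    Raises ImportError naming every path tried if none of them load.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # This path is not available in the installed SDK; try the next.
            continue
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(candidates)
    )


# Assumption from this thread: try the new location first and keep the
# old one as a fallback for images that still ship the older SDK.
TABLE_MODULE_CANDIDATES = ["azure.cosmosdb.table", "azure.storage.table"]
```

A script could then call `first_importable(TABLE_MODULE_CANDIDATES)` and read `TableService` and `Entity` from the returned module, so the same code would tolerate either SDK layout.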
Am getting the same issue, trying to rebuild my corrupted swarm.
Something in 17.12.1 must be breaking the deployment compared to 17.12.0. (I rebuilt my swarm only 10 days ago with 17.12.0 and it worked.)
My rebuild process is fully documented, so no changes on that front.
Is there a way to get access to the previous template? https://download.docker.com/azure/stable/Docker.tmpl
Found the previous template: https://download.docker.com/azure/stable/17.12.0/Docker.tmpl
(Good to know the versioned templates are there; https://docs.docker.com/docker-for-azure/archive/ is out of date.)
Deployed 17.12.0 and it still works.
@djeeg They were removed from archive, as it's preferable for users to deploy the latest and move forward, rather than deploy older templates, and open issues for what has been fixed in the latest release.
I've tried with the template mentioned above, and the issue is still there :(
@Korrd can you provide some info around the swarm logs as seen in the different debug?
Hello,
I have the same issue, also waited more than 15 minutes after it finished deploying. I also tried twice
swarm-manager000000:~$ docker node ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
swarm-manager000000:~$ docker-diagnose
OK hostname=swarm-manager000000 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-manager000001 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-manager000002 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-worker000000 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-worker000001 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
OK hostname=swarm-worker000002 session=1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
Done requesting diagnostics.
Your diagnostics session ID is 1521695645-i3o6933s5EeR0MHQdbHq8PqseAr0B9Cq
Please provide this session ID to the maintainer debugging your issue.
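When several people paste docker-diagnose output, it can help to pull out the session ID and the list of hosts that answered. The sketch below assumes the output always has the line shapes shown in this thread; `parse_diagnose` is a hypothetical helper, not something shipped with Docker for Azure.

```python
import re

# Line shapes copied from the docker-diagnose output pasted in this thread.
HOST_LINE = re.compile(r"^OK hostname=(\S+) session=(\S+)$")
SESSION_LINE = re.compile(r"^Your diagnostics session ID is (\S+)$")


def parse_diagnose(output):
    """Return (session_id, hostnames) found in docker-diagnose output.

    session_id is None when no session line is present.
    """
    session_id = None
    hostnames = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        host = HOST_LINE.match(line)
        if host:
            hostnames.append(host.group(1))
            continue
        session = SESSION_LINE.match(line)
        if session:
            session_id = session.group(1)
    return session_id, hostnames
```

Comparing the returned hostnames against the nodes you expect (three managers and three workers in the output above) would show whether any node failed to answer the diagnostics request.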
I was looking for more logs in /var/log and in the xxxlogs storage account but could not find anything. The storage account had no files in there. I of course will provide any requested logs as soon as possible in order to make the swarm work.
I have been following the instructions here, mainly for the Principal thing for authorization: https://youtu.be/DQwyIpDcLAk
I tried doing the deploy from docker cloud and was successful: https://docs.docker.com/docker-cloud/cloud-swarm/create-cloud-swarm-azure/
Is this the same as doing it with the template?
BTW, it says "Cluster Management in Docker Cloud will be discontinued on May 21." Does this mean that I will no longer be able to "Create a new swarm on Microsoft Azure in Docker Cloud"?
@ztrange that's correct - The template will still be valid, but you will no longer get the connectivity via Docker Cloud.
18.03.0 is still broken.
As a workaround execute the following command on each node:
docker ps -a | grep init-azure | ( read ID OTHER; docker restart $ID; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/azureleader.py )
After this all the nodes in my cluster connect successfully:
> docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
suqh6pq45p0j6py55mhf7uvkk * swarm-manager000000 Ready Active Leader 18.03.0-ce
p7m2vf2smd4ij1j9aqr2yeswi swarm-worker000000 Ready Active 18.03.0-ce
s2al7kg2kfd9fkkoxz8ixjcy9 swarm-worker000001 Ready Active 18.03.0-ce
Here is the patch to fix the problem permanently:
--- /a/usr/bin/azureleader.py
+++ /b/usr/bin/azureleader.py
@@ -6,7 +6,7 @@
 from azure.mgmt.resource import ResourceManagementClient
 from azure.mgmt.storage import StorageManagementClient
 from azure.mgmt.storage.models import StorageAccountCreateParameters
-from azure.storage.table import TableService, Entity
+from azure.cosmosdb.table import TableService, Entity
 from azendpt import AZURE_PLATFORMS, AZURE_DEFAULT_ENV
 PARTITION_NAME = 'tokens'
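For anyone who wants to check the rewrite on a copy of the script before touching a node, the same substitution can be sketched in Python. This is an illustration only: `patch_table_import` is a hypothetical helper, and unlike the sed expression in the workaround it replaces every occurrence in the text rather than the first one on each line.

```python
# Prefixes taken from the sed expression and patch posted in this thread.
OLD_IMPORT = "from azure.storage.table "
NEW_IMPORT = "from azure.cosmosdb.table "


def patch_table_import(source):
    """Return `source` with the old table import prefix replaced.

    Text without the old prefix is returned unchanged, so applying the
    function twice gives the same result as applying it once.
    """
    return source.replace(OLD_IMPORT, NEW_IMPORT)
```

Other imports such as `from azure.mgmt.storage import ...` are left alone because the match includes the full `azure.storage.table` prefix and its trailing space.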
Also, as a side note: this is the second release that has this critical problem. Looks like no one cares to run this sh*t at least once to check whether it works before making a new release.
@sentinelt I understand the frustration. Please keep in mind that this deployment is using the different OSS releases underneath it, and is provided to the community as such. We try to catch errors upstream, but some end up slipping through. In this case, Azure updated their libraries for all storage, which also broke parts of our build. See my rant here: https://github.com/Azure/azure-storage-blob-go/issues/35
Thank you for the patch, it's been added to our main repo, and will be part of our next release.
@sentinelt's script does get an 18.03.0 swarm to form; however, cloudstor looks to have the same storage reference issue.
Plugin starts disabled
swarm-manager000000:$ docker plugin ls
ID NAME DESCRIPTION ENABLED
0afc5d4f0122 cloudstor:azure cloud storage plugin for Docker false
Try to enable it
swarm-manager000000:$ docker plugin enable 0afc5d4f0122
Error response from daemon: dial unix /run/docker/plugins/0afc5d4f0122/cloudstor.sock: connect: no such file or directory
Check init logs, see [azure.storage.table] reference in sakey.py
Install cloudstor …
Install storage plugin
Traceback (most recent call last):
  File "/usr/bin/sakey.py", line 9, in <module>
    from azure.storage.table import TableService, Entity
ImportError: No module named table
18.03.0-ce-azure1: Pulling from docker4x/cloudstor
8bb80f59b17d: Download complete
Digest: sha256:84cb62d9fd8904f69d681af000fe82d7555944a566349c651ae7b65dc36900db
Status: Downloaded newer image for docker4x/cloudstor:18.03.0-ce-azure1
Error response from daemon: dial unix /run/docker/plugins/450ec08efc55342/cloudstor.sock: connect: no such file or directory
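Since the same import failure has now shown up in two scripts, scanning the init log for every such traceback may be quicker than reading it by eye. The sketch below assumes the Python 2 style messages seen in this thread ("ImportError: No module named table"); `find_import_errors` is a hypothetical helper.

```python
import re

# Patterns shaped after the tracebacks pasted in this thread.
FILE_LINE = re.compile(r'^\s*File "([^"]+)", line (\d+), in ')
ERROR_LINE = re.compile(r"^ImportError: No module named (\S+)")


def find_import_errors(log_text):
    """Return a list of (script_path, missing_module) pairs from a log.

    Each ImportError line is paired with the most recent traceback
    "File ..." line that preceded it.
    """
    failures = []
    last_file = None
    for line in log_text.splitlines():
        file_match = FILE_LINE.match(line)
        if file_match:
            last_file = file_match.group(1)
            continue
        error_match = ERROR_LINE.match(line.strip())
        if error_match and last_file is not None:
            failures.append((last_file, error_match.group(1)))
            last_file = None
    return failures
```

Run against the log above, this should report /usr/bin/sakey.py with the missing module table, which is the script that still needs the import rewritten.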
As yet I have not figured out a way to get cloudstor enabled (i.e. using @sentinelt's script):
docker ps -a | grep init-azure | ( read ID OTHER; docker restart $ID; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/sakey.py )
Oh I see now, I need to delete the plugin before re-running the init container
install_cloudstor_plugin()
{
    echo "Install storage plugin"
    SA_KEY=$(sakey.py)
    docker plugin install --alias cloudstor:azure --grant-all-permissions docker4x/cloudstor:$DOCKER_FOR_IAAS_VERSION \
        CLOUD_PLATFORM=AZURE \
        AZURE_STORAGE_ACCOUNT_KEY="$SA_KEY" \
        AZURE_STORAGE_ACCOUNT="$SWARM_INFO_STORAGE_ACCOUNT" \
        AZURE_STORAGE_ENDPOINT="$STORAGE_ENDPOINT" \
        DEBUG=1
}
(or running the script to restart/update /usr/bin/sakey.py as soon as the VM boots also works)
Is this issue really fixed? I have been facing it since yesterday. The following is the output from docker-diagnose:
Your diagnostics session ID is 1525171459-yvkHDg02EB41cDTFeehHJCP4MLi9TNX2
Please provide this session ID to the maintainer debugging your issue.
It wasn't fixed on the stable channel, aka 18.03.0-ce-azure1.
It may be fixed on 18.03.0-ce-azure2, but I haven't figured out a way to install it: https://hub.docker.com/r/docker4x/init-azure/tags/
I see there is now 18.04.0-ce-azure1 on the edge channel; it might be fixed there.
In the meantime I'm using this command on new nodes to fix both issues:
docker plugin rm cloudstor:azure || true &&
docker ps -a | \
grep init-azure | \
( read ID OTHER; docker restart $ID; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/azureleader.py; docker exec $ID sed -ire 's,from azure.storage.table ,from azure.cosmosdb.table ,' /usr/bin/sakey.py ) &&
docker logs -f $(docker ps -a | grep init-azure | awk '{print $1}')
Thank you @djeeg, you saved my day!
Despite being closed, the issue is still not fixed on 18.04.0-ce
I've tried to create a new swarm using the template from https://docs.docker.com/docker-for-azure/
Expected behavior
A new swarm is created, and ready to use.
Actual behavior
A new swarm got created, but the logical swarm didn't form.
Information
Full output of the diagnostics from "docker-diagnose" run from one of the instances
Docker-diagnose session ID:
1520262540-jdnAcVkg6JSGbpQBqQyJw6WB30tuMT0z
Steps to reproduce the behavior
docker node ls
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.