ClusterHQ / flocker

Container data volume manager for your Dockerized application
https://clusterhq.com
Apache License 2.0

Volume Created/Attached but docker command ends with error #2814

Open gufranmmu opened 8 years ago

gufranmmu commented 8 years ago

When I execute docker run -ti -v test-vol:/data --volume-driver flocker busybox sh, I can see from the cloud GUI that a new flocker volume has been created and attached to the correct instance, but the docker command ends with the following error on the first run:

Error response from daemon: Cannot start container 255544d3cf494a33038713d26ef55d5ffc45ddfd38fe8f66e72bcf76a4ae4a23: ConnectingCancelledError: IPv4Address(TCP, 'mesos-master-1', 4523)

and on the second try of docker run -ti -v test-vol:/data --volume-driver flocker busybox sh I see this error:

Error response from daemon: Cannot start container 795a0aa3734d9d566593a923660d8c1334b7b6be0cdab70aacb6ad4d5a03aa0c: TimeoutError: User timeout caused connection failure.
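For anyone reproducing this, the volume can also be double-checked on the Docker side; something along these lines should work, assuming a Docker version whose volume commands support the driver filter (output fields may vary):

    # list volumes created through the flocker driver
    docker volume ls --filter driver=flocker
    # show driver, name and mountpoint for the volume used above
    docker volume inspect test-vol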

Here are logs related to mounting:

Jun  6 13:24:42 mesos-slave-1 flocker-docker-plugin: {"task_uuid": "65bbce29-3cc5-4db5-8bc0-c333f91d168c", "task_level": [1], "action_type": "api:request", "timestamp": 1465208682.49836, "method": "POST", "action_status": "started", "request_path": "/VolumeDriver.Mount"}
Jun  6 13:25:40 mesos-slave-1 flocker-docker-plugin: {"task_uuid": "fa855656-2d21-47dc-b582-edb56ca02d0a", "error": false, "timestamp": 1465208740.786148, "message": "\"-\" - - [06/Jun/2016:10:25:40 +0000] \"POST /VolumeDriver.Mount HTTP/1.1\" 200 64 \"-\" \"Go 1.1 package http\"", "message_type": "twisted:log", "task_level": [1]}
Jun  6 13:25:40 mesos-slave-1 kernel: XFS (dm-6): Unmounting Filesystem
Jun  6 13:25:40 mesos-slave-1 flocker-docker-plugin: {"task_uuid": "f3b00a0a-07e9-435e-956d-d97a7323da69", "task_level": [1], "action_type": "api:request", "timestamp": 1465208740.878587, "method": "POST", "action_status": "started", "request_path": "/VolumeDriver.Unmount"}
Jun  6 13:25:40 mesos-slave-1 flocker-docker-plugin: {"task_uuid": "393ab4ac-8753-47cf-b2f0-e0c39252a775", "error": false, "timestamp": 1465208740.879373, "message": "\"-\" - - [06/Jun/2016:10:25:40 +0000] \"POST /VolumeDriver.Unmount HTTP/1.1\" 200 11 \"-\" \"Go 1.1 package http\"", "message_type": "twisted:log", "task_level": [1]}
wallnerryan commented 8 years ago

@gufranmmu can you run flocker-diagnostics and send them to support@clusterhq.com and CC ryan.wallner@clusterhq.com?

Thanks

gufranmmu commented 8 years ago

Done! Waiting for your reply now :)

wallnerryan commented 8 years ago

Hi, what I am actually seeing is that flocker is trying to attach a volume at sdf (or vda/xvda) and cannot. This is related to https://github.com/ClusterHQ/flocker/issues/2772

 flocker:node:agents:blockdevice:aws@1/started
    |-- count: 15303
    |-- operation: [u'_attach_ebs_volume', [u'vol-4171e8c8', u'i-0cbab78c', u'/dev/sdf'], {}]
    `-- timestamp: 2016-06-06 14:28:33.654007
    +-- flocker:node:agents:blockdevice:aws:boto_logs@2
        |-- message: Calling ec2:attach_volume with {'InstanceId': u'i-0cbab78c', 'Device': u'/dev/sdf', u'VolumeId': u'vol-4171e8c8'}
        `-- timestamp: 2016-06-06 14:28:33.656907
    +-- flocker:node:agents:blockdevice:aws@3/failed
        |-- aws_code: InvalidParameterValue
        |-- aws_message: Already have a device attached to: /dev/sdf
        |-- aws_request_id: 4392e7a2-d47a-4776-9385-ccb660a2002f
        |-- exception: botocore.exceptions.ClientError
        |-- reason: An error occurred (InvalidParameterValue) when calling the AttachVolume operation: Already have a device attached to: /dev/sdf
        `-- timestamp: 2016-06-06 14:28:33.687193

b38b8eb0-98f7-4162-8f23-51ce840554a7
+-- eliot:traceback@1
    |-- exception: flocker.node.agents.ebs.AttachFailed
    |-- reason: (u'vol-4171e8c8', u'i-0cbab78c', u'/dev/sdf')
    |-- timestamp: 2016-06-06 14:28:33.687890
    `-- traceback: Traceback: <class 'flocker.node.agents.ebs.AttachFailed'>: (u'vol-4171e8c8', u'i-0cbab78c', u'/dev/sdf')
        /usr/lib64/python2.7/threading.py:811:__bootstrap_inner
        /usr/lib64/python2.7/threading.py:764:run
        /opt/flocker/lib/python2.7/site-packages/twisted/_threads/_threadworker.py:46:work
        /opt/flocker/lib/python2.7/site-packages/twisted/_threads/_team.py:190:doWork
        --- <exception caught here> ---
        /opt/flocker/lib/python2.7/site-packages/twisted/python/threadpool.py:246:inContext
        /opt/flocker/lib/python2.7/site-packages/twisted/python/threadpool.py:262:<lambda>
        /opt/flocker/lib/python2.7/site-packages/twisted/python/context.py:118:callWithContext
        /opt/flocker/lib/python2.7/site-packages/twisted/python/context.py:81:callWithContext
        /opt/flocker/lib/python2.7/site-packages/flocker/node/agents/ebs.py:1296:attach_volume

This attachment error on the dataset agent causes the flocker docker plugin to time out waiting for the Path response.

21bd1d9b-00d0-4f89-91bc-1de4e2654c3c
+-- api:request@1/started
    |-- method: POST
    |-- request_path: /VolumeDriver.Path
    `-- timestamp: 2016-06-06 11:32:14.043872
    +-- flocker:apiclient:http_request@3,1/started
        |-- method: GET
        |-- request_body: None
        |-- timestamp: 2016-06-06 11:32:14.044293
        `-- url: https://mesos-master-1:4523/v1/configuration/datasets
        +-- twisted:log@3,3
            |-- error: False
            |-- message: Starting factory <twisted.web.client._HTTP11ClientFactory instance at 0x5e33e18>
            `-- timestamp: 2016-06-06 11:32:14.047729
        +-- flocker:apiclient:http_request@3,4/failed
            |-- exception: twisted.internet.error.TimeoutError
            |-- reason: User timeout caused connection failure.
            `-- timestamp: 2016-06-06 11:32:44.048563
    +-- api:request@4/succeeded
        `-- timestamp: 2016-06-06 11:32:44.049087

22c7e3cd-012f-498a-b9d6-138339a6066b
+-- eliot:traceback@1
    |-- exception: twisted.internet.error.TimeoutError
    |-- reason: User timeout caused connection failure.
    |-- timestamp: 2016-06-06 11:32:44.048840
    `-- traceback: Traceback (failure with no frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused connection failure.
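For completeness, the failed GET to the control service seen above can be retried by hand to rule out a plain connectivity problem between the plugin node and mesos-master-1; the certificate paths below assume the default plugin certificate locations, so adjust them if your setup differs:

    curl --cacert /etc/flocker/cluster.crt \
         --cert /etc/flocker/plugin.crt --key /etc/flocker/plugin.key \
         https://mesos-master-1:4523/v1/configuration/datasets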

So, if the root cause of the error, which is the pre-existing sdf block device attachment, is fixed, you should be all set.
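If it helps to confirm that, what EC2 currently reports as attached at /dev/sdf on that node can be queried with the AWS CLI (instance ID taken from the log above; this is just one way to narrow the output):

    aws ec2 describe-volumes \
        --filters Name=attachment.instance-id,Values=i-0cbab78c \
                  Name=attachment.device,Values=/dev/sdf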

wallnerryan commented 8 years ago

Tracking - https://clusterhq.atlassian.net/browse/FLOC-4244

gufranmmu commented 8 years ago

@wallnerryan Thank you very much for investigating the issue! I cannot see the bug on the ClusterHQ Jira board. Has it been created?

Could you please also tell us what the solution for this is, and when we can expect it to be included in a flocker release? There is no /dev/sdf disk on my instances.

wallnerryan commented 8 years ago

@gufranmmu one solution is to manually attach an EBS volume to sdf so flocker does not choose sdf as a valid option. The other is to contact AWS and ask why sdf is hidden. Even though there is no sdf on your host, AWS reserves it for use on some nodes and hides it from its users, see https://github.com/ClusterHQ/flocker/issues/2772. Using a different instance type sometimes resolves this. We may look to fix this by not allowing flocker to choose sdf. @wallrj may be able to comment on a suggestion.
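A rough sketch of the first workaround (the availability zone and volume ID below are placeholders, and on instances where /dev/sdf is genuinely reserved the manual attach may be rejected with the same error, which is where the AWS support route comes in):

    # create a small throwaway volume in the same AZ as the node
    aws ec2 create-volume --size 1 --availability-zone <your-az>
    # occupy /dev/sdf so flocker never picks that device name
    aws ec2 attach-volume --volume-id vol-XXXXXXXX --instance-id i-0cbab78c --device /dev/sdf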

gufranmmu commented 8 years ago

@wallnerryan the FLOC-4244 issue no longer seems to be available in the flocker issue tracker?

wallnerryan commented 8 years ago

It is, it's just private now.