cravler opened this issue 6 years ago (status: Open)
I'm having the same issue with my current stack on AWS. It started happening when the stack was upgraded to Docker for AWS 18.06.1-CE.
Same issue here. For us, cloudstor seems to break when using backing=shared.
We downgraded to 18.03 and it works well now!
Has anyone figured out how to fix this? It appears we're getting bitten by it now.
@leostarcevic how did you go about downgrading?
@mateodelnorte I basically just rolled back to the 18.03 AMI-IDs. I've been saving previous releases in our repository, because Docker only provides the latest release AFAIK. Let me know if you need help
Any update on solving this?
They only link the latest template from the site, but all versions are available in the bucket. The version that works for us is at https://editions-us-east-1.s3.amazonaws.com/aws/stable/18.03.0/Docker.tmpl
But the diff only adds the new EFSEncrypted condition, new instance types from the m5 and c5 families, and engine 18.06, so the breaking change may be there.
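For anyone who wants to pin to that 18.03.0 template rather than the latest one, something along these lines should work with the AWS CLI. Stack name and parameter values are placeholders, and the parameter keys are from memory; check the template itself before running:

aws cloudformation create-stack \
  --stack-name docker-swarm-18-03 \
  --template-url https://editions-us-east-1.s3.amazonaws.com/aws/stable/18.03.0/Docker.tmpl \
  --capabilities CAPABILITY_IAM \
  --parameters ParameterKey=KeyName,ParameterValue=my-keypair \
               ParameterKey=ManagerSize,ParameterValue=3 \
               ParameterKey=ClusterSize,ParameterValue=3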
FYI, it still doesn't work with the latest version, 18.09.2: https://editions-us-east-1.s3.amazonaws.com/aws/stable/18.09.2/Docker-no-vpc.tmpl
It's a shame they don't bother with even a simple response so we know where we stand. Really, something like 6 months without anything? Time to switch to rexray, period.
Anyone find a solution to this yet? I mounted the host log directory inside a container and didn't see anything particularly meaningful (lots of timeouts). I'd really like not to be vulnerable to CVE-2019-5736... I thought this template was supposed to be "baked and tested..."
From the kernel logs:
Mar 31 01:47:13 moby kernel: INFO: task portainer:4823 blocked for more than 120 seconds.
Mar 31 01:47:13 moby kernel:       Not tainted 4.9.114-moby #1
Mar 31 01:47:13 moby kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 31 01:47:13 moby kernel: portainer       D    0  4823   4786 0x00000100
Mar 31 01:47:13 moby kernel:  00000000000190c0 0000000000000000 ffffa02c63a637c0 ffffa02c734821c0
Mar 31 01:47:13 moby kernel:  ffffa02c63a80d00 ffffa02c762190c0 ffffffff8a83caf6 0000000000000002
Mar 31 01:47:13 moby kernel:  ffffa02c63a80d00 ffffc1f4412dfce0 7fffffffffffffff 0000000000000002
Mar 31 01:47:13 moby kernel: Call Trace:
Mar 31 01:47:13 moby kernel: [
Digging in a little further, I tried spinning up yet another brand new stack with the "Encrypt EFS" option turned on. Still no love. Also, it looks like I can mount the EFS volume (and see/inspect its contents) on a manager node that isn't trying to run a container that requires access to the volume. Any such interaction from a manager node that is trying to run a container with that volume mapped hangs, and that container is completely unresponsive.
So there doesn't appear to be anything wrong with EFS itself. Also, containers that don't rely on EFS work just fine. It seems like the plugin is at fault here. Does anyone know whether the code for the plugin is available anywhere?
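I haven't found published source either, but you can at least poke at the installed plugin on a node. A sketch, based on my understanding of how managed plugins are stored (and assuming you can reach the host filesystem, e.g. via a privileged container with / bind-mounted):

PLUGIN_ID=$(docker plugin inspect --format '{{.Id}}' cloudstor:aws)
ls /var/lib/docker/plugins/$PLUGIN_ID/rootfs/                        # the plugin's filesystem, including its binary
docker plugin inspect --format '{{json .Settings}}' cloudstor:aws    # env, mounts and devices it was installed with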
@jderusse @paullj1 What were your test cases? Can you provide the number of files and directories?
I'm trying out the Docker 18.09.2 AMIs with T3 instances, and I've created files from 1MB up to 1GB with Cloudstor/EFS and can't see any problems. The swarm consists of 3 managers and 3 workers.
Thanks for looking into this. I actually never got any data written to the volume... just a few directories and empty files created. I stood up a brand new stack with three manager nodes, and one worker node. Then I created an empty volume. Then I tried to start the default portainer stack (as well as a few other various services). The container apparently created a few directories and empty files, but otherwise hung indefinitely. Any subsequent attempts to interact with that volume hang indefinitely. I could go to another node, and see the volume.
Thoughts?
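For anyone else trying to reproduce this without Portainer in the mix, here's a stripped-down sketch of the same test (service and volume names are placeholders):

docker volume create -d "cloudstor:aws" --opt backing=shared testvol
docker service create --name cloudstor-test \
  --mount type=volume,source=testvol,target=/data,volume-driver=cloudstor:aws \
  alpine sh -c "dd if=/dev/urandom of=/data/test.bin bs=1M count=10 && sleep 3600"
docker service ps cloudstor-test    # the write is expected to hang, matching the behaviour described above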
Hi, this isn't working for me either, with a slightly different error than others have mentioned:
create scaling_mysql_data: VolumeDriver.Create: EFS support necessary for backing type: "shared"
I have created the cluster with the appropriate EFS setting:
/home/docker # docker plugin ls
ID             NAME            DESCRIPTION                       ENABLED
f16ca966fda3   cloudstor:aws   cloud storage plugin for Docker   true
And specified the proper mount config in a compose file:
volumes:
  mysql_data:
    driver: "cloudstor:aws"
    driver_opts:
      backing: shared
This is happening with the latest version:
/home/docker # docker info
Containers: 7
Running: 4
Paused: 0
Stopped: 3
Images: 6
Server Version: 18.09.2
Storage Driver: overlay2
Backing Filesystem: tmpfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: d7q22xogoz6jk4rw5v9ps3t2l
Is Manager: true
ClusterID: t903fvdnxiwtk1xvs2xacg7g6
Managers: 1
Nodes: 6
Default Address Pool: 10.0.0.0/8
SubnetSize: 24
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 172.31.28.163
Manager Addresses:
172.31.28.163:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.114-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.2GiB
Name: ip-172-31-28-163.us-west-1.compute.internal
ID: VKSX:YNVB:V3QQ:4W7F:FLOP:GUWZ:2MFB:LRWM:B5F4:6RQA:ABA7:56CS
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
os=linux
region=us-west-1
availability_zone=us-west-1c
instance_type=m5.xlarge
node_type=manager
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
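Since that error is specifically about EFS support, it may be worth checking whether the plugin on the node was actually installed with EFS enabled. A sketch; the env variable names come from the documented manual install of cloudstor, so treat them as an assumption for the template-installed plugin:

docker plugin inspect --format '{{json .Settings.Env}}' cloudstor:aws
# look for EFS_SUPPORTED=1 and non-empty EFS_ID_REGULAR / EFS_ID_MAXIO values;
# without EFS support enabled, backing=shared volumes cannot be created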
I'm seeing this issue too.
My test setup is simple:
version: "3.7"
services:
test:
image: alpine
command: "sh -c 'sleep 900'"
volumes:
- teststorage:/mnt
deploy:
restart_policy:
condition: none
volumes:
teststorage:
driver: "cloudstor:aws"
driver_opts:
backing: "shared"
I then exec into the running container and try dd if=/dev/urandom of=/mnt/test.file bs=1M count=1
It will hang. Syslog reveals some information, and I think I have a line from the NFS module that was missing from the earlier reports:
Jun 20 02:00:01 moby syslogd 1.5.1: restart.
Jun 20 02:03:49 moby kernel: nfs: <<my EFS DNS name>> not responding, still trying
Jun 20 02:04:05 moby kernel: INFO: task dd:7813 blocked for more than 120 seconds.
Jun 20 02:04:05 moby kernel: Not tainted 4.9.114-moby #1
Jun 20 02:04:05 moby kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 02:04:05 moby kernel: dd D 0 7813 7188 0x00000100
Jun 20 02:04:05 moby kernel: 00000000000190c0 0000000000000000 ffff9ce08fbdb7c0 ffff9ce0a5dd8100
Jun 20 02:04:05 moby kernel: ffff9ce0a30ea040 ffff9ce0b62190c0 ffffffff8d83caf6 0000000000000002
Jun 20 02:04:05 moby kernel: ffff9ce0a30ea040 ffffc1064117bce0 7fffffffffffffff 0000000000000002
Jun 20 02:04:05 moby kernel: Call Trace:
Jun 20 02:04:05 moby kernel: [<ffffffff8d83caf6>] ? __schedule+0x35f/0x43d
Jun 20 02:04:05 moby kernel: [<ffffffff8d83cf26>] ? bit_wait+0x2a/0x2a
Jun 20 02:04:05 moby kernel: [<ffffffff8d83cc52>] ? schedule+0x7e/0x87
Jun 20 02:04:05 moby kernel: [<ffffffff8d83e8de>] ? schedule_timeout+0x43/0x101
Jun 20 02:04:05 moby kernel: [<ffffffff8d019808>] ? xen_clocksource_read+0x11/0x12
Jun 20 02:04:05 moby kernel: [<ffffffff8d12e281>] ? timekeeping_get_ns+0x19/0x2c
Jun 20 02:04:05 moby kernel: [<ffffffff8d83c739>] ? io_schedule_timeout+0x99/0xf7
Jun 20 02:04:05 moby kernel: [<ffffffff8d83c739>] ? io_schedule_timeout+0x99/0xf7
Jun 20 02:04:05 moby kernel: [<ffffffff8d83cf3d>] ? bit_wait_io+0x17/0x34
Jun 20 02:04:05 moby kernel: [<ffffffff8d83d009>] ? __wait_on_bit+0x48/0x76
Jun 20 02:04:05 moby kernel: [<ffffffff8d19e758>] ? wait_on_page_bit+0x7c/0x96
Jun 20 02:04:05 moby kernel: [<ffffffff8d10f99e>] ? autoremove_wake_function+0x35/0x35
Jun 20 02:04:05 moby kernel: [<ffffffff8d19e842>] ? __filemap_fdatawait_range+0xd0/0x12b
Jun 20 02:04:05 moby kernel: [<ffffffff8d19e8ac>] ? filemap_fdatawait_range+0xf/0x23
Jun 20 02:04:05 moby kernel: [<ffffffff8d1a060c>] ? filemap_write_and_wait_range+0x3a/0x4f
Jun 20 02:04:05 moby kernel: [<ffffffff8d2bcf98>] ? nfs_file_fsync+0x54/0x187
Jun 20 02:04:05 moby kernel: [<ffffffff8d1f6c4d>] ? filp_close+0x39/0x66
Jun 20 02:04:05 moby kernel: [<ffffffff8d1f6c99>] ? SyS_close+0x1f/0x47
Jun 20 02:04:05 moby kernel: [<ffffffff8d0033b7>] ? do_syscall_64+0x69/0x79
Jun 20 02:04:05 moby kernel: [<ffffffff8d83f64e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
This is a single-node (manager-only) deployment for testing, although I've created a similar larger-scale setup and seen it there as well (which is how I ran into the problem in the first instance). It's running in ap-southeast-1, on a modified template, because the last template update was just after EFS was released into the AP region. I will, when I have time, see if I can replicate the behaviour in another region.
I wonder if this is related to mount options such as noresvport not being set? More info here: https://forums.aws.amazon.com/message.jspa?messageID=812356#882043. I cannot see the mount options used by the opaque cloudstor:aws plugin, so it's hard to say.
Given this issue has been open a long time, if the developers aren't able to support it, perhaps they should consider open sourcing it instead, or at least indicate if EE is similarly affected?
edit: Just to add a bit more information. I can write varying amounts of data before it dies, even with conv=fsync set with dd.
Also, I have found the mount options in the log:
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.31.1.178,local_lock=none,addr=172.31.5.191
I note that noresvport isn't there, but I remain unsure whether it has anything to do with the issue. A working theory would be that a reconnect event takes place to handle the write load, but that makes a big assumption about how and when EFS does that sort of thing.
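If anyone else wants to check their nodes without digging through logs, the options should also be visible in the host mount table. A quick sketch, assuming the EFS mount shows up in the host's mount namespace:

grep -E 'nfs4?' /proc/mounts    # the options column shows whether noresvport is present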
Wow, I’m glad I’m not the only one seeing this. I don’t think it’s one of the mount options... I’ve played pretty extensively with those. I have been able to get an EFS volume mounted, and create directories, and files in it with no issues. I’ve even been able to add small contents to files (echo “Hello world” > .keep). The problem seems to come in when you write lots of data... or maybe it’s binary data causing the issue?
Hi @paullj1,
OK that's interesting. Thanks for the extra data points.
How did you test the mount options? In my view it's not possible to tweak how cloudstor mounts the EFS volume it will attach to the container. If you tested those options separately it might not be a fair comparison.
I specified them as mount options in the compose file, along the lines of https://forums.docker.com/t/how-to-mount-nfs-drive-in-container-simplest-way/46699 (which I believe is essentially all the cloudstor plugin does, but obviously I can't confirm that since the source is nowhere to be found). Understood it may not be a totally fair comparison, but in each case the volumes showed up as cloudstor volumes when I did a volume list... Also, if the options specified by the cloudstor plugin don't work at all, I'm not sure how else to troubleshoot.
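For concreteness, the kind of volume definition I mean looks roughly like this; the filesystem ID and region are placeholders, and this bypasses cloudstor entirely by using the built-in local driver's NFS support:

volumes:
  efstest:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=fs-XXXXXXXX.efs.us-west-1.amazonaws.com,rw,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"
      device: ":/"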
We are having the exact same issue with ECS mounting EFS volumes. It looks like the mount fails/recovers intermittently, which causes containers that mount the EFS volume to fail with the following error:
Jul 4 14:41:58 ip-10-84-209-173 kernel: INFO: task java:12311 blocked for more than 120 seconds.
Jul 4 14:41:58 ip-10-84-209-173 kernel: Not tainted 4.14.123-111.109.amzn2.x86_64 #1
Jul 4 14:41:58 ip-10-84-209-173 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 4 14:41:58 ip-10-84-209-173 kernel: java D 0 12311 12210 0x00000184
Jul 4 14:41:58 ip-10-84-209-173 kernel: Call Trace:
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? __schedule+0x28e/0x890
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? __switch_to_asm+0x41/0x70
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? __switch_to_asm+0x35/0x70
Jul 4 14:41:58 ip-10-84-209-173 kernel: schedule+0x28/0x80
Jul 4 14:41:58 ip-10-84-209-173 kernel: io_schedule+0x12/0x40
Jul 4 14:41:58 ip-10-84-209-173 kernel: __lock_page+0x115/0x160
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? page_cache_tree_insert+0xc0/0xc0
Jul 4 14:41:58 ip-10-84-209-173 kernel: nfs_vm_page_mkwrite+0x212/0x280 [nfs]
Jul 4 14:41:58 ip-10-84-209-173 kernel: do_page_mkwrite+0x31/0x90
Jul 4 14:41:58 ip-10-84-209-173 kernel: do_wp_page+0x223/0x540
Jul 4 14:41:58 ip-10-84-209-173 kernel: __handle_mm_fault+0xa1c/0x12b0
Jul 4 14:41:58 ip-10-84-209-173 kernel: handle_mm_fault+0xaa/0x1e0
Jul 4 14:41:58 ip-10-84-209-173 kernel: __do_page_fault+0x23e/0x4c0
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? async_page_fault+0x2f/0x50
Jul 4 14:41:58 ip-10-84-209-173 kernel: async_page_fault+0x45/0x50
Jul 4 14:41:58 ip-10-84-209-173 kernel: RIP: 2b78a3d8:0x7fa624078000
Jul 4 14:41:58 ip-10-84-209-173 kernel: RSP: 2400a660:00007fa6145fbb00 EFLAGS: 00000000
Jul 4 14:41:58 ip-10-84-209-173 kernel: INFO: task java:12315 blocked for more than 120 seconds.
Jul 4 14:41:58 ip-10-84-209-173 kernel: Not tainted 4.14.123-111.109.amzn2.x86_64 #1
Jul 4 14:41:58 ip-10-84-209-173 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 4 14:41:58 ip-10-84-209-173 kernel: java D 0 12315 12210 0x00000184
Jul 4 14:41:58 ip-10-84-209-173 kernel: Call Trace:
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? __schedule+0x28e/0x890
Jul 4 14:41:58 ip-10-84-209-173 kernel: schedule+0x28/0x80
Jul 4 14:41:58 ip-10-84-209-173 kernel: io_schedule+0x12/0x40
Jul 4 14:41:58 ip-10-84-209-173 kernel: __lock_page+0x115/0x160
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? page_cache_tree_insert+0xc0/0xc0
Jul 4 14:41:58 ip-10-84-209-173 kernel: nfs_vm_page_mkwrite+0x212/0x280 [nfs]
Jul 4 14:41:58 ip-10-84-209-173 kernel: do_page_mkwrite+0x31/0x90
Jul 4 14:41:58 ip-10-84-209-173 kernel: do_wp_page+0x223/0x540
Jul 4 14:41:58 ip-10-84-209-173 kernel: __handle_mm_fault+0xa1c/0x12b0
Jul 4 14:41:58 ip-10-84-209-173 kernel: handle_mm_fault+0xaa/0x1e0
Jul 4 14:41:58 ip-10-84-209-173 kernel: __do_page_fault+0x23e/0x4c0
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? async_page_fault+0x2f/0x50
Jul 4 14:41:58 ip-10-84-209-173 kernel: async_page_fault+0x45/0x50
Jul 4 14:41:58 ip-10-84-209-173 kernel: RIP: 240b7800:0x7fa5fc15da00
Jul 4 14:41:58 ip-10-84-209-173 kernel: RSP: 240b75a0:00007fa6141f7af0 EFLAGS: 7fa62b74ead8
Jul 4 14:41:58 ip-10-84-209-173 kernel: INFO: task java:12316 blocked for more than 120 seconds.
Jul 4 14:41:58 ip-10-84-209-173 kernel: Not tainted 4.14.123-111.109.amzn2.x86_64 #1
Jul 4 14:41:58 ip-10-84-209-173 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 4 14:41:58 ip-10-84-209-173 kernel: java D 0 12316 12210 0x00000184
Jul 4 14:41:58 ip-10-84-209-173 kernel: Call Trace:
Jul 4 14:41:58 ip-10-84-209-173 kernel: ? __schedule+0x28e/0x890
Docker version 18.06.1-ce, build e68fc7a215d7133c34aa18e3b72b4a21fd0c6136
@paullj1 I suspect the cloudstor driver might not treat your mount options the same way, but I'm not sure. If those options get ignored, then all bets are off.
@serkanh if you get intermittent errors, it might still be explained by my hunch. Or it might not.
I'd consider offering a bounty for this, but there's little point when the only people with access to the code don't seem to even look at their issues lists...
@paullj1 I'm sorry I re-read your message and see you were using pure NFS on a local mount. I may also do some experiments along those lines when I get a chance.
@stevekerrison, no worries! There has to be a combination of options that works; I just haven't found it yet. Once those options are found, I suspect they're the only thing that will need to change for the Cloudstor plugin to work.
@serkanh, I see the same thing in my logs (syslog and dmesg). It's not that it's failing intermittently; it's that the kernel periodically updates you on its failure to mount the share. Since mounting a disk mostly happens in kernel space, the kernel is letting you know that it has a hung task. Those messages should appear every 2 minutes.
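(If you're curious, you can confirm the watchdog interval on a node; it lines up with the 120 seconds in the messages:)

cat /proc/sys/kernel/hung_task_timeout_secs    # 120, matching the "blocked for more than 120 seconds" lines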
I ran a test similar to yours and got the same failures. I created a local NFS mount, using docker-compose in swarm mode, attached to the EFS volume that's supposed to be used by Cloudstor. I used these options:
nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport
What I did notice was that upon creating a new directory, it appeared in my swarm's docker volume ls as a cloudstor:aws volume (the volumes are just subdirectories of the EFS volume). In fact, if you inspect EFS cloudstor mounts you'll see they go into /mnt/efs/{mode}/{name}, where {mode} differentiates between regular and maxIO.
So I suspect that some part of the cloudstor plugin is interfering with my NFS mount. I'd be interested to see how the system handles NFS-mounted EFS volumes if cloudstor's EFS support is disabled. Alas, I don't know if the cloudformation without EFS will include the NFS drivers or not, as I've not dug that deep.
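A quick way to see that mapping for a given volume, assuming docker volume inspect reports the path returned by the plugin (a sketch; the volume name is just an example):

docker volume inspect --format '{{.Driver}} {{.Mountpoint}}' teststorage
# expect something under /mnt/efs/{mode}/{name} for an EFS-backed cloudstor volume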
Yup. I see the same thing. I have taken it further and deployed a swarm without EFS/Cloudstor support, made a manual EFS volume, then mounted it like you describe and had the same issues. So, can confirm it isn’t Cloudstor messing anything up. I suspect it’s just the EFS options. We’ve got to find which options cause it to hang.
More testing... it doesn't look like it's the options. I looked at the EFS options from one of my other swarms (using an older template where Cloudstor actually works), and they're identical. The delta might be that they added the "encryption" option? Maybe that's causing issues? To recap:
- Small writes work (e.g. echo 'asdf' > asdf)
- Larger writes hang (e.g. dd if=/dev/urandom of=./test.bin count=10 bs=1M)
Expected behavior
Copying data to the volume should work.
Actual behavior
Copying data to the volume freezes the stack, and only a restart helps.
Information
yes
Steps to reproduce the behavior