aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/
Other
444 stars 148 forks source link

uninterruptible containers when neurons attached to it #887

Closed oOraph closed 4 months ago

oOraph commented 4 months ago

see:

https://github.com/oOraph/inf2_bug_report

killing a container takes ages, eventually it seems to terminate, see dmesg kernel trace at the end of the file showing that the system hangs on sth

See system setup: inf2.48xlarge with amazon Linux 2 on it integrated within an eks cluster, following the official doc installed packages (kernel version, containerd version..., provided as well)

-> really easy to repoduce, following the official doc (not sure kubernetes has anything to do with the issue :), it's more like a containerd+neuron device interrupts issue) but if you follow the eks doc (see ref file provided, no need for the custom scheduler since you will assign all node neurons to the same pod) then you will eventually reproduce it by spawning/deleting a pod)

jeffhataws commented 4 months ago

Thanks @oOraph for reporting the issue. We will take a look.

awscullinan commented 4 months ago

Relevant parts of dmesg:

[ 1695.831133] INFO: task containerd-shim:28325 blocked for more than 120 seconds.
[ 1695.866877]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1695.902978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1695.940591] task:containerd-shim state:D stack:    0 pid:28325 ppid:     1 flags:0x00004080
[ 1695.979660] Call Trace:
[ 1695.995188]  __schedule+0x1f9/0x660
[ 1696.013672]  ? srso_alias_return_thunk+0x5/0x7f
[ 1696.035109]  schedule+0x46/0xb0
[ 1696.052597]  wb_wait_for_completion+0x51/0x80
[ 1696.073549]  ? add_wait_queue_exclusive+0x70/0x70
[ 1696.095489]  __writeback_inodes_sb_nr+0xa2/0xd0
[ 1696.116948]  sync_filesystem.part.0+0x13/0x80
[ 1696.137921]  ovl_sync_fs+0x59/0x90 [overlay]
[ 1696.158637]  sync_filesystem.part.0+0x68/0x80
[ 1696.179595]  generic_shutdown_super+0x22/0x110
[ 1696.200807]  kill_anon_super+0x14/0x30
[ 1696.220033]  deactivate_locked_super+0x36/0xa0
[ 1696.241228]  cleanup_mnt+0x12d/0x190
[ 1696.259966]  task_work_run+0x5f/0x90
[ 1696.278709]  exit_to_user_mode_loop+0x118/0x120
[ 1696.300164]  exit_to_user_mode_prepare+0x6e/0x80
[ 1696.321867]  syscall_exit_to_user_mode+0x22/0x150
[ 1696.343799]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1696.366741] RIP: 0033:0x403ace
[ 1696.383986] RSP: 002b:000000c000491640 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6
[ 1696.420854] RAX: 0000000000000000 RBX: 000000c000506d80 RCX: 0000000000403ace
[ 1696.449738] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c000506d80
[ 1696.478610] RBP: 000000c000491680 R08: 0000000000000000 R09: 0000000000000000
[ 1696.507467] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
[ 1696.536323] R13: 000000000000001c R14: 000000c00021ad00 R15: 000000c000088400
[ 1696.565259] INFO: task exe:58011 blocked for more than 121 seconds.
[ 1696.591642]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1696.627521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1696.665106] task:exe             state:D stack:    0 pid:58011 ppid: 25583 flags:0x00000080
[ 1696.704206] Call Trace:
[ 1696.719713]  __schedule+0x1f9/0x660
[ 1696.738198]  ? srso_alias_return_thunk+0x5/0x7f
[ 1696.759637]  schedule+0x46/0xb0
[ 1696.777130]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1696.799319]  do_reconfigure_mnt+0x69/0x200
[ 1696.819518]  __x64_sys_mount+0x107/0x140
[ 1696.839221]  do_syscall_64+0x33/0x40
[ 1696.857939]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1696.880855] RIP: 0033:0x7f8f29524eda
[ 1696.899584] RSP: 002b:00007fff8cbadce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1696.936445] RAX: ffffffffffffffda RBX: 00007fff8cbadcf0 RCX: 00007f8f29524eda
[ 1696.965312] RDX: 00005645ca39a591 RSI: 00007fff8cbadcf0 RDI: 00005645ca39a591
[ 1696.994168] RBP: 00005645cb0ff280 R08: 00005645ca39a591 R09: 0000c00000000000
[ 1697.023026] R10: 0000000000001021 R11: 0000000000000246 R12: 00005645cb0ff26a
[ 1697.167311] R13: 0000000000000010 R14: 00005645cb0ff26a R15: 00005645cb0ff280
[ 1697.196183] INFO: task exe:58028 blocked for more than 122 seconds.
[ 1697.222570]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1697.258442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1697.296055] task:exe             state:D stack:    0 pid:58028 ppid: 27783 flags:0x00000080
[ 1697.335146] Call Trace:
[ 1697.350672]  __schedule+0x1f9/0x660
[ 1697.369147]  schedule+0x46/0xb0
[ 1697.386644]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1697.408829]  do_reconfigure_mnt+0x69/0x200
[ 1697.429049]  __x64_sys_mount+0x107/0x140
[ 1697.448775]  do_syscall_64+0x33/0x40
[ 1697.467498]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1697.490141] RIP: 0033:0x7fbe8caf4eda
[ 1697.508851] RSP: 002b:00007ffe3719cd48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1697.545727] RAX: ffffffffffffffda RBX: 00007ffe3719cd50 RCX: 00007fbe8caf4eda
[ 1697.574582] RDX: 000055850c79a591 RSI: 00007ffe3719cd50 RDI: 000055850c79a591
[ 1697.603453] RBP: 000055850e342280 R08: 000055850c79a591 R09: 0000c00000000000
[ 1697.632032] R10: 0000000000001021 R11: 0000000000000246 R12: 000055850e34226a
[ 1697.660863] R13: 0000000000000010 R14: 000055850e34226a R15: 000055850e342280
[ 1697.689706] INFO: task exe:58029 blocked for more than 122 seconds.
[ 1697.716108]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1697.751101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1697.788681] task:exe             state:D stack:    0 pid:58029 ppid: 27783 flags:0x00000080
[ 1697.827774] Call Trace:
[ 1697.843301]  __schedule+0x1f9/0x660
[ 1697.861776]  schedule+0x46/0xb0
[ 1697.879275]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1697.901461]  do_reconfigure_mnt+0x69/0x200
[ 1697.921669]  __x64_sys_mount+0x107/0x140
[ 1697.941389]  do_syscall_64+0x33/0x40
[ 1697.960121]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1697.983042] RIP: 0033:0x7f99ed0e3eda
[ 1698.001772] RSP: 002b:00007fff45cdd488 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1698.038636] RAX: ffffffffffffffda RBX: 00007fff45cdd490 RCX: 00007f99ed0e3eda
[ 1698.067500] RDX: 00005608ae99a591 RSI: 00007fff45cdd490 RDI: 00005608ae99a591
[ 1698.096362] RBP: 00005608aefea280 R08: 00005608ae99a591 R09: 0000c00000000000
[ 1698.125217] R10: 0000000000001021 R11: 0000000000000246 R12: 00005608aefea26a
[ 1698.154079] R13: 0000000000000010 R14: 00005608aefea26a R15: 00005608aefea280
[ 1698.182951] INFO: task exe:58045 blocked for more than 123 seconds.
[ 1698.209348]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1698.245353] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1698.283087] task:exe             state:D stack:    0 pid:58045 ppid: 25583 flags:0x00000080
[ 1698.322305] Call Trace:
[ 1698.337861]  __schedule+0x1f9/0x660
[ 1698.356410]  schedule+0x46/0xb0
[ 1698.373963]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1698.396264]  do_reconfigure_mnt+0x69/0x200
[ 1698.416534]  __x64_sys_mount+0x107/0x140
[ 1698.436301]  do_syscall_64+0x33/0x40
[ 1698.455084]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1698.478068] RIP: 0033:0x7f5bc10bbeda
[ 1698.496853] RSP: 002b:00007ffcbd5c6e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1698.533785] RAX: ffffffffffffffda RBX: 00007ffcbd5c6e50 RCX: 00007f5bc10bbeda
[ 1698.562710] RDX: 000055c6f519a591 RSI: 00007ffcbd5c6e50 RDI: 000055c6f519a591
[ 1698.591663] RBP: 000055c6f647f280 R08: 000055c6f519a591 R09: 0000c00000000000
[ 1698.620598] R10: 0000000000001021 R11: 0000000000000246 R12: 000055c6f647f26a
[ 1698.649446] R13: 0000000000000010 R14: 000055c6f647f26a R15: 000055c6f647f280
[ 1698.677984] INFO: task exe:58056 blocked for more than 123 seconds.
[ 1698.703921]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1698.739783] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1698.777329] task:exe             state:D stack:    0 pid:58056 ppid: 27361 flags:0x00000080
[ 1698.816500] Call Trace:
[ 1698.832027]  __schedule+0x1f9/0x660
[ 1698.850531]  schedule+0x46/0xb0
[ 1698.868063]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1698.890296]  do_reconfigure_mnt+0x69/0x200
[ 1698.910560]  __x64_sys_mount+0x107/0x140
[ 1698.930315]  do_syscall_64+0x33/0x40
[ 1698.949097]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1698.972093] RIP: 0033:0x7f9b03cb4eda
[ 1698.990860] RSP: 002b:00007ffc13faffc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1699.027792] RAX: ffffffffffffffda RBX: 00007ffc13faffd0 RCX: 00007f9b03cb4eda
[ 1699.056653] RDX: 000056372139a591 RSI: 00007ffc13faffd0 RDI: 000056372139a591
[ 1699.085587] RBP: 000056372270e280 R08: 000056372139a591 R09: 0000c00000000000
[ 1699.114497] R10: 0000000000001021 R11: 0000000000000246 R12: 000056372270e26a
[ 1699.143416] R13: 0000000000000010 R14: 000056372270e26a R15: 000056372270e280
[ 1699.172335] INFO: task exe:58102 blocked for more than 124 seconds.
[ 1699.198689]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1699.234717] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1699.272425] task:exe             state:D stack:    0 pid:58102 ppid: 25580 flags:0x00000080
[ 1699.311674] Call Trace:
[ 1699.327245]  __schedule+0x1f9/0x660
[ 1699.345771]  schedule+0x46/0xb0
[ 1699.363308]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1699.385549]  do_reconfigure_mnt+0x69/0x200
[ 1699.405811]  __x64_sys_mount+0x107/0x140
[ 1699.425376]  do_syscall_64+0x33/0x40
[ 1699.444166]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1699.467161] RIP: 0033:0x7f0229909eda
[ 1699.485932] RSP: 002b:00007ffe11616e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1699.522915] RAX: ffffffffffffffda RBX: 00007ffe11616e70 RCX: 00007f0229909eda
[ 1699.551874] RDX: 000055eff099a591 RSI: 00007ffe11616e70 RDI: 000055eff099a591
[ 1699.580815] RBP: 000055eff11d9280 R08: 000055eff099a591 R09: 0000c00000000000
[ 1699.609778] R10: 0000000000001021 R11: 0000000000000246 R12: 000055eff11d926a
[ 1699.638751] R13: 0000000000000010 R14: 000055eff11d926a R15: 000055eff11d9280
[ 1820.756866] INFO: task containerd-shim:28325 blocked for more than 245 seconds.
[ 1820.793619]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1820.830046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1820.868301] task:containerd-shim state:D stack:    0 pid:28325 ppid:     1 flags:0x00004080
[ 1820.908066] Call Trace:
[ 1820.923883]  __schedule+0x1f9/0x660
[ 1820.942632]  ? srso_alias_return_thunk+0x5/0x7f
[ 1820.964460]  schedule+0x46/0xb0
[ 1820.982211]  wb_wait_for_completion+0x51/0x80
[ 1821.003437]  ? add_wait_queue_exclusive+0x70/0x70
[ 1821.025637]  __writeback_inodes_sb_nr+0xa2/0xd0
[ 1821.047471]  sync_filesystem.part.0+0x13/0x80
[ 1821.068766]  ovl_sync_fs+0x59/0x90 [overlay]
[ 1821.089806]  sync_filesystem.part.0+0x68/0x80
[ 1821.111051]  generic_shutdown_super+0x22/0x110
[ 1821.132618]  kill_anon_super+0x14/0x30
[ 1821.152183]  deactivate_locked_super+0x36/0xa0
[ 1821.173648]  cleanup_mnt+0x12d/0x190
[ 1821.192638]  task_work_run+0x5f/0x90
[ 1821.211692]  exit_to_user_mode_loop+0x118/0x120
[ 1821.233506]  exit_to_user_mode_prepare+0x6e/0x80
[ 1821.255515]  syscall_exit_to_user_mode+0x22/0x150
[ 1821.277736]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1821.301038] RIP: 0033:0x403ace
[ 1821.318446] RSP: 002b:000000c000491640 EFLAGS: 00000206 ORIG_RAX: 00000000000000a6
[ 1821.355936] RAX: 0000000000000000 RBX: 000000c000506d80 RCX: 0000000000403ace
[ 1821.385220] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000c000506d80
[ 1821.414493] RBP: 000000c000491680 R08: 0000000000000000 R09: 0000000000000000
[ 1821.443761] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
[ 1821.473013] R13: 000000000000001c R14: 000000c00021ad00 R15: 000000c000088400
[ 1821.502214] INFO: task exe:58011 blocked for more than 246 seconds.
[ 1821.528791]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1821.565140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1821.602430] task:exe             state:D stack:    0 pid:58011 ppid: 25583 flags:0x00000080
[ 1821.638208] Call Trace:
[ 1821.652464]  __schedule+0x1f9/0x660
[ 1821.669365]  ? srso_alias_return_thunk+0x5/0x7f
[ 1821.689046]  schedule+0x46/0xb0
[ 1821.705113]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1821.725473]  do_reconfigure_mnt+0x69/0x200
[ 1821.744026]  __x64_sys_mount+0x107/0x140
[ 1821.762138]  do_syscall_64+0x33/0x40
[ 1821.779330]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1821.800310] RIP: 0033:0x7f8f29524eda
[ 1821.817509] RSP: 002b:00007fff8cbadce8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1821.851318] RAX: ffffffffffffffda RBX: 00007fff8cbadcf0 RCX: 00007f8f29524eda
[ 1821.877878] RDX: 00005645ca39a591 RSI: 00007fff8cbadcf0 RDI: 00005645ca39a591
[ 1821.904345] RBP: 00005645cb0ff280 R08: 00005645ca39a591 R09: 0000c00000000000
[ 1821.930886] R10: 0000000000001021 R11: 0000000000000246 R12: 00005645cb0ff26a
[ 1821.957398] R13: 0000000000000010 R14: 00005645cb0ff26a R15: 00005645cb0ff280
[ 1821.983912] INFO: task exe:58028 blocked for more than 246 seconds.
[ 1822.008167]       Tainted: G           OE     5.10.213-201.855.amzn2.x86_64 #1
[ 1822.043274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1822.190776] task:exe             state:D stack:    0 pid:58028 ppid: 27783 flags:0x00000080
[ 1822.226523] Call Trace:
[ 1822.240719]  __schedule+0x1f9/0x660
[ 1822.257639]  schedule+0x46/0xb0
[ 1822.273647]  rwsem_down_write_slowpath+0x24b/0x4c0
[ 1822.293949]  do_reconfigure_mnt+0x69/0x200
[ 1822.312452]  __x64_sys_mount+0x107/0x140
[ 1822.330472]  do_syscall_64+0x33/0x40
[ 1822.347551]  entry_SYSCALL_64_after_hwframe+0x62/0xc7
[ 1822.368477] RIP: 0033:0x7fbe8caf4eda
[ 1822.385559] RSP: 002b:00007ffe3719cd48 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[ 1822.419179] RAX: ffffffffffffffda RBX: 00007ffe3719cd50 RCX: 00007fbe8caf4eda
[ 1822.445523] RDX: 000055850c79a591 RSI: 00007ffe3719cd50 RDI: 000055850c79a591
[ 1822.471826] RBP: 000055850e342280 R08: 000055850c79a591 R09: 0000c00000000000
[ 1822.498160] R10: 0000000000001021 R11: 0000000000000246 R12: 000055850e34226a
[ 1822.524501] R13: 0000000000000010 R14: 000055850e34226a R15: 000055850e342280

Seems like all these hangs are in filesystem operations (mount or sync), not obviously Neuron releated.

geetasg commented 4 months ago

@oOraph can you please share the pod yaml sample ?

oOraph commented 4 months ago

some additional info provided:

oOraph commented 4 months ago

If I have time today I will make a test with and without neuron drivers on inf2 instances + with different instance types (with or without neurons) with this very same sample to try and get more info (about pods exec/terminating timings)

oOraph commented 4 months ago

Thanks for you reactivity. Closing: turns out I was wrong: not related to neurons at all, we only encountered this on inf2 instances because when downloading such volumes we usually don't use ebs to store ephemeral. But on inf2 instances we had no local ssds at our disposal. And the gp3 ebs was not tuned. The default iops and throughput provisioning were not enough for our usage.

So as you suggested and the dmesg showed it was more an fs/io issue

Could be good to provide some local ssds to inf2 instances at some point just like there are some for gpu instances :) (don't know if this is the correct place to request it though)

Thanks