Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Cannot figure the cause of code failure due to Idle Shutdown enabled #28182

Closed monajalal closed 1 year ago

monajalal commented 1 year ago

I ran training code on an Azure compute node with 4 GPUs. The input data was read from an Azure Datastore (blob storage), and the output data was written to the same blob storage.

My code ran for 11 epochs and then stopped. Because I had enabled idle shutdown, the node shut down after it had been idle for 1 hour.

I ran the code inside a tmux session and expected it to run for 60 epochs. Because the node shut down, I cannot identify the source of the error. Is there any way you could help me find the cause of the failure?

$ lsb_release -a
LSB Version:    core-11.1.0ubuntu2-noarch:security-11.1.0ubuntu2-noarch
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:    20.04
Codename:   focal
$ uname -a
Linux mona-4gpu 5.15.0-1022-azure #27~20.04.1-Ubuntu SMP Mon Oct 17 02:03:50 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I am running the exact same training code with the same data on a local server in a tmux session without any problem, and it is on epoch 30 now.

Train Epoch: 30 [0/50000 (0%)]  Loss: 0.002366828266531
Train Epoch: 30 [800/50000 (2%)]        Loss: 0.001592643908225
Train Epoch: 30 [1600/50000 (3%)]       Loss: 0.001973111648113
Train Epoch: 30 [2400/50000 (5%)]       Loss: 0.001397836604156
Train Epoch: 30 [3200/50000 (6%)]       Loss: 0.001601065509021
Train Epoch: 30 [4000/50000 (8%)]       Loss: 0.001928352983668
Train Epoch: 30 [4800/50000 (10%)]      Loss: 0.001779678394087
Train Epoch: 30 [5600/50000 (11%)]      Loss: 0.001772143295966
Train Epoch: 30 [6400/50000 (13%)]      Loss: 0.002132230903953
Train Epoch: 30 [7200/50000 (14%)]      Loss: 0.002036319579929
Train Epoch: 30 [8000/50000 (16%)]      Loss: 0.001871001324616
Train Epoch: 30 [8800/50000 (18%)]      Loss: 0.001188512775116
Train Epoch: 30 [9600/50000 (19%)]      Loss: 0.001497990451753
Train Epoch: 30 [10400/50000 (21%)]     Loss: 0.002008053474128
Train Epoch: 30 [11200/50000 (22%)]     Loss: 0.001462107058614
Train Epoch: 30 [12000/50000 (24%)]     Loss: 0.002120621968061
Train Epoch: 30 [12800/50000 (26%)]     Loss: 0.001964461756870
Train Epoch: 30 [13600/50000 (27%)]     Loss: 0.002503014635295
Train Epoch: 30 [14400/50000 (29%)]     Loss: 0.001460853964090
Train Epoch: 30 [15200/50000 (30%)]     Loss: 0.001815167139284
Train Epoch: 30 [16000/50000 (32%)]     Loss: 0.001704273396172
Train Epoch: 30 [16800/50000 (34%)]     Loss: 0.002081827027723

Here are some messages I see in the tail of dmesg:

[    3.437240] evm: security.capability
[    3.439320] evm: HMAC attrs: 0x1
[    3.442087] PM:   Magic number: 7:309:436
[    3.444854] memory memory2777: hash matches
[    3.448244] RAS: Correctable Errors collector initialized.
[    3.453116] md: Waiting for all devices to be available before autodetect
[    3.457023] md: If you don't use raid, use raid=noautodetect
[    3.460307] md: Autodetecting RAID arrays.
[    3.462709] md: autorun ...
[    3.464498] md: ... autorun DONE.
[    3.519973] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[    3.524495] VFS: Mounted root (ext4 filesystem) readonly on device 8:17.
[    3.527969] devtmpfs: mounted
[    3.530889] Freeing unused decrypted memory: 2036K
[    3.534253] Freeing unused kernel image (initmem) memory: 2960K
[    3.543211] Write protecting the kernel read-only data: 30720k
[    3.546836] Freeing unused kernel image (text/rodata gap) memory: 2036K
[    3.550759] Freeing unused kernel image (rodata/data gap) memory: 1956K
[    3.601553] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[    3.605409] x86/mm: Checking user space page tables
[    3.650538] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[    3.654287] Run /sbin/init as init process
[    3.656644]   with arguments:
[    3.656646]     /sbin/init
[    3.656647]   with environment:
[    3.656648]     HOME=/
[    3.656649]     TERM=linux
[    3.656649]     BOOT_IMAGE=/boot/vmlinuz-5.15.0-1022-azure
[    8.207881] systemd[1]: Inserted module 'autofs4'
[    8.473363] systemd[1]: systemd 245.4-4ubuntu3.18 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
[    8.486132] systemd[1]: Detected virtualization microsoft.
[    8.489399] systemd[1]: Detected architecture x86-64.
[    8.787249] systemd[1]: Set hostname to <mona-4gpu>.
[   11.016341] hv_pci 47505500-0001-0000-3130-444531303244: PCI VMBus probing: Using version 0x10002
[   11.023579] hv_pci 47505500-0001-0000-3130-444531303244: PCI host bridge to bus 0001:00
[   11.027825] pci_bus 0001:00: root bus resource [mem 0x41000000-0x41ffffff window]
[   11.031957] pci_bus 0001:00: root bus resource [mem 0x1000000000-0x1401ffffff window]
[   11.034094] pci_bus 0001:00: No busn resource found for root bus, will use [bus 00-ff]
[   11.048112] pci 0001:00:00.0: [10de:102d] type 00 class 0x030200
[   11.059193] pci 0001:00:00.0: reg 0x10: [mem 0x41000000-0x41ffffff]
[   11.063797] pci 0001:00:00.0: reg 0x14: [mem 0x1000000000-0x13ffffffff 64bit pref]
[   11.068933] pci 0001:00:00.0: reg 0x1c: [mem 0x1400000000-0x1401ffffff 64bit pref]
[   11.075186] pci 0001:00:00.0: Enabling HDA controller
[   11.096958] pci_bus 0001:00: busn_res: [bus 00-ff] end is updated to 00
[   11.100555] pci 0001:00:00.0: BAR 1: assigned [mem 0x1000000000-0x13ffffffff 64bit pref]
[   11.106291] pci 0001:00:00.0: BAR 3: assigned [mem 0x1400000000-0x1401ffffff 64bit pref]
[   11.111438] pci 0001:00:00.0: BAR 0: assigned [mem 0x41000000-0x41ffffff]
[   11.339448] hv_pci 47505500-0002-0000-3130-444531303244: PCI VMBus probing: Using version 0x10002
[   11.345352] hv_pci 47505500-0002-0000-3130-444531303244: PCI host bridge to bus 0002:00
[   11.349654] pci_bus 0002:00: root bus resource [mem 0x42000000-0x42ffffff window]
[   11.356577] pci_bus 0002:00: root bus resource [mem 0x1800000000-0x1c01ffffff window]
[   11.358771] pci_bus 0002:00: No busn resource found for root bus, will use [bus 00-ff]
[   11.368143] pci 0002:00:00.0: [10de:102d] type 00 class 0x030200
[   11.373039] pci 0002:00:00.0: reg 0x10: [mem 0x42000000-0x42ffffff]
[   11.377468] pci 0002:00:00.0: reg 0x14: [mem 0x1800000000-0x1bffffffff 64bit pref]
[   11.382569] pci 0002:00:00.0: reg 0x1c: [mem 0x1c00000000-0x1c01ffffff 64bit pref]
[   11.388831] pci 0002:00:00.0: Enabling HDA controller
[   11.410746] pci_bus 0002:00: busn_res: [bus 00-ff] end is updated to 00
[   11.414308] pci 0002:00:00.0: BAR 1: assigned [mem 0x1800000000-0x1bffffffff 64bit pref]
[   11.419473] pci 0002:00:00.0: BAR 3: assigned [mem 0x1c00000000-0x1c01ffffff 64bit pref]
[   11.424629] pci 0002:00:00.0: BAR 0: assigned [mem 0x42000000-0x42ffffff]
[   11.755926] hv_pci 47505500-0003-0000-3130-444531303244: PCI VMBus probing: Using version 0x10002
[   11.761592] hv_pci 47505500-0003-0000-3130-444531303244: PCI host bridge to bus 0003:00
[   11.765494] pci_bus 0003:00: root bus resource [mem 0x43000000-0x43ffffff window]
[   11.774835] pci_bus 0003:00: root bus resource [mem 0x2000000000-0x2401ffffff window]
[   11.785433] pci_bus 0003:00: No busn resource found for root bus, will use [bus 00-ff]
[   11.792731] pci 0003:00:00.0: [10de:102d] type 00 class 0x030200
[   11.797690] pci 0003:00:00.0: reg 0x10: [mem 0x43000000-0x43ffffff]
[   11.801934] pci 0003:00:00.0: reg 0x14: [mem 0x2000000000-0x23ffffffff 64bit pref]
[   11.807265] pci 0003:00:00.0: reg 0x1c: [mem 0x2400000000-0x2401ffffff 64bit pref]
[   11.813646] pci 0003:00:00.0: Enabling HDA controller
[   11.835600] pci_bus 0003:00: busn_res: [bus 00-ff] end is updated to 00
[   11.838872] pci 0003:00:00.0: BAR 1: assigned [mem 0x2000000000-0x23ffffffff 64bit pref]
[   11.843780] pci 0003:00:00.0: BAR 3: assigned [mem 0x2400000000-0x2401ffffff 64bit pref]
[   11.848750] pci 0003:00:00.0: BAR 0: assigned [mem 0x43000000-0x43ffffff]
[   12.089166] hv_pci 47505500-0004-0000-3130-444531303244: PCI VMBus probing: Using version 0x10002
[   12.094796] hv_pci 47505500-0004-0000-3130-444531303244: PCI host bridge to bus 0004:00
[   12.098812] pci_bus 0004:00: root bus resource [mem 0x44000000-0x44ffffff window]
[   12.102907] pci_bus 0004:00: root bus resource [mem 0x2800000000-0x2c01ffffff window]
[   12.106674] pci_bus 0004:00: No busn resource found for root bus, will use [bus 00-ff]
[   12.113071] pci 0004:00:00.0: [10de:102d] type 00 class 0x030200
[   12.117586] pci 0004:00:00.0: reg 0x10: [mem 0x44000000-0x44ffffff]
[   12.121821] pci 0004:00:00.0: reg 0x14: [mem 0x2800000000-0x2bffffffff 64bit pref]
[   12.126641] pci 0004:00:00.0: reg 0x1c: [mem 0x2c00000000-0x2c01ffffff 64bit pref]
[   12.132753] pci 0004:00:00.0: Enabling HDA controller
[   12.155935] pci_bus 0004:00: busn_res: [bus 00-ff] end is updated to 00
[   12.161096] pci 0004:00:00.0: BAR 1: assigned [mem 0x2800000000-0x2bffffffff 64bit pref]
[   12.166092] pci 0004:00:00.0: BAR 3: assigned [mem 0x2c00000000-0x2c01ffffff 64bit pref]
[   12.171348] pci 0004:00:00.0: BAR 0: assigned [mem 0x44000000-0x44ffffff]
[   12.404542] systemd[1]: dev-disk-cloud-azure_resource\x2dpart1.device: Requested dependency After=network-online.target ignored (device units cannot be delayed).
[   12.412886] systemd[1]: dev-disk-cloud-azure_resource\x2dpart1.device: Requested dependency After=network.target ignored (device units cannot be delayed).
[   13.270614] systemd[1]: Unnecessary job for /sys/devices/virtual/misc/vmbus!hv_vss was removed.
[   13.275757] systemd[1]: Unnecessary job for /sys/devices/virtual/misc/vmbus!hv_fcopy was removed.
[   13.282592] systemd[1]: Created slice Slice for Azure VM Agent and Extensions.
[   13.292057] systemd[1]: Created slice system-modprobe.slice.
[   13.298934] systemd[1]: Created slice system-serial\x2dgetty.slice.
[   13.306077] systemd[1]: Created slice system-systemd\x2dfsck.slice.
[   13.312719] systemd[1]: Created slice User and Session Slice.
[   13.318152] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[   13.324729] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[   13.331780] systemd[1]: Reached target User and Group Name Lookups.
[   13.336987] systemd[1]: Reached target Slices.
[   13.340936] systemd[1]: Reached target Swap.
[   13.344734] systemd[1]: Reached target System Time Set.
[   13.349257] systemd[1]: Listening on Device-mapper event daemon FIFOs.
[   13.355436] systemd[1]: Listening on LVM2 poll daemon socket.
[   13.361039] systemd[1]: Listening on multipathd control socket.
[   13.366375] systemd[1]: Listening on Syslog Socket.
[   13.370877] systemd[1]: Listening on fsck to fsckd communication Socket.
[   13.376696] systemd[1]: Listening on initctl Compatibility Named Pipe.
[   13.382273] systemd[1]: Listening on Journal Audit Socket.
[   13.387046] systemd[1]: Listening on Journal Socket (/dev/log).
[   13.392118] systemd[1]: Listening on Journal Socket.
[   13.396594] systemd[1]: Listening on Network Service Netlink Socket.
[   13.401986] systemd[1]: Listening on udev Control Socket.
[   13.407644] systemd[1]: Listening on udev Kernel Socket.
[   13.414224] systemd[1]: Mounting Huge Pages File System...
[   13.420486] systemd[1]: Mounting POSIX Message Queue File System...
[   13.427957] systemd[1]: Mounting Kernel Debug File System...
[   13.434942] systemd[1]: Mounting Kernel Trace File System...
[   13.442323] systemd[1]: Starting Journal Service...
[   13.448078] systemd[1]: Starting Set the console keyboard layout...
[   13.455137] systemd[1]: Starting Create list of static device nodes for the current kernel...
[   13.464579] systemd[1]: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
[   13.474657] systemd[1]: Starting Load Kernel Module chromeos_pstore...
[   13.483176] systemd[1]: Starting Load Kernel Module drm...
[   13.489863] systemd[1]: Starting Load Kernel Module efi_pstore...
[   13.495960] systemd[1]: Starting Load Kernel Module pstore_blk...
[   13.503376] systemd[1]: Starting Load Kernel Module pstore_zone...
[   13.511644] systemd[1]: Starting Load Kernel Module ramoops...
[   13.516523] systemd[1]: Condition check resulted in OpenVSwitch configuration for cleanup being skipped.
[   13.522080] systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
[   13.528241] systemd[1]: Starting File System Check on Root Device...
[   13.535920] systemd[1]: Starting Load Kernel Modules...
[   13.541818] systemd[1]: Starting udev Coldplug all Devices...
[   13.548861] systemd[1]: Starting Uncomplicated firewall...
[   13.556112] systemd[1]: Starting Setup network rules for WALinuxAgent...
[   13.564372] systemd[1]: Started Journal Service.
[   13.705556] IPMI message handler: version 39.2
[   13.735757] ipmi device interface
[   13.764996] EXT4-fs (sdb1): re-mounted. Opts: discard. Quota mode: none.
[   13.775583] systemd-journald[373]: Received client request to flush runtime journal.
[   15.332995] cryptd: max_cpu_qlen set to 1000
[   15.456142] hv_vmbus: registering driver hyperv_keyboard
[   15.456714] hid: raw HID events driver (C) Jiri Kosina
[   15.462303] hv_vmbus: registering driver hid_hyperv
[   15.465473] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input4
[   15.467975] hid 0006:045E:0621.0001: input: VIRTUAL HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on 
[   15.468395] input: AT Translated Set 2 keyboard as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/d34b2567-b9b6-42b9-8778-0a4ec0b955bf/serio2/input/input3
[   15.576399] hv_vmbus: registering driver hv_netvsc
[   15.580437] AVX2 version of gcm_enc/dec engaged.
[   15.580604] AES CTR mode by8 optimization enabled
[   15.805794] hv_vmbus: registering driver hyperv_drm
[   15.806276] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] Synthvid Version major 3, minor 5
[   15.806398] hyperv_drm 0000:00:08.0: vgaarb: deactivate vga console
[   15.806518] Console: switching to colour dummy device 80x25
[   15.807289] [drm] Initialized hyperv_drm 1.0.0 2020 for 5620e0c7-8062-4dce-aeb7-520c7ef76171 on minor 0
[   15.881180] hv_utils: KVP IC version 4.0
[   16.179344] Console: switching to colour frame buffer device 128x48
[   16.226140] bpfilter: Loaded bpfilter_umh pid 556
[   16.226439] Started bpfilter
[   16.362117] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] fb0: hyperv_drmdrmfb frame buffer device
[   19.727338] nvidia: loading out-of-tree module taints kernel.
[   19.727354] nvidia: module license 'NVIDIA' taints kernel.
[   19.727355] Disabling lock debugging due to kernel taint
[   19.748711] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   19.764615] nvidia-nvlink: Nvlink Core is being initialized, major device number 511

[   19.765900] nvidia 0001:00:00.0: enabling device (0540 -> 0542)
[   20.390246] nvidia 0001:00:00.0: can't derive routing for PCI INT A
[   20.390250] nvidia 0001:00:00.0: PCI INT A: no GSI
[   20.524538] nvidia 0002:00:00.0: enabling device (0540 -> 0542)
[   21.150585] nvidia 0002:00:00.0: can't derive routing for PCI INT A
[   21.150591] nvidia 0002:00:00.0: PCI INT A: no GSI
[   21.284665] nvidia 0003:00:00.0: enabling device (0540 -> 0542)
[   21.910894] nvidia 0003:00:00.0: can't derive routing for PCI INT A
[   21.910900] nvidia 0003:00:00.0: PCI INT A: no GSI
[   22.044839] nvidia 0004:00:00.0: enabling device (0540 -> 0542)
[   22.669651] nvidia 0004:00:00.0: can't derive routing for PCI INT A
[   22.669657] nvidia 0004:00:00.0: PCI INT A: no GSI
[   22.802849] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.141.10  Thu Sep 22 00:43:55 UTC 2022
[   22.913525] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.141.10  Thu Sep 22 00:32:43 UTC 2022
[   22.949241] [drm] [nvidia-drm] [GPU ID 0x00010000] Loading driver
[   22.949245] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0001:00:00.0 on minor 1
[   22.949332] [drm] [nvidia-drm] [GPU ID 0x00020000] Loading driver
[   22.949333] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0002:00:00.0 on minor 2
[   22.949419] [drm] [nvidia-drm] [GPU ID 0x00030000] Loading driver
[   22.949420] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0003:00:00.0 on minor 3
[   22.949502] [drm] [nvidia-drm] [GPU ID 0x00040000] Loading driver
[   22.949503] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0004:00:00.0 on minor 4
[   23.086864] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   23.090369] nvidia-uvm: Loaded the UVM driver, major device number 509.
[   23.236993] alua: device handler registered
[   23.238047] emc: device handler registered
[   23.239250] rdac: device handler registered
[   23.565821] loop0: detected capacity change from 0 to 3864
[   23.618753] loop1: detected capacity change from 0 to 129520
[   23.639421] loop2: detected capacity change from 0 to 129584
[   23.667327] loop3: detected capacity change from 0 to 101624
[   23.691346] loop4: detected capacity change from 0 to 188072
[   23.715323] loop5: detected capacity change from 0 to 138880
[   24.550754] audit: type=1400 audit(1672932266.892:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=723 comm="apparmor_parser"
[   24.551056] audit: type=1400 audit(1672932266.892:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/freshclam" pid=720 comm="apparmor_parser"
[   24.563642] audit: type=1400 audit(1672932266.908:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=715 comm="apparmor_parser"
[   24.573621] audit: type=1400 audit(1672932266.916:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/clamd" pid=721 comm="apparmor_parser"
[   24.581578] audit: type=1400 audit(1672932266.924:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=722 comm="apparmor_parser"
[   24.581582] audit: type=1400 audit(1672932266.924:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=722 comm="apparmor_parser"
[   24.581585] audit: type=1400 audit(1672932266.924:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=722 comm="apparmor_parser"
[   24.581587] audit: type=1400 audit(1672932266.924:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/{,usr/}sbin/dhclient" pid=722 comm="apparmor_parser"
[   24.618851] audit: type=1400 audit(1672932266.960:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=724 comm="apparmor_parser"
[   24.618856] audit: type=1400 audit(1672932266.960:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=724 comm="apparmor_parser"
[   35.750398] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   40.903825]  sda: sda1
[   43.944170] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[   44.448569] hv_utils: VSS: userspace daemon ver. 129 connected
[   51.319477] sched: RT throttling activated
[   51.527379] hv_balloon: Max. dynamic memory size: 229376 MB
[   51.686636] loop6: detected capacity change from 0 to 8
[   51.734937] aufs 5.15.5-20211129
[   52.054647] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[   52.055019] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[   59.413039] kauditd_printk_skb: 23 callbacks suppressed
[   59.413043] audit: type=1400 audit(1672932301.755:35): apparmor="STATUS" operation="profile_load" profile="unconfined" name="docker-default" pid=2525 comm="apparmor_parser"
[   59.747119] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[   59.747695] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[   63.791195] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[   63.794139] Bridge firewalling registered
[   64.035685] Initializing XFRM netlink socket
[   67.370234] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[   67.370930] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[   74.831906] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[   74.832779] nvidia-modeset: WARNING: GPU:0: Correcting number of heads for current head configuration (0x00)
[  125.195514] Adding 63999996k swap on /mnt/swapfile.  Priority:-2 extents:41 across:68276220k FS
[  125.304783] FS-Cache: Loaded
[  125.396378] FS-Cache: Netfs 'cifs' registered for caching
[  125.399215] Key type cifs.spnego registered
[  125.399240] Key type cifs.idmap registered
[  125.409547] CIFS: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3.1.1), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3.1.1 (or even SMB3 or SMB2.1) specify vers=1.0 on mount.
$ gpustat
mona-4gpu  Thu Jan  5 16:09:50 2023  470.141.10
[0] Tesla K80 | 34°C,   0 % |     0 / 11441 MB |
[1] Tesla K80 | 27°C,   0 % |     0 / 11441 MB |
[2] Tesla K80 | 35°C,   0 % |     0 / 11441 MB |
[3] Tesla K80 | 28°C,   0 % |     0 / 11441 MB |

My code uses PyTorch's DataParallel for training, and I used all 4 of the available GPUs.

Describe the bug

  1. Idle Shutdown prevents me from finding the root cause of a training problem.
  2. The same training code with the same dataset, started at the same time on a local GPU server (with 2 GPUs that have only 8 GB of GPU memory each), is still training with no error and is on epoch 31, while the job on the Azure VM failed at epoch 11 (I am not sure why).

Expected behavior: To be able to debug the root cause of the training failure.

pvaneck commented 1 year ago

Hey, @monajalal. I don't think this is specific to the Azure SDK, and it might be worth raising the issue here or with PyTorch.

In any case, I'll tag the ML team in case they might have some insight. @luigiw @azureml-github

During epoch 11, does PyTorch terminate with no error message?
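If the only record of the run was the tmux scrollback, one way to keep evidence across a node shutdown is to send errors to a file on persistent storage. Below is a minimal sketch using only the Python standard library. The function name `install_crash_logging` and the log paths are hypothetical, and nothing in this thread confirms that an uncaught Python exception was the cause. A process killed with SIGKILL (for example by the kernel OOM killer) cannot log anything itself, so this would not catch that case.

```python
import faulthandler
import logging
import sys


def install_crash_logging(log_path):
    """Route uncaught exceptions and fatal-signal tracebacks to files.

    log_path is a hypothetical location; point it at storage that
    survives the node being shut down.
    """
    logger = logging.getLogger("train")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)

    def log_uncaught(exc_type, exc_value, exc_tb):
        # Record the full traceback before the interpreter exits.
        logger.critical(
            "uncaught exception", exc_info=(exc_type, exc_value, exc_tb)
        )

    sys.excepthook = log_uncaught

    # faulthandler dumps Python tracebacks on fatal signals such as
    # SIGSEGV or SIGABRT. The file must stay open while it is enabled.
    fault_file = open(log_path + ".fault", "w")
    faulthandler.enable(file=fault_file)
    return logger, fault_file
```

Calling `install_crash_logging("train.log")` at the top of the training script and adding a `logger.info(...)` line per epoch would at least show how far the run got and whether Python raised before exiting.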

pr-scandas commented 1 year ago

Hi all,

I have the same issue on my side. The main point is that the idle shutdown activates while a job is running, so I don't think it is directly linked to PyTorch. Were you able to find a solution to avoid this?

I was also using tmux on my side. The overall steps to reproduce are much the same as those described by @monajalal.

Many thanks
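Independently of why the node went down, a job that records its progress after every epoch can resume instead of restarting from scratch. The following is a minimal sketch of that pattern using only the standard library. The names `save_checkpoint`, `load_checkpoint`, `train`, and `run_epoch` are hypothetical, and it stores only the epoch counter. With PyTorch one would typically also save the model and optimizer state (for example with `torch.save`), which is not shown here. It does not prevent the idle shutdown itself.

```python
import json
import os


def save_checkpoint(path, state):
    """Write state atomically so an interrupted write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in a single step.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_epochs, run_epoch):
    """Run the remaining epochs, checkpointing after each one.

    run_epoch is a hypothetical callable that trains a single epoch.
    """
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        run_epoch(epoch)
        save_checkpoint(path, {"epoch": epoch + 1})
```

If the checkpoint path is on storage that outlives the node, rerunning the same script after a restart continues from the first epoch that did not complete.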