Closed: arbazk007 closed this issue 2 years ago
Looking at those logs it seems that the scheduler starts correctly. When you say "running state with failed instance" what do you mean?
When I click on the created EC2 instance in the console and go to the Monitoring tab, it shows that the instance status check failed. I thought this could be related because, even though my instance is in the running state, I cannot SSH into it. Also, as I mentioned, I am stuck at waiting for scheduler to run at 10.178.104.100:8786 and it doesn't proceed further.
Are you able to open the dashboard at `10.178.104.100:8787`?
I am using JupyterLab 1.2 (SageMaker Studio), so right now I am not able to access the dashboard.
If port `8787` is open then you should be able to access the regular Dask web dashboard (not the Jupyter Lab one). You could also just open a terminal in Jupyter and run `curl http://10.178.104.100:8787` to check that it is accessible.
If the cluster manager is hanging like this then it's likely that the EC2 instance is not accessible from the Sagemaker instance. So to troubleshoot it we need to check that connectivity.
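The connectivity check described above can also be run from a notebook cell instead of a terminal. This is only a minimal sketch using the standard library: the helper names `port_is_reachable` and `check_scheduler` are made up for illustration, and the ports are the scheduler (8786) and dashboard (8787) ports mentioned in this thread.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


def check_scheduler(host: str, ports=(8786, 8787), timeout: float = 3.0) -> dict:
    """Hypothetical helper: map each port to whether it accepted a TCP connection."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

Calling `check_scheduler("10.178.104.100")` from the SageMaker side returns a dictionary keyed by port. A `False` for either port suggests that the network path or the security group is blocking traffic rather than Dask itself failing to start.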
It throws a connection timeout error.
That sounds like the security groups aren't set up correctly to allow access to `8786` and `8787` then.
Were you able to check the security rules?
Yes. Since Dask Cloud Provider didn't provide an option to use a private IP, we didn't use the EC2 cluster. But yes, when we opened ports 8786 and 8787 to all in the security groups, it worked. Closing this issue.
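For readers who would rather not open the ports to all, the same rule can be scoped to a single source range. The following is a rough sketch, not the setup used in this thread: `dask_ingress_rules` and `open_dask_ports` are hypothetical helper names, and the security group ID and CIDR are placeholders you would replace with your own. It relies on boto3's `authorize_security_group_ingress` call.

```python
def dask_ingress_rules(source_cidr: str, ports=(8786, 8787)) -> list:
    """Build boto3-style IpPermissions entries allowing TCP from `source_cidr` to each port."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": source_cidr, "Description": "Dask scheduler/dashboard"}],
        }
        for port in ports
    ]


def open_dask_ports(group_id: str, source_cidr: str) -> None:
    """Apply the rules to an existing security group (needs AWS credentials and boto3)."""
    import boto3  # imported lazily so the rule builder works without boto3 installed

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=dask_ingress_rules(source_cidr),
    )
```

For example, `open_dask_ports("sg-PLACEHOLDER", "10.178.0.0/16")` would allow only that (assumed) private range to reach the scheduler and dashboard ports.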
👍. Just FYI private IP support was added in #353.
Thanks for the information, @jacobtomlinson. I was able to create EC2 clusters using a private IP.
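As a rough illustration of the private IP setup, the sketch below collects the relevant `EC2Cluster` keyword arguments. It assumes a Dask Cloud Provider release that includes the change from #353, which, as far as I know, exposes this as the `use_private_ip` option. The helper names and the security group and subnet IDs are placeholders, not values from this thread.

```python
def private_ip_cluster_kwargs(security_group_id: str, subnet_id: str, n_workers: int = 2) -> dict:
    """Hypothetical helper: keyword arguments for an EC2Cluster reachable over private IPs."""
    return {
        "n_workers": n_workers,
        "security_groups": [security_group_id],  # must allow 8786 and 8787 from the client
        "subnet_id": subnet_id,                   # a subnet the client can route to
        "use_private_ip": True,                   # connect via the private address
    }


def start_private_cluster(**kwargs):
    """Create the cluster (needs dask-cloudprovider installed and AWS credentials)."""
    from dask_cloudprovider.aws import EC2Cluster  # lazy import: only needed when launching

    return EC2Cluster(**kwargs)
```

A call such as `start_private_cluster(**private_ip_cluster_kwargs("sg-PLACEHOLDER", "subnet-PLACEHOLDER"))` would then launch the cluster. The client still needs a network route to the private address for this to work.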
I am able to create an EC2 instance (in the running state, but with a failed instance status check) using the command below, but the cluster hangs:
Environment:
- Dask version: 2022.02.1
- Dask Cloud Provider version: 2022.4.0
- Python version: 3.9
- Operating System: Ubuntu
Ports 8787 and 8786 are open in security groups.
Logs
```
f7 window]
[ 2.063330] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff window]
[ 2.069017] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[ 2.074192] pci_bus 0000:00: resource 7 [mem 0xc0000000-0xfebfffff window]
[ 2.079461] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[ 2.084006] pci 0000:00:01.0: Activating ISA DMA hang workarounds
[ 2.088842] pci 0000:00:03.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[ 2.096468] PCI: CLS 0 bytes, default 64
[ 2.100086] Trying to unpack rootfs image as initramfs...
[ 2.107460] Freeing initrd memory: 4532K
[ 2.111177] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 2.115853] software IO TLB: mapped [mem 0x00000000bbfe9000-0x00000000bffe9000] (64MB)
[ 2.123238] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x240937b9988, max_idle_ns: 440795218083 ns
[ 2.131182] clocksource: Switched to clocksource tsc
[ 2.135266] check: Scanning for low memory corruption every 60 seconds
[ 2.140628] Initialise system trusted keyrings
[ 2.144207] Key type blacklist registered
[ 2.147881] workingset: timestamp_bits=36 max_order=23 bucket_order=0
[ 2.153607] zbud: loaded
[ 2.156677] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 2.161204] fuse: init (API version 7.33)
[ 2.165380] integrity: Platform Keyring initialized
[ 2.176857] Key type asymmetric registered
[ 2.180780] Asymmetric key parser 'x509' registered
[ 2.184699] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 243)
[ 2.191541] io scheduler mq-deadline registered
[ 2.196038] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[ 2.200817] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
[ 2.207589] ACPI: Power Button [PWRF]
[ 2.211226] input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
[ 2.217895] ACPI: Sleep Button [SLPF]
[ 2.222113] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[ 2.254070] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[ 2.261298] Linux agpgart interface v0.103
[ 2.395717] loop: module loaded
[ 2.399358] nvme nvme0: pci function 0000:00:04.0
[ 2.403293] PCI Interrupt Link [LNKD] enabled at IRQ 11
[ 2.407198] libphy: Fixed MDIO Bus: probed
[ 2.410631] tun: Universal TUN/TAP device driver, 1.6
[ 2.414462] PPP generic driver version 2.4.2
[ 2.419781] VFIO - User Level meta-driver version: 0.3
[ 2.420868] nvme nvme0: 2/0/0 default/read/poll queues
[ 2.424228] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[ 2.432230] ehci-pci: EHCI PCI platform driver
[ 2.432234] nvme0n1: p1
[ 2.435690] ehci-platform: EHCI generic platform driver
[ 2.443062] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[ 2.447167] ohci-pci: OHCI PCI platform driver
[ 2.450607] ohci-platform: OHCI generic platform driver
[ 2.454339] uhci_hcd: USB Universal Host Controller Interface driver
[ 2.458566] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[ 2.465100] i8042: Warning: Keylock active
[ 2.469265] serio: i8042 KBD port at 0x60,0x64 irq 1
[ 2.473462] serio: i8042 AUX port at 0x60,0x64 irq 12
[ 2.477770] mousedev: PS/2 mouse device common for all mice
[ 2.482468] rtc_cmos 00:00: RTC can wake from S4
[ 2.487576] rtc_cmos 00:00: registered as rtc0
[ 2.493461] rtc_cmos 00:00: setting system clock to 2022-05-04T07:24:55 UTC (1651649095)
[ 2.500693] rtc_cmos 00:00: alarms up to one day, 114 bytes nvram
[ 2.505581] i2c /dev entries driver
[ 2.509060] device-mapper: uevent: version 1.0.3
[ 2.513012] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[ 2.520293] platform eisa.0: Probing EISA bus 0
[ 2.524146] platform eisa.0: EISA: Cannot allocate resource for mainboard
[ 2.528884] platform eisa.0: Cannot allocate resource for EISA slot 1
[ 2.533478] platform eisa.0: Cannot allocate resource for EISA slot 2
[ 2.538262] platform eisa.0: Cannot allocate resource for EISA slot 3
[ 2.542898] platform eisa.0: Cannot allocate resource for EISA slot 4
[ 2.548113] platform eisa.0: Cannot allocate resource for EISA slot 5
[ 2.552600] platform eisa.0: Cannot allocate resource for EISA slot 6
[ 2.557688] platform eisa.0: Cannot allocate resource for EISA slot 7
[ 2.562337] platform eisa.0: Cannot allocate resource for EISA slot 8
[ 2.567065] platform eisa.0: EISA: Detected 0 cards
[ 2.570940] intel_pstate: CPU model not supported
[ 2.575537] ledtrig-cpu: registered to indicate activity on CPUs
[ 2.579754] drop_monitor: Initializing network drop monitor service
[ 2.584086] NET: Registered protocol family 10
[ 2.588717] Segment Routing with IPv6
[ 2.592572] NET: Registered protocol family 17
[ 2.596740] Key type dns_resolver registered
[ 2.601408] No MBM correction factor available
[ 2.605291] IPI shorthand broadcast: enabled
[ 2.609168] sched_clock: Marking stable (1795726210, 813420534)->(2756437377, -147290633)
[ 2.616618] registered taskstats version 1
[ 2.620234] Loading compiled-in X.509 certificates
[ 2.625769] Loaded X.509 cert 'Build time autogenerated kernel key: ecb0788dc66cd3f3e8014485463fe429aa2b1eba'
[ 2.639339] Loaded X.509 cert 'Canonical Ltd. Live Patch Signing: 14df34d1a87cf37625abec039ef2bf521249b969'
[ 2.651319] Loaded X.509 cert 'Canonical Ltd. Kernel Module Signing: 88f752e560a1e0737e31163a466ad7b70a850c19'
[ 2.659552] blacklist: Loading compiled-in revocation X.509 certificates
[ 2.666509] Loaded X.509 cert 'Canonical Ltd. Secure Boot Signing: 61482aa2830d0ab2ad5af10b7250da9033ddcef0'
[ 2.679308] zswap: loaded using pool lzo/zbud
[ 2.683593] Key type ._fscrypt registered
[ 2.687386] Key type .fscrypt registered
[ 2.691110] Key type fscrypt-provisioning registered
[ 2.696260] Key type encrypted registered
[ 2.700147] AppArmor: AppArmor sha1 policy hashing enabled
[ 2.704363] ima: No TPM chip found, activating TPM-bypass!
[ 2.709635] ima: Allocated hash algorithm: sha1
[ 2.713924] ima: No architecture policies found
[ 2.718706] evm: Initialising EVM extended attributes:
[ 2.723132] evm: security.selinux
[ 2.726494] evm: security.SMACK64
[ 2.729956] evm: security.SMACK64EXEC
[ 2.734973] evm: security.SMACK64TRANSMUTE
[ 2.738835] evm: security.SMACK64MMAP
[ 2.742264] evm: security.apparmor
[ 2.745610] evm: security.ima
[ 2.748883] evm: security.capability
[ 2.752242] evm: HMAC attrs: 0x1
[ 2.755840] PM: Magic number: 6:408:419
[ 2.760519] RAS: Correctable Errors collector initialized.
[ 3.004423] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input2
[ 3.013128] md: Waiting for all devices to be available before autodetect
[ 3.018209] md: If you don't use raid, use raid=noautodetect
[ 3.022511] md: Autodetecting RAID arrays.
[ 3.026006] md: autorun ...
[ 3.029242] md: ... autorun DONE.
[ 3.041599] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[ 3.048699] VFS: Mounted root (ext4 filesystem) readonly on device 259:1.
[ 3.053231] devtmpfs: mounted
[ 3.057123] Freeing unused decrypted memory: 2036K
[ 3.061318] Freeing unused kernel image (initmem) memory: 2688K
[ 3.067764] Write protecting the kernel read-only data: 30720k
[ 3.072418] Freeing unused kernel image (text/rodata gap) memory: 2036K
[ 3.077263] Freeing unused kernel image (rodata/data gap) memory: 1916K
[ 3.158167] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[ 3.162430] x86/mm: Checking user space page tables
[ 3.240417] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[ 3.244647] Run /sbin/init as init process
[ 3.856100] systemd[1]: Inserted module 'autofs4'
[ 3.906327] systemd[1]: systemd 245.4-4ubuntu3.13 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=hybrid)
[ 3.921600] systemd[1]: Detected virtualization kvm.
[ 3.925253] systemd[1]: Detected architecture x86-64.
Welcome to Ubuntu 20.04.3 LTS!
[ 3.950734] systemd[1]: Set hostname to
```

Is there anything that I am doing wrong or missing here?