-
### Your current environment
```text
The output of `python collect_env.py`
```
### 🐛 Describe the bug
We received quite a lot of reports about "Watchdog caught collective operation timeout", which …
-
implemented the pychron watchdog
check status at http://129.138.12.35:5001/manage
emails matt, julia and jake on failure
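
A minimal sketch of the check-then-notify pattern described above, not the actual pychron implementation. The names `fetch_ok` and `check_once` are hypothetical, the recipient strings are placeholders because no real addresses are given, and `notify` is an assumed callback that stands in for whatever sends the email.

```python
from urllib.error import URLError
from urllib.request import urlopen

STATUS_URL = "http://129.138.12.35:5001/manage"  # status page named in the note
RECIPIENTS = ["matt", "julia", "jake"]  # placeholders; real addresses are not given


def fetch_ok(url, timeout=10.0):
    """Return True if the URL answers with HTTP 200 (assumed health criterion)."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error, etc.
        return False


def check_once(url, recipients, is_up=fetch_ok, notify=print):
    """Run one status check; call notify once per recipient on failure.

    is_up and notify are injectable so the logic can be exercised
    without a network connection or a mail server.
    """
    if is_up(url):
        return True
    for recipient in recipients:
        notify(f"watchdog: {url} failed status check; notifying {recipient}")
    return False
```

Calling `check_once(STATUS_URL, RECIPIENTS)` on a schedule (cron, a loop with `time.sleep`, etc.) would give the periodic behaviour. Swapping `notify` for a function that sends mail is left out here because the mail setup is not described.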
-
I'm not sure whether cifsd is part of this driver or is supplied by the host OS. I also wasn't able to locate the cifsd process on the file system of the affected hosts.
Furthermore, reading the sour…
-
Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).
Core 0 register dump:
PC : 0x400e35c8 PS : 0x00060b35 A0 : 0x800e1f71 A1 : 0x3ffd08f0
A2 : 0…
-
Do we want to have a watchdog? Does it work?
-
Need to implement watchdog timer interrupts.
From what I understand the main blocker is that QEMU does not properly support critical interrupts on PowerPC. Without them no proper watchdog support w…
-
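Hello, the following error occurred while fine-tuning the 52B model, specifically when converting the LoRA layer parameters during model saving. The code hangs at TeleChat-52B/deepspeed-finetune/utils/module/lora.py -> convert_lora_to_linear_layer -> with deepspeed.zero.GatheredParameters(), using ZeRO-3 + LoRA. The error message is as follows: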
epoc…
-
### ⚠️ Please check that this feature request hasn't been suggested before.
- [X] I searched previous [Ideas in Discussions](https://github.com/OpenAccess-AI-Collective/axolotl/discussions/categories…
-
Enable the [watchdog kselftest](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/watchdog/watchdog-test.c?h=v6.9-rc6) on all the Chromebooks in the Colla…
-
No matter what you do, you will get this error