-
Hello author.
The following code and options were used for training (the code was rewritten to work with that option and is otherwise unchanged):
`python3 -m torch.distributed.launch --nproc_per_node=1 tra…
-
This happens at least with the xlnet model. With a high max_len, it crashes without printing any error. Training with 1 GPU works well. With a low max_len, I get the error below. I'm using 4 NVIDIA V100 GPUs.
…
-
What is the current state of FSDP2 support in PyTorch main?
I only see this here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_composable/fully_shard.py#L45
> "`torch.distributed._composab…
-
"Metis: Fast Automatic Distributed Training on Heterogeneous GPUs" is excellent work. However, I have a couple of questions about the code:
1. Why are the configuration files execution_memor…
-
**Dear author, thank you very much for your excellent work on this project. When I train my own SGDet model, I encounter two errors during the validation phase.
The first is as follows:**
`Traceback (m…
-
Hello! Upon system startup, opening Code-OSS hangs due to the files `/vs/platform/windows/electron-main/windowImpl.js` and `/vs/platform/windows/electron-main/windowsjs` required by `/vs/modules/patch…
-
Thank you for sharing your excellent work.
I'm interested in applying it to my research, so I followed your instructions to reproduce the results. However, when training on ImageNet, I repeatedly e…
-
Platforms: linux
This test was disabled because it is failing in CI. See [recent examples](https://hud.pytorch.org/flakytest?name=test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_run…
-
Hi, thanks for the nice work!
I tried to implement your code but found that the training was very slow. I saw that you use distributed training in the code. Could you kindly provide more info on your…
-
## Background 🌎
As we work towards a first implementation of the solution, it's important to know what's going to be built.
## Objective 🎯
Define and document the architecture for the first version o…