-
### 🐛 Describe the bug
Hello,
I'm a new PyTorch user and recently tried to run the Flight Recorder code provided in the tools, but I cannot get it to execute as expected.
I use ngc 24.10…
-
Hello author, I ran into this problem while running your training code. Is there a way to fix it?
/home/amax/anaconda3/bin/conda run -n WalMaFa --no-capture-output python /data1/WalMaFa/train.py
load training yaml file: ./configs/LOL/train/training_LOL.yaml
==…
-
Here is a quick rundown of what ChatGPT had to say about it:
Combining `argparse` and `Hydra` is a useful approach when you want to manage configurations using Hydra while still maintaining some fl…
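As a rough illustration of the idea, one possible way to combine the two is to let `argparse` consume only the flags it knows about and treat everything left over as Hydra-style `key=value` overrides. The sketch below uses only the standard library. The `--debug` flag, the helper name `split_cli_args`, and the override strings are hypothetical and are not taken from the original discussion. The leftover list could then be handed to Hydra, for example through the `overrides` argument of its Compose API, though that step is omitted here.

```python
import argparse


def split_cli_args(argv):
    """Separate argparse-managed flags from leftover Hydra-style overrides.

    Hypothetical helper: `--debug` stands in for any flag that should stay
    under argparse control.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--debug", action="store_true")  # hypothetical flag
    # parse_known_args returns (namespace, list_of_unrecognized_arguments)
    known, overrides = parser.parse_known_args(argv)
    return known, overrides


if __name__ == "__main__":
    # Hypothetical command line: one argparse flag plus two key=value overrides.
    known, overrides = split_cli_args(["--debug", "model.lr=0.01", "trainer.epochs=5"])
    print(known.debug)
    print(overrides)
```

Because `parse_known_args` does not reject arguments it does not recognise, the two tools can share one command line without either needing to know the other's options.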
-
Hi, thanks for the library! I have a naive thought: we know that the forward/backward pass in deep learning cannot be parallelized across layers, because each operation/layer must be computed before the next one. But wha…
-
### Describe the Bug
paddlepaddle-gpu 2.6.0.post117
paddlenlp : https://github.com/ZHUI/PaddleNLP, branch : sci/benckmark
commit id 20fe363530c0e3868414f65ec394124ffac6…
-
-
First of all, thank you for your amazing work on the nnScaler project. It has been incredibly inspiring, and I’ve been learning from and using the contents of this repository in my own work.
I have a fe…
-
### System Info
```shell
accelerate 1.1.1
neuronx-cc 2.14.227.0+2d4f85be
neuronx-distributed 0.8.0
neuronx-distributed-training 1.0.0
optimum …
-
Greetings! After repeated checks, I found that the `chains` parameter does not work for models such as HDDMrl. Using it raises:
`PicklingError: Could not pickle the task to send it to the workers.`
…
-
Hi, I've tried training on a 32-core machine, so naturally I set num_parallel to 32. However, the model does not seem to learn at all. Weirdly, when I set num_parallel to 6, the model learns.
The rest of…