-
How should I prepare my code (data loaders, model, etc.) to train in combined Data and Expert Parallel mode?
And what is the difference between the "auto", "model", and "data" --parallel types?
In my …
-
## 🐛 Bug
When saving a DDP module with torch.save(), unexpected pickling errors occurred.
My model uses an encoder-decoder framework, and the encoder contains a BertModel from transformers (Hugging Face)…
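The actual cause is cut off in this report, so the following is only a rough, standard-library analogy of one common pattern: pickling a wrapper object fails when it holds a runtime handle, while its plain state can still be saved. The `Wrapper` class and `try_pickle` helper are hypothetical names for illustration, not part of PyTorch or transformers.

```python
import pickle
import threading


class Wrapper:
    """Hypothetical stand-in for a distributed wrapper around a model."""

    def __init__(self, weights):
        self.weights = weights          # plain data, picklable
        self._lock = threading.Lock()   # runtime handle, not picklable

    def state_dict(self):
        # Return only the plain data, leaving runtime handles behind.
        return dict(self.weights)


def try_pickle(obj):
    """Return True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


wrapper = Wrapper({"layer.weight": [0.1, 0.2]})
print(try_pickle(wrapper))               # the whole wrapper, lock included
print(try_pickle(wrapper.state_dict()))  # only the plain state
```

In PyTorch terms, the analogous workaround people often try is saving `ddp_model.module.state_dict()` rather than the wrapped object itself; whether that resolves this particular error is an assumption.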
-
## Instructions To Reproduce the Issue:
To speed up training, I added a torch.compile call after DistributedDataParallel in detectron2/engine/defaults.py:
```
ddp = DistributedDataParallel(mo…
```
-
I am trying to train on an 8xA100 instance. If I set `trainer_arguments.gradient_checkpointing` to `True`, the training hangs for a while and then dies with a `Segmentation fault (core dumped)` error. …
-
Back in September, PyTorch introduced `torch.optim._multi_tensor` (https://github.com/pytorch/pytorch/pull/43507), which should be much more efficient for situations with lots of small feature tensors (`t…
-
### System Info
Official TRL DPO examples, fine-tuning Llama 3.1 with LoRA.
params:
lora_rank: 32
lora_target: all
pref_beta: 0.2
pref_loss: sigmoid
### dataset
dataset: train_data
template:…
-
Has anyone encountered the following problem? I used SiD-LSG to distill an SDXL model (made some code adaptations to the text-encoder), and some color spots appeared on the face, which were very obvio…
-
## 🐛 Bug
Returning None from training_step during multi-GPU DDP training freezes training without raising an exception.
### To Reproduce
Starting multi-gpu training with a None-returning training_step fu…
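One plausible mechanism, not confirmed in this report, is that the rank returning None skips the gradient synchronization while the other ranks wait for it. The sketch below simulates that with threads and a `threading.Barrier` standing in for the collective call; `WORLD_SIZE`, `TIMEOUT_S`, and `run_step` are hypothetical names, and no Lightning or PyTorch API is used.

```python
import threading

WORLD_SIZE = 2    # hypothetical number of ranks, simulated as threads
TIMEOUT_S = 0.5   # how long a rank waits before giving up


def run_step(skip_rank):
    """Simulate one training step and return each rank's outcome."""
    barrier = threading.Barrier(WORLD_SIZE)  # stands in for gradient all-reduce
    outcome = {}

    def worker(rank):
        loss = None if rank == skip_rank else 1.0  # None mimics a skipped batch
        if loss is None:
            outcome[rank] = "skipped"
            return  # this rank never joins the collective
        try:
            barrier.wait(timeout=TIMEOUT_S)
            outcome[rank] = "synced"
        except threading.BrokenBarrierError:
            # Without the timeout, this rank would block indefinitely.
            outcome[rank] = "timed out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


print(run_step(skip_rank=None))  # every rank joins the collective
print(run_step(skip_rank=0))     # rank 0 skips, rank 1 is left waiting
```

The timeout exists only so the simulation terminates; a real collective with no timeout would match the silent freeze described above.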
-
I'm trying to register SLURM nodes as agents for sweeps. I'm using PyTorch Lightning with DDP and multiple GPUs. Following the recommendations from PyTorch Lightning ([here](https://lightning.ai/docs/…
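As a minimal sketch of one possible approach (an assumption, since the linked recommendations are cut off here), a job could read Slurm's own environment variables and let only the process with global rank 0 start the sweep agent. `SLURM_PROCID` and `SLURM_NTASKS` are set by Slurm; the helper names `slurm_rank` and `should_start_agent` are hypothetical.

```python
import os


def slurm_rank(env=None):
    """Return (global_rank, world_size) from Slurm's environment.

    Falls back to a single process when the variables are absent.
    """
    env = os.environ if env is None else env
    return int(env.get("SLURM_PROCID", 0)), int(env.get("SLURM_NTASKS", 1))


def should_start_agent(env=None):
    """Assumption: only global rank 0 registers the sweep agent."""
    rank, _ = slurm_rank(env)
    return rank == 0


if should_start_agent():
    print("this process would start the sweep agent")
else:
    print("this process would only join DDP training")
```

Passing an explicit `env` dictionary makes the decision easy to check outside a Slurm job.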
-
On a V100, since there is only one GPU, I changed the config file to use_ddp: False and ran python train/train_personalized.py configs/training/lora_personalization.yaml
This raises an error:
`File "UltraPixel-main/train/train_personalized.py", line 368, in setup_opti…