-
The vae_checkpoint you provided is diffusion_pytorch_model.safetensors, not a .ckpt or .pth file. But in ldm/train_unconditional.py, vae_checkpoint = torch.load(args.vae_checkpoint, map_location='cpu'…
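
One possible workaround, sketched here under assumptions rather than taken from the repository: choose the loader from the file extension, so that a `.safetensors` file goes through `safetensors.torch.load_file` and other checkpoints still go through `torch.load`. The helper names `pick_loader` and `load_vae_state_dict` are hypothetical, as is the set of extensions treated as PyTorch checkpoints.

```python
from pathlib import Path


def pick_loader(checkpoint_path: str) -> str:
    """Return which loader fits the file, judged by its extension alone."""
    suffix = Path(checkpoint_path).suffix.lower()
    if suffix == ".safetensors":
        return "safetensors"
    # Assumption: these extensions all hold pickled PyTorch checkpoints.
    if suffix in {".ckpt", ".pth", ".pt", ".bin"}:
        return "torch"
    raise ValueError(f"Unrecognized checkpoint extension: {suffix!r}")


def load_vae_state_dict(checkpoint_path: str):
    """Load weights on CPU with the loader that matches the file format."""
    if pick_loader(checkpoint_path) == "safetensors":
        # Imported lazily so pick_loader works without the package installed.
        from safetensors.torch import load_file

        return load_file(checkpoint_path, device="cpu")

    import torch

    return torch.load(checkpoint_path, map_location="cpu")
```

Note that `load_file` returns a flat mapping from parameter names to tensors, while a `.ckpt` file may wrap the weights under a key such as `state_dict`, so the code that consumes the result might need a matching adjustment.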
-
As the title says: I trained stage-3. Training ran normally, but inference fails with the following error:
Traceback (most recent call last):
File "/root/GOT-OCR2.0/GOT-OCR-2.0-master/GOT/demo/run_ocr_2.0.py", line 245, in
eval_model(args)
File "/root/GOT-OCR2.0/…
-
**Dear author, thank you very much for your excellent work on this project. When I train my own SGDet model, I encounter two errors during the validation phase.
The first is as follows:**
`Traceback (m…
-
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/wiz-llm-storage/anaconda3/envs/h2_nemo/lib/python3.10/site-packages/torch/utils…
-
Hi there,
amazing work :)
I just encountered an error while trying to run the library on an Apple M3 Max. Below is a MWE to reproduce the error. The example itself doesn't make sense but at leas…
-
## This is for bugs only
Did you already ask [in the discord](https://discord.gg/VXmU2f5WEU)?
No
You verified that this is a bug and not a feature request or question by asking [in the discor…
-
As I understand it, the pretraining code broadcasts the data from TP rank 0 to the other TP ranks.
However, if I activate the option `train_valid_test_datasets_provider.is_distributed = True` wh…
-
Hi,
I'm using the XHumans dataset, which provides meshes. I rendered normal maps using the camera poses and got the following error while training the model. Any insight would be helpful.
```
true_normal torch.Siz…
```
-
**Reproduction**
I am trying to fine-tune the Qwen2-0.5B model on some training data using a multi-GPU setup. The same code (given further below) seems to work in a single-GPU setting (when I set CUDA_V…
-
## 🐛 Bug
Hi, we are using Lightning with litdata on our local machine and AWS S3. However, training hangs randomly during the very first iterations when using DDP with a remote cloud directory.
…