-
I'm currently evaluating Tensorizer for handling large models, specifically models with more than 70B parameters that cannot fit on a single GPU.
I have a few questions and concerns reg…
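To make the setup concrete, here is a minimal sketch of the serialize/deserialize round trip I have in mind, assuming the `tensorizer` package's `TensorSerializer`/`TensorDeserializer` API and a hypothetical model id (`my-org/my-70b-model`); note that it streams onto a single device, which is exactly the part that breaks down for 70B+ models:

```python
import torch
from tensorizer import TensorSerializer, TensorDeserializer
from tensorizer.utils import no_init_or_tensor
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "my-org/my-70b-model"  # hypothetical model id

# One-time serialization: dump the module's tensors to a flat .tensors file.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
serializer = TensorSerializer("model.tensors")
serializer.write_module(model)
serializer.close()

# Later: build a weight-free skeleton, then stream tensors straight into it.
config = AutoConfig.from_pretrained(MODEL_ID)
model = no_init_or_tensor(lambda: AutoModelForCausalLM.from_config(config))
deserializer = TensorDeserializer("model.tensors", device="cuda:0", lazy_load=True)
deserializer.load_into_module(model)
deserializer.close()
```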
-
Hi team, thanks for sharing this great work. I ran into a problem when training with train.sh on a 40GB A100. I set batch_size=2 and gradient_accumulation_steps=16 with LR=5e-5 and 2.5e-5. The trai…
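For context, the loop below is how I understand the effective batch size in this configuration: batch_size=2 with gradient_accumulation_steps=16 gives an effective batch of 32 per GPU (a toy model and data stand in for the real script):

```python
import torch
from torch import nn

# Toy stand-ins; batch_size=2 and accum_steps=16 mirror the config above.
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
data = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(64)]
accum_steps = 16

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()    # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:  # one optimizer step per effective batch of 32
        optimizer.step()
        optimizer.zero_grad()
```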
-
## 🚀 Feature
I would like to request an extension of `ignite.distributed.utils.broadcast` to support the Path datatype, as it is frequently used in distributed training and can be very useful for designing D…
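In the meantime, this is the workaround sketch I am using (it assumes `idist.broadcast` accepts `str`, and that `safe_mode=True` is available in recent ignite versions to allow `None` on non-source ranks; `broadcast_path` is my own helper name):

```python
from pathlib import Path
import ignite.distributed as idist

def broadcast_path(path, src=0):
    # Round-trip through str, which idist.broadcast already supports.
    s = str(path) if path is not None else None
    s = idist.broadcast(s, src=src, safe_mode=True)
    return Path(s)

# e.g., rank 0 picks the run directory and shares it with every rank
run_dir = broadcast_path(Path("/tmp/run_001") if idist.get_rank() == 0 else None)
```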
-
(lmflow_train) root@duxact:/data/projects/lmflow/LMFlow# ./scripts/run_finetune.sh \
--model_name_or_path /data/guihunmodel8.8B \
--dataset_path /data/projects/lmflow/case_report_data \
--out…
-
/root/miniconda3/bin/python: can't open file 'main_simmim.py--cfg': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 19…
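The `'main_simmim.py--cfg'` in the message indicates the script name and the `--cfg` flag were concatenated into a single argument, which usually means a missing space (or a line-continuation backslash glued directly to the next line) between `main_simmim.py` and `--cfg` in the launch command.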
-
If you come across any Carpentries-style training materials for clusters/distributed systems, please post them here.
-
## ❓ Questions and Help
I am updating my training script to use Distributed Data Parallel for multi-GPU training.
I have completed most of the steps described in the PyTorch guidelines.
But I am c…
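For reference, this is the minimal DDP skeleton I am working from (a toy model and dataset stand in for my real ones; launched with `torchrun`):

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")  # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    sampler = DistributedSampler(dataset)    # shards data across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for epoch in range(2):
        sampler.set_epoch(epoch)             # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()                  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g.: torchrun --nproc_per_node=2 train_ddp.py
```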
-
https://arxiv.org/pdf/2010.05337
-
My code needs two features:
1. A bucket iterator;
2. Batches with a similar number of tokens in each batch (which means the batch size is not the same across batches).
I think I could implement the function …
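Something like the sketch below is what I had in mind for combining the two: sort indices by length (the bucketing), then cut batches on a token budget so batch size varies while padded token count stays roughly constant (`max_tokens` and the helper name are my own):

```python
import random

def token_budget_batches(examples, max_tokens, shuffle=True):
    """Yield lists of indices; each batch's padded token count stays <= max_tokens."""
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]))
    batches, batch = [], []
    for i in order:
        n = len(examples[i])  # ascending order: n is the batch's longest if added
        if batch and (len(batch) + 1) * n > max_tokens:
            batches.append(batch)
            batch = []
        batch.append(i)
    if batch:
        batches.append(batch)
    if shuffle:
        random.shuffle(batches)  # shuffle batch order, keep buckets intact
    return batches

# Quick check: padded size (batch size * longest example) respects the budget.
examples = [[0] * random.randint(5, 60) for _ in range(100)]
for b in token_budget_batches(examples, max_tokens=256):
    assert len(b) * max(len(examples[i]) for i in b) <= 256
```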
-
I ran into an issue training ResNet-50 with MoCo v3. Under the distributed training setting with 16 V100 GPUs (each process has only one GPU, batch size 4096), I get a training loss of about 27.2 in …