-
I'm currently evaluating Tensorizer for handling large models, specifically models with more than 70B parameters that cannot fit on a single GPU.
I have a few questions and concerns reg…
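To make the setup concrete, here is a minimal sketch of the serialize/deserialize round trip I have in mind, assuming the `tensorizer` package's `TensorSerializer`/`TensorDeserializer` API and a hypothetical model id (`my-org/my-70b-model`); note that it streams onto a single device, which is exactly the part that breaks down for 70B+ models:

```python
import torch
from tensorizer import TensorSerializer, TensorDeserializer
from tensorizer.utils import no_init_or_tensor
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "my-org/my-70b-model"  # hypothetical model id

# One-time serialization: dump the module's tensors to a flat .tensors file.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
serializer = TensorSerializer("model.tensors")
serializer.write_module(model)
serializer.close()

# Later: build a weight-free skeleton, then stream tensors straight into it.
config = AutoConfig.from_pretrained(MODEL_ID)
model = no_init_or_tensor(lambda: AutoModelForCausalLM.from_config(config))
deserializer = TensorDeserializer("model.tensors", device="cuda:0", lazy_load=True)
deserializer.load_into_module(model)
deserializer.close()
```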
-
Hi team, thanks for sharing this great work. I ran into a problem when training with train.sh on a 40GB A100. I set batch_size=2 and gradient_accumulation_steps=16 with LR=5e-5 and 2.5e-5. The trai…
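For context, the loop below is how I understand the effective batch size in this configuration: batch_size=2 with gradient_accumulation_steps=16 gives an effective batch of 32 per GPU (a toy model and data stand in for the real script):

```python
import torch
from torch import nn

# Toy stand-ins; batch_size=2 and accum_steps=16 mirror the config above.
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
data = [(torch.randn(2, 10), torch.randn(2, 1)) for _ in range(64)]
accum_steps = 16

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()    # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:  # one optimizer step per effective batch of 32
        optimizer.step()
        optimizer.zero_grad()
```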
-
## 🚀 Feature
I would like to request an extension of `ignite.distributed.utils.broadcast` to support the Path datatype, as it is frequently used in distributed training and can be very useful for designing D…
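In the meantime, this is the workaround sketch I am using (it assumes `idist.broadcast` accepts `str`, and that `safe_mode=True` is available in recent ignite versions to allow `None` on non-source ranks; `broadcast_path` is my own helper name):

```python
from pathlib import Path
import ignite.distributed as idist

def broadcast_path(path, src=0):
    # Round-trip through str, which idist.broadcast already supports.
    s = str(path) if path is not None else None
    s = idist.broadcast(s, src=src, safe_mode=True)
    return Path(s)

# e.g., rank 0 picks the run directory and shares it with every rank
run_dir = broadcast_path(Path("/tmp/run_001") if idist.get_rank() == 0 else None)
```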
-
(lmflow_train) root@duxact:/data/projects/lmflow/LMFlow# ./scripts/run_finetune.sh \
--model_name_or_path /data/guihunmodel8.8B \
--dataset_path /data/projects/lmflow/case_report_data \
--out…
-
/root/miniconda3/bin/python: can't open file 'main_simmim.py--cfg': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 19…
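The `'main_simmim.py--cfg'` in the message indicates the script name and the `--cfg` flag were concatenated into a single argument, which usually means a missing space (or a line-continuation backslash glued directly to the next line) between `main_simmim.py` and `--cfg` in the launch command.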
-
If you come across any Carpentries-style training materials for clusters/distributed systems, please post them here.
-
## ❓ Questions and Help
I am updating my training script to use Distributed Data Parallel for multi-GPU training.
I have completed most of the steps described in the PyTorch guidelines.
But I am c…
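For reference, this is the minimal DDP skeleton I am working from (a toy model and dataset stand in for my real ones; launched with `torchrun`):

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")  # torchrun sets RANK/WORLD_SIZE
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(10, 1).cuda(local_rank), device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    sampler = DistributedSampler(dataset)    # shards data across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for epoch in range(2):
        sampler.set_epoch(epoch)             # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()                  # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g.: torchrun --nproc_per_node=2 train_ddp.py
```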
-
https://arxiv.org/pdf/2010.05337
-
My code needs two features:
1. A bucket iterator;
2. Batches with a similar number of tokens in each batch (which means the batch size is not the same across batches).
I think I could implement the function …
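Something like the sketch below is what I had in mind for combining the two: sort indices by length (the bucketing), then cut batches on a token budget so batch size varies while padded token count stays roughly constant (`max_tokens` and the helper name are my own):

```python
import random

def token_budget_batches(examples, max_tokens, shuffle=True):
    """Yield lists of indices; each batch's padded token count stays <= max_tokens."""
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]))
    batches, batch = [], []
    for i in order:
        n = len(examples[i])  # ascending order: n is the batch's longest if added
        if batch and (len(batch) + 1) * n > max_tokens:
            batches.append(batch)
            batch = []
        batch.append(i)
    if batch:
        batches.append(batch)
    if shuffle:
        random.shuffle(batches)  # shuffle batch order, keep buckets intact
    return batches

# Quick check: padded size (batch size * longest example) respects the budget.
examples = [[0] * random.randint(5, 60) for _ in range(100)]
for b in token_budget_batches(examples, max_tokens=256):
    assert len(b) * max(len(examples[i]) for i in b) <= 256
```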
-
I ran into an issue training ResNet-50 with MoCo v3. Under the distributed training setting with 16 V100 GPUs (each process has only one GPU, batch size 4096), I get a training loss of about 27.2 in …