-
**Describe the bug**
We were trying to train a MoE model (DeepSpeed experts = 2, expert size 8B) on 2 A100 (40 GB) nodes with ZeRO stage 2.
- Training fails when using RDMA with a model constructed by deepsp…
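For reference, a minimal ZeRO stage 2 setup sketch, assuming `deepspeed.initialize` with a config dict; the batch sizes, optimizer settings, and the toy model are placeholders, not the 8B-expert MoE from this report:

```python
import deepspeed
import torch

# Placeholder ZeRO stage 2 config (not the actual run's settings).
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,            # partition optimizer states and gradients
        "overlap_comm": True,  # overlap reduction with the backward pass
    },
}

model = torch.nn.Linear(1024, 1024)  # stand-in for the MoE model
# Launch with e.g.: deepspeed --num_gpus 2 this_script.py
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```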
-
### Software environment
```Markdown
- paddlepaddle:
- paddlepaddle-gpu: 2.6
- paddlenlp: 2.7.1.post0
```
### Duplicate check
- [X] I have searched the existing issues
### Error description
```Markdown
Under normal circumstances, enabling --amp_m…
-
Is there a doc introducing the usage of distributed parameters like "num_shards, shard_id, run_id, distributed_transport and distributed_interfaces, etc."?
It seems there is not even a terminology explana…
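For what it's worth, `num_shards`/`shard_id` parameters conventionally split a dataset across workers so each one reads a disjoint slice. A generic, hypothetical sketch of that convention (not this library's actual implementation):

```python
# Hypothetical illustration of the usual num_shards / shard_id convention:
# worker shard_id (out of num_shards) sees every num_shards-th sample.
def shard(samples, num_shards, shard_id):
    assert 0 <= shard_id < num_shards
    return samples[shard_id::num_shards]

data = list(range(10))
print(shard(data, num_shards=2, shard_id=0))  # [0, 2, 4, 6, 8]
print(shard(data, num_shards=2, shard_id=1))  # [1, 3, 5, 7, 9]
```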
-
I have created new JSON files according to my requirements:
* `training.json`
* `test.json`
The model trains using `training.json` but gives an error while calculating val_loss using `test.json`.
I …
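A common cause of this is a schema mismatch between the two files. As a first diagnostic (a generic sketch assuming a hypothetical file layout, not tied to this repo's loader), one can check that both files expose the same record keys:

```python
import json

# Sanity check: verify test.json uses the same record schema as
# training.json before debugging the model itself. The "data" key
# is an assumed layout; adapt to the actual file structure.
def record_keys(path):
    with open(path) as f:
        data = json.load(f)
    records = data if isinstance(data, list) else data.get("data", [])
    return {frozenset(r.keys()) for r in records}

print("schema match:", record_keys("training.json") == record_keys("test.json"))
```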
-
### System Info
- `transformers` version: 4.41.2
- Platform: Linux-5.15.0-1044-nvidia-x86_64-with-glibc2.35
- Python version: 3.10.0
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.2…
-
Thank you for your excellent work! I am very interested in it and am currently using multiple GPUs for distributed training. As a beginner, I would like to ask whether it is normal for the number of…
-
I'm working on the C version of the code in preparation for (#40).
With **no** code modifications to llm.c, I observe the following:
- `test_gpt2` works successfully and the loss matches
- `train_g…
-
I was wondering why, in the finetune.py file, you've set update_freq to 24/NUM_GPU.
```python
cmd.append("+optimization.update_freq='[" + str(int(24/NUM_GPU)) + "]'")
```
In the wav2vec Readme …
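My guess (an assumption on my part, not something stated in the README excerpt above) is that update_freq is a gradient-accumulation factor, so this keeps the effective batch size constant: update_freq × NUM_GPU stays at 24 regardless of GPU count. A small sketch of that arithmetic:

```python
# Assumption: effective batch = per_gpu_batch * NUM_GPU * update_freq,
# so update_freq = 24 / NUM_GPU holds NUM_GPU * update_freq at 24
# (for GPU counts that divide 24).
for num_gpu in (1, 2, 4, 8, 24):
    update_freq = int(24 / num_gpu)
    print(num_gpu, update_freq, num_gpu * update_freq)  # last column is 24
```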
-
Hello,
When I start multi-GPU training, I run the following command:
python -m torch.distributed.launch --nproc_per_node=2 train.py --split eigen_zhou --learning_rate 1e-4 --height 320 --width 1024 …
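For context, the launcher spawns one process per GPU and hands each a rank. A minimal sketch of the generic DDP boilerplate train.py is expected to contain on its side (not this repo's actual code):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun (or torch.distributed.launch with --use_env) sets LOCAL_RANK
# and the rendezvous variables read by init_process_group.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 1).cuda(local_rank)  # stand-in for the real model
model = DDP(model, device_ids=[local_rank])
```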
-
Hi, I have a problem fine-tuning sgpt-bloom-7b1-msmarco because of an OOM error. Could you please share how you do contrastive fine-tuning on bloom-7b1? (I think distributed training is needed, but I fa…
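Not the authors' recipe, but for orientation: contrastive fine-tuning in this line of work commonly uses in-batch negatives with an InfoNCE-style cross-entropy objective. A minimal sketch of that loss, assuming paired query/passage embeddings from the model:

```python
import torch
import torch.nn.functional as F

# Sketch of an in-batch-negatives contrastive (InfoNCE) loss; the
# temperature and embedding sizes are illustrative, not the repo's.
def contrastive_loss(q, p, temperature=0.05):
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.t() / temperature                    # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```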