distributed-training Search Results

1000+ results for distributed-training

deepseek-ai/DeepSeek-Coder #79

how to finetune in single gpu

cd finetune && deepspeed finetune_deepseekcoder.py --model_name_or_path $MODEL_PATH --data_path $DATA_PATH --output_dir $OUTPUT_PATH --num_train_epochs 3 --model_max_length 1024 …

sxsxsx updated 5 months ago
1
fundamentalvision/Deformable-DETR #204

The project stack at ' torch.distributed.init_process_group'

I try to run the command for training Deformable DETR on one node with 8 GPUs is as following:

```bash
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_deformable_detr.sh
```

It works.…

RedBlack888 updated 4 months ago
1
NVIDIA/Megatron-LM #209

Pytorch distributed runtime check failure when using pipelin…

Hope this issue post would be helpful to others who suffer from similar problem. I am trying to run `examples/pretrain_gpt_distributed_with_mp.sh`, but when pipeline model parallelism is enabled, the…

insujang updated 1 year ago
1
Sara-Ahmed/SiT #7

single node multi-GPU hangs

Hi, I am running SSL training on a single node with two GPUs. It runs only when --nproc_per_node=1. When I set nproc_per_node=2 it gets stuck after init for the second GPU. init_distributed_mode .…

memphizz updated 2 years ago
4
sankin97/LoGoNet #2

precision on kitti test set

Thank you for sharing this project! I have an issue to ask you: I try to train LoGoNet with all KITTI training data for 80 epochs and submit results to KITTI test set. But the precision of car on moder…

Z-Lee-corder updated 2 months ago
2
openai/improved-diffusion #113

mpiexec running out of memory in multi-GPU

Hello, I am having an issue in using mpiexec to distribute the training. It seems that I can run training on a single GPU using the following parameters: `MODEL_FLAGS="--image_size 256 --num_cha…

jeong-jasonji updated 5 months ago
3
facebookresearch/3detr #50

training on my own data in sunrgbd format raises error

Hello @likethesky , @Celebio , @colesbury , @pdollar , @minqi , thank you for this is amazing work of 3detr. I have built my dataset with sunrgbd format and it already worked with Votenet, but wh…

madinwei updated 2 weeks ago
1
YifanXu74/Evo-ViT #7

Environment issue

Awesome work ! But I can not run the project correctly yet. Please provide me some information, thanks !

King4819 updated 7 months ago
3
3DTopia/OpenLRM #40

ValueError: math domain error

## summary
* error happens when training
* tested on Runpod's A100 SXM 80GB x4 GPUs, 128 vCPU 1006 GB RAM
* runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04

## reproduction of the error …

hayoung-jeremy updated 3 months ago
5
MilesCranmer/PySR #661

[BUG]: EXCEPTION_ACCESS_VIOLATION during garbage collection …

### What happened?
The program crashed while using PySR, with an error message indicating a memory access violation (EXCEPTION_ACCESS_VIOLATION). This error occurred during the garbage collection pro…

zzccchen updated 4 days ago
22
