-
cd finetune && deepspeed finetune_deepseekcoder.py --model_name_or_path $MODEL_PATH --data_path $DATA_PATH --output_dir $OUTPUT_PATH --num_train_epochs 3 --model_max_length 1024 …
-
The command I run to train Deformable DETR on one node with 8 GPUs is as follows:
```bash
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_deformable_detr.sh
```
It works.…
-
I hope this issue will be helpful to others who run into a similar problem.
I am trying to run `examples/pretrain_gpt_distributed_with_mp.sh`, but when pipeline model parallelism is enabled, the…
-
Hi,
I am running SSL training on a single node with two GPUs. It runs only with `--nproc_per_node=1`. When I set `--nproc_per_node=2`, it gets stuck after init for the second GPU.
init_distributed_mode .…
-
Thank you for sharing this project! I have an issue to ask you about:
I try to train LoGoNet with all KITTI training data for 80 epochs and submit results to KITTI test set. But the precision of car on moder…
-
Hello, I am having an issue using `mpiexec` to distribute the training.
It seems that I can run training on a single GPU using the following parameters:
`MODEL_FLAGS="--image_size 256 --num_cha…
-
Hello @likethesky , @Celebio , @colesbury , @pdollar , @minqi ,
thank you for this amazing work on 3detr.
I have built my dataset with sunrgbd format and it already worked with Votenet, but wh…
-
Awesome work! But I cannot run the project correctly yet. Please provide me with some information, thanks!
-
## summary
* error happens when training
* tested on Runpod's A100 SXM 80GB x4 GPUs, 128 vCPU 1006 GB RAM
* runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04
## reproduction of the error
…
-
### What happened?
The program crashed while using PySR, with an error message indicating a memory access violation (EXCEPTION_ACCESS_VIOLATION). This error occurred during the garbage collection pro…