SarahwXU / HiSup

MIT License

Training the model with another setup #16

Open minhvu120201dn opened 1 year ago

minhvu120201dn commented 1 year ago

I have tried training the model with:

- RTX 2080 Super GPU with 8 GB VRAM
- Backbone: HRNetW48-V2
- Number of epochs: 30
- Dataset: AICrowd small

But I only obtained 52.0 AP, while the original paper reports 75.8. Can anyone explain why?

cherubicXN commented 1 year ago

We used all the training data containing 280,741 tiles for the final model. The small version was only utilized for ablation studies.

Best,
Nan

On Jun 21, 2023 at 1:54 AM (+0800), minhvu120201dn @.***> wrote:

I have tried training the model with:

- RTX 2080 Super GPU with 8 GB VRAM
- Backbone: HRNetW48-V2
- Number of epochs: 30
- Dataset: AICrowd small

But I only obtained 52.0 AP, while the original paper reports 75.8. Can anyone explain why?


zem118 commented 1 year ago

Hi author, I'm training on an RTX 2080 Ti with 12 GB of GPU memory, using 20% of the original crowdAI dataset (60,000 training images) and the crowdai-small_hrnet48.yaml config. However, I keep getting a GPU out-of-memory error. Could you tell me what the cause is: is my dataset too large, or is your network too big?

SarahwXU commented 1 year ago

I am not sure what "a video memory overflow" means; we never encountered such an error message during our experiments. Please first make sure you can run the demo and get reasonable predictions. Then I suggest trying the following changes during training. One is to reduce the batch size, which lowers GPU memory usage. Another is to replace HRNet48 with a smaller variant such as HRNet18.
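As a rough illustration of those two changes, here is a hypothetical excerpt of a yacs-style YAML config. The key names below (`SOLVER.IMS_PER_BATCH`, `MODEL.NAME`) are assumptions for the sketch; check the actual keys in the repo's crowdai-small_hrnet48.yaml before editing:

```yaml
# Hypothetical excerpt -- verify the real key names in your
# config/crowdai-small_hrnet48.yaml before applying.
SOLVER:
  IMS_PER_BATCH: 2        # halve (or quarter) the batch size to cut GPU memory
MODEL:
  NAME: "HRNet18"         # swap the HRNet48 backbone for a smaller variant
```

Reducing the batch size trades training speed (and possibly a little accuracy, via noisier gradients) for memory, while swapping the backbone changes the model itself, so results will no longer be directly comparable to the paper's HRNet48 numbers.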

zem118 commented 1 year ago

OK, thanks for the reply, I've solved the problem! It works successfully on a single GPU. My machine now has two GPUs (3080s) and I want to train on both. I used the multi-train.py from your repo for training, but the run gets stuck: it neither reports an error nor continues, and it prints no information about the training process. I don't know why. Do you perform any additional steps when training with multiple GPUs?

XJKunnn commented 1 year ago

Hi, is your GPU's CUDA capability compatible with the current PyTorch version?

zem118 commented 1 year ago

CUDA is compatible with PyTorch, and it runs successfully on a single GPU. It just doesn't run on dual GPUs (both GPUs stay idle).

XJKunnn commented 1 year ago

Could you please share the log while running the training code?

zem118 commented 1 year ago

In the terminal, it gets stuck right after the line `index created!`
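A hang right after dataset loading, before any training output, is often the distributed rendezvous never completing, e.g. because the master port is already in use by another process. This stdlib-only sketch (the helper name is made up here, and 29500 is just torch.distributed's conventional default port; your launch script may export a different `MASTER_PORT`) checks whether the port is free before launching:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # connect_ex returns 0 on a successful connection, i.e. the port is taken
        return s.connect_ex((host, port)) != 0

if __name__ == "__main__":
    print(port_is_free(29500))
```

If the port is taken, either kill the stale process or export a different `MASTER_PORT` before launching; setting `NCCL_DEBUG=INFO` in the environment also makes NCCL print what it is waiting on during initialization.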

XJKunnn commented 1 year ago

I just ran the multi-GPU code and it runs well. Maybe you should check your environment carefully and follow the steps in the README file.

zem118 commented 1 year ago

Okay, thanks for the answer.