callsys / ControlCap

[ECCV 2024] ControlCap: Controllable Region-level Captioning

training problem #3

Closed nicehuster closed 3 months ago

nicehuster commented 3 months ago

I tried training models with different configurations and found that none of them converged. After about 1k iterations, the loss values basically became NaN. What could be the problem? [screenshot attached]
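Not part of the original thread, but a common first debugging step for this symptom is to fail fast on the first non-finite loss so the offending iteration and batch can be inspected. A minimal sketch with a hypothetical helper name, in plain Python for illustration:

```python
import math

def check_loss(loss_value, step):
    # Raise as soon as the loss stops being finite (NaN or Inf),
    # so the run halts at the first bad step instead of silently diverging.
    if not math.isfinite(loss_value):
        raise RuntimeError(f"non-finite loss {loss_value} at step {step}")
    return loss_value
```

In a PyTorch loop the same check would be applied to `loss.item()` before the backward pass.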

callsys commented 3 months ago

What is the configuration of your server? It looks like you're using a small batch size. (We train the model with a batch size of 64, i.e., 8 per GPU on 8 A800 80G GPUs.)

nicehuster commented 3 months ago

> What is the configuration of your server? It looks like you're using a small batch size. (We train the model with a batch size of 64, i.e., 8 per GPU on 8 A800 80G GPUs.)

The models are trained on 8 NVIDIA A100 (80G) GPUs with the model's default configuration.

callsys commented 3 months ago
[two screenshots attached, dated 2024-05-25]

I tested with the command `python -m torch.distributed.run --nproc_per_node=8 --master_port=29600 train.py --cfg-path configs/train/union/vg1.2_refcocog_5e.yaml`. It looks like your batch size is half of mine.
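When the per-GPU batch size cannot be raised to match (e.g. due to memory limits), gradient accumulation is a standard way to recover the same effective batch size. A minimal sketch of the arithmetic (hypothetical helper, not from the ControlCap codebase):

```python
def accumulation_steps(target_batch, per_gpu_batch, num_gpus):
    # Effective batch size = per_gpu_batch * num_gpus * accumulation_steps,
    # so solve for the number of accumulation steps needed to hit the target.
    per_step = per_gpu_batch * num_gpus
    if target_batch % per_step:
        raise ValueError("target batch size must be a multiple of the per-step batch")
    return target_batch // per_step
```

For example, matching a target batch of 64 with 4 samples per GPU on 8 GPUs would require accumulating gradients over 2 steps before each optimizer update.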

nicehuster commented 3 months ago

> [two screenshots attached] I tested with the command `python -m torch.distributed.run --nproc_per_node=8 --master_port=29600 train.py --cfg-path configs/train/union/vg1.2_refcocog_5e.yaml`. It looks like your batch size is half of mine.

I retrained with your command, and the loss is still NaN. [screenshot attached]

callsys commented 3 months ago

Thanks for your feedback; we will test the code again. This afternoon we will open-source another ControlCap-based work (same environment as ControlCap), i.e., DynRefer, which contains all the ControlCap features and requires less memory for higher performance. You can wait for us to retest the code or switch to our new framework. We will let you know when the code is ready.

nicehuster commented 3 months ago

OK, thanks, I look forward to your work!

callsys commented 3 months ago

We cleaned and reran the code. The weight of the tagging loss is set to half of that in the original paper to make training more stable. Here are the checkpoint and log file: ckpts/vg1.2_refcocog_5e.(pth/txt). Besides, you can also use our new framework DynRefer, which contains all the ControlCap features and has better performance. Thanks!
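The fix above amounts to down-weighting one term of a multi-task loss. A minimal sketch of how such a weighted combination typically looks (the function name and the default weight of 0.5 are illustrative, not taken from the ControlCap code):

```python
def combined_loss(caption_loss, tagging_loss, tagging_weight=0.5):
    # Halving the tagging term's weight (relative to an assumed original
    # weight of 1.0) reduces its influence on the gradients, which the
    # authors report makes training more stable.
    return caption_loss + tagging_weight * tagging_loss
```

With real tensors the same expression would be used on PyTorch losses before calling `.backward()`.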

nicehuster commented 3 months ago

> We cleaned and reran the code. The weight of the tagging loss is set to half of that in the original paper to make training more stable. Here are the checkpoint and log file: ckpts/vg1.2_refcocog_5e.(pth/txt). Besides, you can also use our new framework DynRefer, which contains all the ControlCap features and has better performance. Thanks!

OK, thanks.