Closed: 3bsamad closed this issue 1 year ago.
Dear @3bsamad,
thank you for trying out PoET! Can I ask you how big your images are and how many images are in your dataset? I am just curious as to why the training takes so long on one GPU. Unfortunately, I am not an expert on distributed training myself and I cannot help you off the top of my head. However, I will look into this issue and try to educate myself more on this topic.
In the meantime, I kindly refer you to the Deformable-DETR and DETR repositories. PoET essentially builds on top of these two repositories, and they provide some more information and scripts for distributed training. Maybe there are issues similar to yours in those two repositories. I am pretty sure that if you can get either of them to run in distributed mode, it should work for PoET as well.
Please keep me updated if you make any progress, and do not hesitate to open a pull request once you have a solution that provides a fix. I will come back to you once I have time to look into this topic in more detail.
Best, Thomas
I am training on a small subset of my dataset, which is only about 3900 train and 1300 test images. The images are of dimensions 800x400. It doesn't take "so long": one epoch on these takes about 6-9 minutes, but I wanted to make use of my other two GPUs, since training on my full dataset would be very slow. Thanks for your help, I will look more into this and try to find a solution.
@tgjantos I have a preliminary working solution from the DETR repo. I integrated it into this repo and I can now train on my 3 GPUs. The only problem is CUDA out of memory, but this probably has to do with the model itself / data size. I'm looking into it right now; I can make a pull request if you want :)
Update: everything works smoothly now, it was only a small error.
@3bsamad sounds awesome! Definitely make a pull request, would be happy to integrate it into the repo!
Best, Thomas
Closed with #11
First of all, thanks for the great work!
I am using the provided Docker image, and currently I am trying to run distributed training, since training on only one GPU is slow. I have 3 GTX 1080s (IDs 0, 1, 2). I added the following args to `get_args_parser()` in `main.py`, since I couldn't find `args.distributed`:

Then, in `util/misc.py`, in `init_distributed_mode(args)`, I added the following:

Everything works fine when I start training up until this point:
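For reference, DETR and Deformable-DETR decide whether to run in distributed mode based on environment variables set by the launcher, rather than on a hand-set flag. Below is a minimal, stdlib-only sketch of that pattern; the env var names `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` are the ones `torch.distributed.launch --use_env` exports, while the `DistConfig` helper is hypothetical and not part of the repo:

```python
import os
from dataclasses import dataclass


@dataclass
class DistConfig:
    # Hypothetical container mirroring the fields that
    # init_distributed_mode() sets on args in util/misc.py.
    distributed: bool
    rank: int
    world_size: int
    gpu: int  # local device index on this node


def parse_dist_env() -> DistConfig:
    """Read the env vars the distributed launcher sets.

    Falls back to single-process mode when they are absent,
    which is also how the DETR-style init behaves.
    """
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        return DistConfig(
            distributed=True,
            rank=int(os.environ["RANK"]),
            world_size=int(os.environ["WORLD_SIZE"]),
            gpu=int(os.environ.get("LOCAL_RANK", 0)),
        )
    return DistConfig(distributed=False, rank=0, world_size=1, gpu=0)
```

The point of this pattern is that the training script itself never hard-codes the number of GPUs; each spawned process discovers its own rank from its environment.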
Where it gets stuck, showing this in the terminal:

```
distributed init (rank 0): env://
```
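A hang at this line usually means the `env://` rendezvous is still waiting for the remaining ranks to join: `init_process_group` was told to expect several processes, but only one was ever started. With the launcher that the DETR-family repos use, one process per GPU is spawned along these lines (a sketch; the exact arguments PoET's `main.py` expects are assumed, not confirmed):

```shell
# Spawns 3 processes (one per GPU); each gets RANK, WORLD_SIZE and
# LOCAL_RANK in its environment, which init_distributed_mode() reads.
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py
```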
until I kill the process.

I have tried distributed training in Docker before, using this simple example script:
I am a bit new to implementing distributed training, and was wondering what might be wrong/missing here. Any help would be appreciated!