Hi, how should I train the model with multiple GPUs on a single machine? Following the nn.DataParallel approach, I get an error about a tensor dimension mismatch (xx.view) in the AttFusion class in self_attn.py. Please help me.
Hi, training on multiple GPUs requires the torch.distributed functionality. I have made it work on my local machine, and I will release that version in the near future.
Thanks. How should I modify the training script if I want to implement this right now? Are modifications to the training script all that is needed?
Yes, you only need to make modifications to train.py. There is nothing else you need to modify.
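For context, a minimal sketch of the kind of train.py changes this refers to; the names `model`, `train_dataset`, and the `local_rank` handling are placeholders, not the actual OpenCOOD code:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def wrap_for_distributed(model, train_dataset, local_rank, batch_size=2):
    # one process per GPU; torch.distributed.launch provides the local rank
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(local_rank)

    model = model.cuda(local_rank)
    # synchronize gradients across GPUs after every backward pass
    model = DDP(model, device_ids=[local_rank])

    # give each process a distinct shard of the dataset
    sampler = DistributedSampler(train_dataset)
    train_loader = DataLoader(train_dataset, batch_size=batch_size,
                              sampler=sampler, num_workers=4)
    return model, train_loader, sampler
```

Remember to call `sampler.set_epoch(epoch)` at the start of every epoch so the shuffling differs between epochs.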
Also, do I need to change anything about model saving when adopting the distributed training? Another concern is how to create my own V2V data based on CARLA+SUMO co-simulation. Could you please share the general pipeline of data generation? Many thanks!
No, you don't need to. For the question related to creating your own dataset, it is quite beyond the scope of this repo.
Thanks. Do you mean that the data generation process should refer to https://github.com/ucla-mobility/OpenCDA?
Yes
Thanks. I added distributed training to train.py following the PyTorch DistributedDataParallel approach. However, the model fails to obtain a valid AP result (AP@50=0.0 and AP@70=0.0) on the test data. What could cause this problem, in your experience?
Try one thing: when you load the model from a checkpoint, map it to CPU first.
Thanks, but that doesn't seem to change anything. I only added the distributed functionality without any other changes. Should the BN layers be replaced with SyncBN?
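For reference, if you do want to try SyncBN, PyTorch can convert the existing BatchNorm layers in place; a minimal sketch (apply it before wrapping the model with DistributedDataParallel):

```python
import torch

# replaces every nn.BatchNorm* layer with SyncBatchNorm, which aggregates
# batch statistics across all processes in the default process group
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```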
If you continue training from a checkpoint, is the loss normal?
When I train from the checkpoint you provided, the command and error information are as follows:

```
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 opencood/tools/train.py --hypes_yaml opencood/hypes_yaml/point_pillar_intermediate_fusion.yaml --model_dir models/pointpillar_attentive_fusion/pointpillar_attentive_fusion

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
INFO - 2022-07-09 09:32:40,141 - distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 1
INFO - 2022-07-09 09:32:40,149 - distributed_c10d - Added key: store_based_barrier_key:1 to store for rank: 0
Dataset Building
Dataset Building
Creating Model
Creating Model
Traceback (most recent call last):
  File "opencood/tools/train.py", line 207, in
```
There are two things you can try. First, change "latest.pth" to net_epoch30.pth (any number should work). Second, when you load the checkpoint, load it to CPU first, and remember to set the strict flag to False.
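In code, that suggestion would look roughly like this (the checkpoint filename is only an example):

```python
import torch

# load the weights onto the CPU first so every process can read them safely,
# then move the model to its own GPU afterwards
state_dict = torch.load('net_epoch30.pth', map_location='cpu')

# strict=False ignores keys that do not match (e.g. the extra 'module.'
# prefix added by DDP); it avoids the hard error but can also skip weights,
# so check the returned missing/unexpected key lists
missing, unexpected = model.load_state_dict(state_dict, strict=False)
```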
Thanks, I have solved this problem by replacing `torch.save(model.state_dict(), ...)` with `torch.save(model.module.state_dict(), ...)` (see the sketch after this comment). Another concern is how to set the correct batch size and learning rate. I see the default values in the config file are bs=2 and lr=0.002:
- For the single-GPU case, should the learning rate be increased linearly with the batch size, i.e., bs=8 -> lr=0.008?
- For the multi-GPU case (e.g., 2 GPUs) with distributed training, should the batch size be counted across all GPUs, i.e., bs=8×2=16 -> lr=0.008×2=0.016? By the way, how do I determine the optimal number of training epochs? I don't see the related training parameters in the config file from the checkpoint dir. Thanks again!
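For the record, a sketch of the saving fix mentioned above (the filename is just an example):

```python
import torch

# DDP wraps the original network in model.module; saving model.state_dict()
# directly would prefix every key with 'module.' and break plain loading later
to_save = model.module.state_dict() if hasattr(model, 'module') else model.state_dict()
torch.save(to_save, 'net_epoch30.pth')
```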
And I find that, at the same epoch, the test AP achieved by the distributed model is clearly inferior to that of the single-GPU one. Is that expected? Are there any tricks to alleviate this performance gap?
Regarding your question about the performance gap between single- and multi-GPU training, I guess you may have some overfitting during multi-GPU training. One epoch in 2-GPU training is roughly equivalent to two epochs in single-GPU training, so check your TensorBoard to see whether overfitting is happening.
Many thanks, but I am still confused about what you mean. For the first question, I meant whether the learning rate should increase with the batch size, not with the number of GPUs. One attempt I made was using a batch size of 8 with the same 0.002 initial learning rate in the single-GPU setting, and the test AP was much worse than with bs=4 and lr=0.002.
I don't think the LR needs to increase with the batch size. As I mentioned, the worse results may come from overfitting; you need to do early stopping.
Thanks. I found the model converges within 13-15 epochs; is that normal?
Is it on multi-GPUs?
Both on single and multiple GPUs; the number of training epochs is set to 15 with batch size 2 and a 0.002 initial learning rate. The model at epoch 13 or 15 gives the best validation and test AP. I also increased the number of epochs to 30 under the same settings, but the detection performance on the validation and test splits is substantially inferior to the 15-epoch model. Is that strange?
What model are you using for training?
I train the model from scratch.
Yeah, but which model did you train? Your own model or one I provided?
I trained the PointPillar model with intermediate fusion from scratch.
I have no idea about this for now. I suggest changing to an annealing learning-rate strategy and seeing whether it gets better.
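One common way to set up such an annealing schedule in PyTorch is cosine annealing; a minimal sketch, where the optimizer choice, epoch count, and learning rates are just placeholders:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
# decay the learning rate from 0.002 towards eta_min over 15 epochs
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=15, eta_min=1e-5)

for epoch in range(15):
    # ... run one training epoch ...
    scheduler.step()
```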
Also, does the opv2v dataset provide the transformation information between lidar and camera (i.e. lidar->camera and camera->lidar)?
Yes, we do have the api. We plan to release it soon (probably next month)
Thanks.