aharley / simple_bev

A Simple Baseline for BEV Perception
MIT License

issue in bevformer2 #39

Open Abyss-J opened 1 year ago

Abyss-J commented 1 year ago

When I tried to train bevformer2 on two 3090 GPUs, training failed with the error ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 26301). The error does not occur on every run, but it occurs often. I noticed that a comment in the code already says that using multi-scale features will not work. After checking the code, I found a problem with the parameters of VanillaSelfAttention and SpatialCrossAttention: when multi-scale features are used, n_levels needs to be set to the number of feature scales (3) to solve the problem.
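
To illustrate why n_levels has to match the number of feature scales, here is a minimal sketch. It is not code from this repo. It assumes the attention modules follow the Deformable DETR convention, where the feature maps of all levels are flattened and concatenated, and the module locates each level through per-level shapes and start offsets. The helper names level_layout and check_n_levels and the feature-map sizes are hypothetical.

```python
from itertools import accumulate


def level_layout(spatial_shapes):
    """Return (level_start_index, total_tokens) for flattened multi-scale maps.

    spatial_shapes is a list of (height, width) pairs, one per feature level.
    """
    sizes = [h * w for h, w in spatial_shapes]
    # Each level starts where the previous levels end in the flattened tensor.
    starts = [0] + list(accumulate(sizes))[:-1]
    return starts, sum(sizes)


def check_n_levels(n_levels, spatial_shapes):
    """Fail early if the module's n_levels disagrees with the input features."""
    if n_levels != len(spatial_shapes):
        raise ValueError(
            f"n_levels={n_levels}, but {len(spatial_shapes)} feature levels were given"
        )
    return level_layout(spatial_shapes)


# Made-up sizes for three feature scales (assumption, not taken from the repo).
shapes = [(56, 100), (28, 50), (14, 25)]
starts, total = check_n_levels(3, shapes)
print(starts, total)  # [0, 5600, 7000] 7350
```

With n_levels left at 1 while three levels are passed in, a check like this raises immediately instead of letting the attention kernel index the flattened features with the wrong layout.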

aharley commented 1 year ago

Hey, thanks! This is very useful. I will try this later and update the repo. I don't know about that distributed issue -- I haven't encountered that myself...
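
On the distributed error itself: a negative worker exit code conventionally means the process was terminated by the signal with that number, so it can be decoded with the standard library. The sketch below only names the signal and does not identify the cause. The helper name describe_exitcode is hypothetical, and the signal numbering assumes a POSIX system such as Linux.

```python
import signal


def describe_exitcode(exitcode):
    """Translate a worker exit code into a readable description."""
    if exitcode < 0:
        # A negative value -N means the process was terminated by signal N.
        return signal.Signals(-exitcode).name
    return f"exit status {exitcode}"


print(describe_exitcode(-6))
```

On Linux, signal 6 is SIGABRT, which is raised when a process calls abort(), for example after a failed assertion in native code.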

chg0901 commented 11 months ago

When I tried to train bevformer2 on two 3090 GPUs, training failed with the error ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 26301). The error does not occur on every run, but it occurs often. I noticed that a comment in the code already says that using multi-scale features will not work. After checking the code, I found a problem with the parameters of VanillaSelfAttention and SpatialCrossAttention: when multi-scale features are used, n_levels needs to be set to the number of feature scales (3) to solve the problem.

Could you please check this issue?

train_nuscenes.py uses Segnet. I changed the import to from nets.bevformernet2 import Bevformernet and also changed the model in train_nuscenes.py accordingly.

However, I get an error when execution reaches bevformernet2.py#L497.

Could you please share how you ran or modified the code?

Best regards and thank you very much!

@Abyss-J Could you please help me with this?

Abyss-J commented 11 months ago

Well, I only read through the bevformernet2 code because I wanted to reuse parts of it; I didn't reproduce the author's experiment. If the error is raised in the encoder section, you may need to check the input. I hope this helps.

chg0901 commented 11 months ago

@Abyss-J Thank you for your reply. Could you please share how you tried to run the bevformernet2 code and how you configured it in Python? I would like to compare against my setup. So far I can only run the bevformernet experiment.