Haiyang-W / UniTR

[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation"
https://arxiv.org/abs/2308.07732
Apache License 2.0

I meet a cache-related bug #5

Closed RookieXwc closed 11 months ago

RookieXwc commented 11 months ago

Hello Haiyang, thank you for your outstanding work on DSVT and UniTR. I can run unitr+lss.yaml normally, but when I run unitr+lss_caching.yaml, the network computes the loss and then raises:

> Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad()

This error always occurs in the second iteration of the first epoch. Have you encountered this bug before? It seems to be related to torch.cuda.amp.GradScaler. My GPU is a GTX 1080 Ti; could it be that it does not support torch.cuda.amp?
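This kind of error can be reproduced in isolation; a minimal standalone sketch (not UniTR code) of how caching a tensor across iterations triggers it:

```python
import torch

# A tensor cached in iteration 1 stays attached to iteration 1's autograd
# graph. backward() frees that graph, so reusing the cached tensor in
# iteration 2 fails with the "backward through the graph a second time" error.
w = torch.ones(3, requires_grad=True)
cache = {}

def step():
    if "mapping" not in cache:
        cache["mapping"] = w * 2  # built inside iteration 1's graph
    return (cache["mapping"] * w).sum()

step().backward()  # iteration 1: works

msg = ""
try:
    step().backward()  # iteration 2: RuntimeError
except RuntimeError as err:
    msg = str(err)
print("Trying to backward through the graph a second time" in msg)  # True
```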

RookieXwc commented 11 months ago

I noticed that only unitr_cache.yaml is used for testing in the README. Can you explain the usage in detail, and the meaning of this sentence: "You can even cache all mapping calculations during the training phase, which can significantly accelerate your training speed"? Thank you very much.

nnnth commented 11 months ago

Sorry for not explaining clearly. "Mapping calculations" refers specifically to the image2lidar and lidar2image modules in UniTR, and does not include the caching of LSS. It can be enabled by setting ACCELERATE to True in the mm_backbone part of the config. Currently, the LSS caching method cannot be used during training, because we did not detach the cached values with torch.no_grad: the computation graph of the cached variables from the first iteration is freed after backpropagation, which causes an error in the second training iteration. In addition, UniTR does not support fp16 (mixed-precision training), which causes NaN in the loss calculation.
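The fix described here (computing the cache under torch.no_grad so it carries no autograd history) can be sketched as follows; the class and the cached computation are illustrative stand-ins, not UniTR's actual API:

```python
import torch

class CachedMapping(torch.nn.Module):
    """Illustrative sketch: cache an expensive computation once and reuse it.

    Computing the cached value inside torch.no_grad() detaches it from the
    autograd graph, so reusing it in later iterations no longer triggers
    'Trying to backward through the graph a second time'.
    """

    def __init__(self):
        super().__init__()
        self._cached = None

    def forward(self, feats):
        if self._cached is None:
            with torch.no_grad():  # cache must carry no graph history
                self._cached = feats * 0.5  # stand-in for a mapping computation
        return feats + self._cached

# Reusing the cache across two backward passes now works:
m = CachedMapping()
feats = torch.ones(5, 4, requires_grad=True)
m(feats).sum().backward()
m(feats).sum().backward()  # no error: the cached value is graph-free
```

Removing the `torch.no_grad()` context reproduces the error from the first comment, since `self._cached` would then stay attached to the first iteration's freed graph.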

Haiyang-W commented 11 months ago

Sorry for this mistake. We will fix this bug as soon as possible. Now the cache mode only supports the test stage.

Transfusion has problems with fp16 training, so try not to use it.

RookieXwc commented 11 months ago

Thank you for your timely response. My understanding is to use unitr+lss.yaml in training and unitr+lss_cache.yaml in testing. Is that correct?

Haiyang-W commented 11 months ago

> Thank you for your timely response. My understanding is to use unitr+lss.yaml in training and unitr+lss_cache.yaml in testing. Is that correct?

Yeah, there's no problem with that.

Haiyang-W commented 11 months ago

Please refer to issue #3; someone has reproduced our codebase, and the code seems to be fine. Please feel free to contact me if you have any questions.

RookieXwc commented 11 months ago

When I train with unitr+lss.yaml, can I get the acceleration during evaluation after completing 10 epochs? I think it should not be possible, because ACCELERATE is not set to True during network initialization; acceleration can only be achieved when testing with unitr+lss_cache.yaml.

Haiyang-W commented 11 months ago

> When I train using unitr+lss.yaml, can I achieve acceleration during evaluation after completing 10 epochs? I think it should not be possible, right? Because ACCELERATE is not specified as True during network initialization. Acceleration can only be achieved when testing with unitr+lss_cache.yaml.

Yes, you can evaluate a checkpoint trained with unitr+lss.yaml directly using unitr+lss_cache.yaml. The ACCELERATE option has nothing to do with network initialization or training. You can try this with our provided checkpoint; we also report the expected performance in the cache-testing section (about a 0.7-point drop).

RookieXwc commented 11 months ago

Thank you for your timely response! I have no further questions.

Haiyang-W commented 11 months ago

> Thank you for your timely response! I have no further questions.

Great! Wish you all the best. :)

Haiyang