RookieXwc closed this issue 11 months ago
I noticed that only unitr_cache.yaml is used for testing in the README. Can you explain in detail the usage and meaning of this sentence: 'You can even cache all mapping calculations during the training phase, which can significantly accelerate your training speed'? Thank you very much.
Sorry for not explaining clearly. "Mapping calculations" specifically refers to the image2lidar and lidar2image modules in UniTR, and does not include the cache calculation of LSS. This can be achieved by simply setting ACCELERATE to True in the mm_backbone part of the config.
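For reference, the change amounts to a one-line switch in the model config. This is a hypothetical sketch; the exact key names and nesting may differ from the actual UniTR yaml files, so check the mm_backbone section of your own config:

```yaml
# Sketch only -- verify key names against your unitr*.yaml
MODEL:
    MM_BACKBONE:
        NAME: UniTR
        ACCELERATE: True   # cache image2lidar / lidar2image mapping calculations
```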
Currently, the caching method of LSS cannot be used during training, because we did not disable gradient tracking with torch.no_grad during the cache calculation. This causes an error in the second training iteration: the computation graph of the cached variable from the first iteration has already been freed after backpropagation.
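The failure mode can be reproduced in a few lines. This is a minimal sketch, not UniTR's actual code: a tensor cached across iterations without torch.no_grad still references the first iteration's computation graph, which backward() frees, so the second iteration's backward pass raises a RuntimeError.

```python
import torch

w = torch.ones(3, requires_grad=True)

def run_two_iters(use_no_grad):
    """Cache an intermediate tensor in iteration 1, reuse it in iteration 2."""
    cache = {}
    for _ in range(2):
        x = torch.ones(3, requires_grad=True)
        if "mapping" not in cache:
            if use_no_grad:
                with torch.no_grad():          # cached tensor carries no graph
                    cache["mapping"] = x * 2.0
            else:
                cache["mapping"] = x * 2.0     # cached tensor keeps its grad_fn
        loss = (w * cache["mapping"]).sum()
        loss.backward()                        # frees the graph it just walked
    return True

# Without no_grad, the second backward tries to reuse the freed graph.
try:
    run_two_iters(use_no_grad=False)
    second_iter_ok = True
except RuntimeError:
    second_iter_ok = False

# Detaching the cache with torch.no_grad makes the second iteration work.
cached_ok = run_two_iters(use_no_grad=True)
```

In the no_grad branch the cached tensor has no grad_fn, so backpropagation simply stops there, which is exactly what a precomputed mapping needs.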
In addition, UniTR does not support fp16 (mixed precision training), which causes NaN in the loss calculation. Sorry for this mistake; we will fix this bug as soon as possible. For now, the cache mode only supports the test stage. TransFusion has known problems with fp16 training, so try not to use it.
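While debugging NaN losses under mixed precision, a common generic guard (a sketch, not UniTR code) is to skip the optimizer step whenever the loss is not finite, so a single bad batch does not corrupt the weights:

```python
import torch

def safe_step(loss, optimizer):
    """Backprop and step only when the loss is finite; skip the batch otherwise."""
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False          # caller can log and skip this batch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return True

# Tiny CPU usage example.
w = torch.ones(1, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
stepped_ok = safe_step((w * 2.0).sum(), opt)            # finite loss -> steps
skipped = not safe_step(w.sum() * float("nan"), opt)    # NaN loss -> skipped
```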
Thank you for your timely response. My understanding is to use unitr+lss.yaml in training and unitr+lss_cache.yaml in testing. Is that correct?
Yeah, there's no problem with that.
Please refer to issue #3; someone has reproduced our codebase, and the code seems to be fine. Please feel free to contact me if you have any questions.
When I train using unitr+lss.yaml, can I achieve acceleration during evaluation after completing 10 epochs? I think it should not be possible, right? Because ACCELERATE is not specified as True during network initialization, acceleration can only be achieved when testing with unitr+lss_cache.yaml.
Yes, you can evaluate a checkpoint trained with unitr+lss.yaml directly using unitr+lss_cache.yaml; the ACCELERATE option has nothing to do with network initialization or training. You can try this with our provided checkpoint. We also report the expected performance in the cache testing section (about a 0.7-point drop).
Thank you for your timely response! I have no further questions.
Great! Wish you all the best. :)
Haiyang
Hello Haiyang, thank you for your outstanding work on DSVT and UniTR. I can run unitr+lss.yaml normally, but when I run unitr+lss_caching.yaml, the network is able to calculate the loss, and then an error occurs:
This error always occurs in the second iteration of the first epoch. Have you ever encountered this bug? It seems to be related to torch.cuda.amp.GradScaler. My GPU is a GTX 1080 Ti; is it because it cannot support torch.cuda.amp?