jytime opened this issue 8 months ago
This seems to happen because accelerate is not set up correctly, which makes data loading roughly 10x slower. I'm posting the issue here in case someone else runs into the same problem. The sec/it number in the log indicates the time taken for each training step; it should be within 1-3 seconds. If a training step takes longer than that, something is usually wrong.
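A quick way to check that data loading (rather than the model) is the bottleneck is to time the wait for each batch separately from the forward/backward pass. A minimal sketch, not taken from this repo (the model/optimizer and the loss returned by the forward call are placeholders):

```python
import time
import torch

def timed_epoch(model, loader, optimizer):
    end = time.time()
    for batch in loader:
        data_time = time.time() - end        # time spent waiting on the dataloader
        loss = model(batch)                  # placeholder forward pass returning a loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()             # wait for the GPU so step timing is accurate
        step_time = time.time() - end
        print(f"data: {data_time:.2f}s  step: {step_time:.2f}s")
        end = time.time()
```

If `data` dominates `step`, the slowdown is on the dataloader side rather than the model.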
If you run into this problem, the simplest solution may be to use PyTorch's own distributed training and remove accelerate / the accelerator from our training code, roughly as sketched below.
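A minimal sketch of plain PyTorch DDP as a drop-in replacement for accelerate; `build_model`, `build_dataset`, and `train_one_epoch` are placeholders, not functions from this repo, and the batch size / learning rate are just examples:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)              # placeholder
    model = DDP(model, device_ids=[local_rank])

    dataset = build_dataset()                           # placeholder
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler,
                        num_workers=8, pin_memory=True, persistent_workers=True)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(100):
        sampler.set_epoch(epoch)                        # reshuffle across ranks each epoch
        train_one_epoch(model, loader, optimizer)       # placeholder

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with e.g. `torchrun --nproc_per_node=8 train.py`.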
Hi, I recently tried to reproduce this work. Training on the 41 CO3D categories, the highest racc_15 on the training set is 0.93, tacc_15 is close to 0.8, and the speed is 0.8 sec/it. Is this result normal?
Hi @sungh66, the result looks good. In my own logs, tacc_15 during training is slightly higher, close to 0.9. But it should be fine as long as the testing result is consistent, because accuracy during training is highly affected by the degree of data augmentation.
Hi @jytime, does the normal inference time include the time to load the SuperGlue model and to extract and match features? When I run inference on 200 images at a time, this part takes close to 40 minutes, which is too long. Is it possible to load the model only once and then run inference on different videos?
Hey, you could try LightGlue instead of SuperGlue, e.g. by changing the matcher config to:

matcher_conf = match_features.confs["superpoint+lightglue"]

It should give basically the same results while being 2x or 3x faster.
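On loading the model only once: if you use the standalone LightGlue package directly (rather than going through the hloc confs above), the extractor and matcher are instantiated once and can then be reused across all image pairs and videos, so the model-loading cost is paid a single time. A rough sketch following the LightGlue README (the image paths are placeholders, and this is not our repo's matching pipeline):

```python
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

# load the models once; reuse them for every pair / every video
extractor = SuperPoint(max_num_keypoints=2048).eval().cuda()
matcher = LightGlue(features="superpoint").eval().cuda()

image0 = load_image("path/to/image_0.jpg").cuda()
image1 = load_image("path/to/image_1.jpg").cuda()

feats0 = extractor.extract(image0)          # extract local features
feats1 = extractor.extract(image1)
matches01 = matcher({"image0": feats0, "image1": feats1})

# remove the batch dimension, then index the matched keypoints
feats0, feats1, matches01 = [rbd(x) for x in [feats0, feats1, matches01]]
matches = matches01["matches"]              # (K, 2) indices into the two keypoint sets
points0 = feats0["keypoints"][matches[..., 0]]
points1 = feats1["keypoints"][matches[..., 1]]
```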
I happened to find that the released training code seems to be much slower than the original (internal) implementation when training on 8 GPUs. Single-GPU training does not seem to suffer from this. Marking it here; I will look into it later.