facebookresearch / PoseDiffusion

[ICCV 2023] PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

Training Speed #33

Open jytime opened 8 months ago

jytime commented 8 months ago

I found that the released training code seems to be much slower than the original (internal) implementation when training on 8 GPUs. Single-GPU training does not seem to suffer from this. Marking it here; I will dig into it later.

jytime commented 8 months ago

This appears to happen because accelerate is not set up correctly, which makes data loading about 10x slower. I am leaving this issue here in case someone else runs into the problem. The sec/it value in the log is the time taken by each training step. It should be within 1-3 seconds. If a training step takes longer than that, something is usually wrong.
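To check whether a high sec/it comes from data loading rather than from the training step itself, the two can be timed separately. The following is a minimal sketch, not part of the PoseDiffusion code: `profile_steps`, `batches`, and `step_fn` are hypothetical names, where `batches` stands in for the training dataloader and `step_fn` for one forward/backward/optimizer step.

```python
import time


def profile_steps(batches, step_fn, num_steps=20):
    """Average seconds per iteration spent waiting for data vs. running the step.

    `batches` is any iterable (e.g. a dataloader); `step_fn` is a callable that
    runs one training step on a batch. Both are placeholders for illustration.
    On GPU, `step_fn` should synchronize the device before returning,
    otherwise asynchronous kernels make the step look faster than it is.
    """
    data_time = 0.0
    step_time = 0.0
    count = 0
    iterator = iter(batches)
    while count < num_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data-loading wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # time spent here is the training step
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
        count += 1
    if count == 0:
        return {"steps": 0, "data_sec_per_it": 0.0, "step_sec_per_it": 0.0}
    return {
        "steps": count,
        "data_sec_per_it": data_time / count,
        "step_sec_per_it": step_time / count,
    }
```

If `data_sec_per_it` dominates on 8 GPUs but not on a single GPU, that is consistent with the data-loading slowdown described above.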

If you run into this problem, the simplest solution may be to use PyTorch's own distributed training and remove accelerate/accelerator from our training code.
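A training loop built on PyTorch's own distributed tools could look roughly like the sketch below. This is only an illustration under assumptions, not the actual PoseDiffusion trainer: `setup_distributed` and `train` are hypothetical names, the MSE loss and the `(input, target)` dataset are toy stand-ins, and the process is assumed to be launched with `torchrun`, which sets `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR`, and `MASTER_PORT`.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def setup_distributed():
    """Join the process group using the environment variables set by torchrun."""
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        return torch.device("cuda", local_rank)
    return torch.device("cpu")


def train(model, dataset, epochs=1, batch_size=4, lr=1e-3):
    """Toy DDP loop; the loss and hyperparameters are placeholders."""
    device = setup_distributed()
    model = model.to(device)
    # DDP averages gradients across ranks during backward().
    ddp_model = DDP(model, device_ids=[device.index] if device.type == "cuda" else None)
    # The sampler gives each rank its own shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=lr)
    last_loss = float("nan")
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle differently every epoch
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            last_loss = loss.item()
    dist.destroy_process_group()
    return last_loss
```

On one 8-GPU machine, a script that calls `train(...)` would be started with something like `torchrun --nproc_per_node=8 your_train_script.py`, where the script name is a placeholder.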

sungh66 commented 8 months ago

Hi, I recently reproduced this paper. On 41 classes of the CO3D data, the highest racc_15 on the training set is 0.93, the tacc_15 is close to 0.8, and the speed is 0.8 sec/it. Is this result normal?

jytime commented 8 months ago

Hi @sungh66, the result looks good. In my own logs, the tacc_15 during training is slightly higher, close to 0.9. But it should be fine as long as the testing result is consistent, because the accuracy during training is strongly affected by the degree of data augmentation.

sungh66 commented 6 months ago

Hi @jytime, does the normal inference time have to include the time to load the SuperGlue models and the time to extract and match features? I run inference on 200 images at a time, and this part takes close to 40 minutes, which is too long. Is it possible to load the model only once and run inference on different videos?

jytime commented 6 months ago

Hey, you could try LightGlue instead of SuperGlue, as here:

https://github.com/facebookresearch/PoseDiffusion/blob/41d1cf89dc9fa8bfa134ae511bffcab84094dd83/pose_diffusion/util/match_extraction.py#L92

matcher_conf = match_features.confs["superpoint+lightglue"]

It should give basically similar results while being 2x or 3x faster.
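On the question of loading the model only once for several videos, a generic caching pattern may help. The sketch below is an assumption-laden illustration, not how `match_extraction.py` works: `ModelCache` and the `loader` callable are hypothetical, and `loader` stands in for whatever function actually builds the matcher from a config name.

```python
class ModelCache:
    """Load each model once and reuse it across calls.

    `loader` is a hypothetical callable that builds a model from a name;
    it is a placeholder for the real (expensive) model construction.
    """

    def __init__(self, loader):
        self._loader = loader
        self._models = {}

    def get(self, name):
        # Only call the expensive loader the first time a name is requested.
        if name not in self._models:
            self._models[name] = self._loader(name)
        return self._models[name]


def process_videos(videos, cache, conf_name="superpoint+lightglue"):
    """Run a (placeholder) matching step on each video with a shared model."""
    results = []
    for video in videos:
        matcher = cache.get(conf_name)  # loaded on the first video only
        results.append(matcher(video))
    return results
```

With this pattern the loading cost is paid once per process; the feature extraction and matching time per video remains.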