fabro66 / GAST-Net-3DPoseEstimation

A Graph Attention Spatio-temporal Convolutional Network for 3D Human Pose Estimation in Video (GAST-Net)
MIT License

What's the GPUs setup for 243-frame model? #4

Closed: luzzou closed this issue 4 years ago

luzzou commented 4 years ago

Hi~ I'm wondering about the GPU setup for training the 243-frame model, and how long the training takes.

luzzou commented 4 years ago

By the way, it seems that the model configuration depends on the number of input frames: the parameter count of the released code's T=81 model is ~28M, which is inconsistent with your paper (version #1, Table 2: 7.05M). There doesn't seem to be much difference between the two paper versions, since the quantitative results in Table 1 (version #1) and Table 2 (version #2) are the same.

In addition, have you tested the generalization ability of the network?

Looking forward to your reply.

fabro66 commented 4 years ago

> Hi~ I'm wondering about the GPU setup for training the 243-frame model, and how long the training takes.

Hi~ Thank you for your interest in our work. We spent about two days training the 243-frame model on two Titan RTX GPUs. It saturates easily, though, so training can be stopped after about 30 iterations. Running a receptive field of this size under multi-head global attention adds a lot of computing time. Our latest work replaces the multi-head global attention with single-head attention and adds multi-scale temporal information. The new model is faster, has fewer parameters, and its accuracy is also improved.
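
(For readers skimming the thread: a minimal PyTorch sketch of what global temporal attention over a 243-frame window looks like, contrasting a multi-head layer with a single-head one. This is an illustrative stand-in, not the layer from this repo; the channel width `C = 128` and the 8-head count are assumptions.)

```python
import torch
import torch.nn as nn

# Illustrative stand-in, not the repo's actual attention layer.
T, C = 243, 128                   # receptive field and channel width (assumed)
x = torch.randn(1, T, C)          # (batch, frames, channels)

multi_head = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
single_head = nn.MultiheadAttention(embed_dim=C, num_heads=1, batch_first=True)

# Every head computes a 243x243 attention map over the whole window,
# so 8 heads compute 8 such maps per layer; a single head computes one.
y_multi, _ = multi_head(x, x, x)
y_single, _ = single_head(x, x, x)
print(y_multi.shape, y_single.shape)  # both torch.Size([1, 243, 128])
```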

luzzou commented 4 years ago

> Hi~ Thank you for your interest in our work. We spent about two days training the 243-frame model on two Titan RTX GPUs. It saturates easily, though, so training can be stopped after about 30 iterations. Running a receptive field of this size under multi-head global attention adds a lot of computing time. Our latest work replaces the multi-head global attention with single-head attention and adds multi-scale temporal information. The new model is faster, has fewer parameters, and its accuracy is also improved.

Thank you for your reply, it helps me a lot! The new work sounds interesting; will it be released later? BTW, the arXiv paper "Motion Guided 3D Pose Estimation from Videos" also contributes multi-scale temporal modeling. Is there any difference? Thanks.

fabro66 commented 4 years ago

> By the way, it seems that the model configuration depends on the number of input frames: the parameter count of the released code's T=81 model is ~28M, which is inconsistent with your paper (version #1, Table 2: 7.05M). There doesn't seem to be much difference between the two paper versions, since the quantitative results in Table 1 (version #1) and Table 2 (version #2) are the same.
>
> In addition, have you tested the generalization ability of the network?

Q1: Model parameters. To make the model lightweight, for the networks with receptive fields of 9 and 27 we set the number of output channels of the first dilated convolutional layer to 128, while for receptive fields of 81 and 243 it is set to 64 and 32 channels respectively (Section 4, experiment implementation details):

| Model | First-layer output channels | Parameters |
| --- | --- | --- |
| 9-frame | 128 | 1.62M |
| 27-frame | 128 | 6.92M |
| 81-frame | 64 | 7.05M |
| 243-frame | 32 | 7.09M |
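
(A quick way to verify the counts in this table is to sum the trainable parameters of whatever configuration you build. This is a generic PyTorch sketch; the `nn.Linear` at the end is just a runnable stand-in for the GAST-Net model constructed from this repo at the desired receptive field and channel width.)

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Stand-in module for demonstration; pass the GAST-Net model built at the
# desired receptive field / channel width instead.
print(f"{count_parameters(nn.Linear(128, 64)):.4f}M")  # 0.0083M
```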

Q2: Generalization. Our method generalizes well in the wild. Please take a look at our introduction videos: YouTube, Youku.

fabro66 commented 4 years ago

> Thank you for your reply, it helps me a lot! The new work sounds interesting; will it be released later? BTW, the arXiv paper "Motion Guided 3D Pose Estimation from Videos" also contributes multi-scale temporal modeling. Is there any difference? Thanks.

Hi~ Our work is different from the article you mentioned. In the future, we will release the latest method.

luzzou commented 4 years ago

> Hi~ Our work is different from the article you mentioned. In the future, we will release the latest method.

Thank you for your timely and kind reply, all my questions have been solved! Looking forward to your new work!