Closed zc-alexfan closed 3 years ago
Q1. "I use the default config.py but do a batch size of 32 with gradient accumulation of 2, which should be equivalent to a batch size of 64." <- Could you explain this in more detail?
Q2. "I saw a bug report earlier that the image sizes were swapped." <- When did you download the data and annotation files?
Q1. Here is the diff of my gradient-accumulation code against your code:
diff --git a/main/config.py b/main/config.py
index 6c1f530..1322c5a 100644
--- a/main/config.py
+++ b/main/config.py
@@ -32,7 +32,7 @@ class Config:
end_epoch = 20 if dataset == 'InterHand2.6M' else 50
lr = 1e-4
lr_dec_factor = 10
- train_batch_size = 16
+ train_batch_size = 32
## testing config
test_batch_size = 32
@@ -49,7 +49,7 @@ class Config:
result_dir = osp.join(output_dir, 'result')
## others
- num_thread = 40
+ num_thread = 8
gpu_ids = '0'
num_gpus = 1
continue_train = False
diff --git a/main/train.py b/main/train.py
index 1036fef..0f3e1f9 100644
--- a/main/train.py
+++ b/main/train.py
@@ -44,7 +44,8 @@ def main():
trainer = Trainer()
trainer._make_batch_generator(args.annot_subset)
trainer._make_model()
-
+ optim_step = False
+
# train
for epoch in range(trainer.start_epoch, cfg.end_epoch):
@@ -56,13 +57,16 @@ def main():
trainer.gpu_timer.tic()
# forward
- trainer.optimizer.zero_grad()
loss = trainer.model(inputs, targets, meta_info, 'train')
loss = {k:loss[k].mean() for k in loss}
# backward
- sum(loss[k] for k in loss).backward()
- trainer.optimizer.step()
+ my_loss = sum(loss[k] for k in loss)/2
+ my_loss.backward()
+ if optim_step:
+ trainer.optimizer.step()
+ trainer.optimizer.zero_grad()
+ optim_step = not optim_step
trainer.gpu_timer.toc()
screen = [
'Epoch %d/%d itr %d/%d:' % (epoch, cfg.end_epoch, itr, trainer.itr_per_epoch),
Q2. I downloaded the dataset on Sep. 11, 2020.
I am pretty sure the accumulated gradient update is correct, as I followed the instructions here. I also have a gradient-accumulation version of the code in PyTorch Lightning, which hasn't had any problems so far.
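As a sanity check on the equivalence the diff above relies on (two half-batches with the mean loss scaled by 1/2 should yield the same gradient as one full batch), here is a minimal numerical sketch with a toy linear least-squares model. This is a NumPy stand-in for illustration, not the repo's code:

```python
import numpy as np

# Toy model: loss = mean((x @ w - y)**2) over the batch.
# Claim: with two equal half-batches and each half-batch loss divided
# by 2 (as in the train.py diff), the summed accumulated gradient
# equals the gradient of the mean loss over the full batch.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
y = rng.normal(size=64)
w = rng.normal(size=3)

def grad(xb, yb, w):
    # d/dw mean((xb @ w - yb)^2) = 2 * xb.T @ (xb @ w - yb) / len(yb)
    return 2.0 * xb.T @ (xb @ w - yb) / len(yb)

g_full = grad(x, y, w)                                              # batch size 64
g_acc = grad(x[:32], y[:32], w) / 2 + grad(x[32:], y[32:], w) / 2   # two steps of 32, loss / 2
print(np.allclose(g_full, g_acc))  # True
```

This holds because the gradient is linear in the loss and the mean over 64 samples is the average of the two half-batch means.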
Q3. Have you verified that your code can reproduce the reported results by training a new model with the default settings? Just want to check.
I did a bit of follow-up using your latest version of the code and dataset (eeb7346, which does not have my changes):
python test.py --gpu 0 --test_epoch 19 --test_set val --annot_subset machine_annot
The evaluation result of the H+M trained model is shown below, which does not show 18mm for interacting hands.
Evaluation start...
Handedness accuracy: 0.9835464015151515
MRRPE: 35.98906321887903
MPJPE for each joint:
r_thumb4: 22.01, r_thumb3: 16.54, r_thumb2: 13.14, r_thumb1: 8.44, r_index4: 26.14, r_index3: 21.99, r_index2: 18.93, r_index1: 14.52, r_middle4: 26.37, r_middle3: 22.13, r_middle2: 19.53, r_middle1: 14.59, r_ring4: 24.76, r_ring3: 20.68, r_ring2: 17.99, r_ring1: 13.56, r_pinky4: 23.39, r_pinky3: 19.80, r_pinky2: 17.62, r_pinky1: 13.00, r_wrist: 0.00, l_thumb4: 22.30, l_thumb3: 17.06, l_thumb2: 13.23, l_thumb1: 8.32, l_index4: 24.67, l_index3: 20.17, l_index2: 17.47, l_index1: 13.67, l_middle4: 25.32, l_middle3: 21.17, l_middle2: 18.63, l_middle1: 13.84, l_ring4: 23.49, l_ring3: 19.67, l_ring2: 17.09, l_ring1: 13.36, l_pinky4: 23.58, l_pinky3: 19.85, l_pinky2: 17.23, l_pinky1: 12.98, l_wrist: 0.00,
MPJPE for all hand sequences: 17.58
MPJPE for each joint:
r_thumb4: 17.84, r_thumb3: 13.74, r_thumb2: 10.40, r_thumb1: 7.13, r_index4: 20.23, r_index3: 17.68, r_index2: 15.64, r_index1: 12.45, r_middle4: 22.57, r_middle3: 19.79, r_middle2: 17.17, r_middle1: 12.42, r_ring4: 21.58, r_ring3: 18.67, r_ring2: 15.69, r_ring1: 11.26, r_pinky4: 20.50, r_pinky3: 17.57, r_pinky2: 15.19, r_pinky1: 10.30, r_wrist: 0.00, l_thumb4: 18.94, l_thumb3: 14.95, l_thumb2: 11.09, l_thumb1: 7.13, l_index4: 19.73, l_index3: 16.11, l_index2: 14.18, l_index1: 11.16, l_middle4: 20.99, l_middle3: 17.85, l_middle2: 15.59, l_middle1: 11.46, l_ring4: 20.14, l_ring3: 16.91, l_ring2: 14.31, l_ring1: 11.02, l_pinky4: 20.50, l_pinky3: 17.24, l_pinky2: 14.70, l_pinky1: 10.63, l_wrist: 0.00,
MPJPE for single hand sequences: 14.82
MPJPE for each joint:
r_thumb4: 26.03, r_thumb3: 19.26, r_thumb2: 15.78, r_thumb1: 10.51, r_index4: 32.39, r_index3: 26.53, r_index2: 22.29, r_index1: 16.52, r_middle4: 30.59, r_middle3: 24.61, r_middle2: 21.96, r_middle1: 16.70, r_ring4: 28.12, r_ring3: 22.81, r_ring2: 20.32, r_ring1: 15.77, r_pinky4: 26.43, r_pinky3: 22.09, r_pinky2: 20.02, r_pinky1: 15.58, r_wrist: 0.00, l_thumb4: 25.99, l_thumb3: 19.31, l_thumb2: 15.53, l_thumb1: 10.19, l_index4: 30.52, l_index3: 24.78, l_index2: 21.05, l_index1: 16.34, l_middle4: 31.16, l_middle3: 25.11, l_middle2: 21.99, l_middle1: 16.37, l_ring4: 28.01, l_ring3: 22.88, l_ring2: 20.11, l_ring1: 15.85, l_pinky4: 27.34, l_pinky3: 22.75, l_pinky2: 19.92, l_pinky1: 15.45, l_wrist: 0.00,
MPJPE for interacting hand sequences: 20.59
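For context, the MPJPE reported above is the mean Euclidean distance between predicted and ground-truth 3D joints. A rough NumPy sketch of the metric (the array shapes, 21-joint layout, and wrist-alignment choice are assumptions for illustration, not the repo's exact evaluation code):

```python
import numpy as np

def mpjpe(pred, gt, root_idx=20):
    """Mean per-joint position error in mm after root (wrist) alignment.

    pred, gt: (num_samples, num_joints, 3) arrays of 3D joints in mm.
    root_idx=20 assumes the wrist is the last of 21 joints per hand
    (an assumption, not necessarily the repo's joint ordering).
    """
    pred = pred - pred[:, root_idx:root_idx + 1]  # make root-relative
    gt = gt - gt[:, root_idx:root_idx + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.zeros((2, 21, 3))
gt = np.zeros((2, 21, 3))
gt[:, 0] = [3.0, 4.0, 0.0]  # one joint off by a 3-4-5 triangle -> 5 mm
print(round(mpjpe(pred, gt), 4))  # 5 mm error on 1 of 21 joints -> 0.2381
```

Note the wrist error is 0.00 in the logs above precisely because the metric is root-relative.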
Oh I see. Sorry, but that is the testing result from InterNet trained on InterHand2.6M v0.0. The currently released version of InterHand2.6M (v0.0) is not the full InterHand2.6M, as described here. The testing results on InterHand2.6M v0.0 seem similar to your result. Hope this resolves your question.
Btw, is the gradient accumulation you described above applicable to the Adam optimizer? I think it is applicable to the SGD optimizer, but I'm not sure about Adam.
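FWIW, accumulation should be optimizer-agnostic: the optimizer only sees whatever gradient is in `.grad` when `step()` is called, so handing it the sum of the scaled half-batch gradients is identical to handing it the full-batch gradient, and Adam's moment estimates evolve the same way. A hand-rolled Adam update (standard update rule, written out here for illustration, not the repo's code) makes this concrete:

```python
import numpy as np

# Minimal Adam step. Because Adam only consumes the final gradient at
# step time, accumulating two half-batch gradients (each from a loss
# scaled by 1/2) before stepping feeds it exactly the full-batch
# gradient, so the parameter update is identical.
def adam_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
g_half1, g_half2 = np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])
g_full = (g_half1 + g_half2) / 2   # gradient of the full batch
g_acc = g_half1 / 2 + g_half2 / 2  # accumulated, each loss scaled by 1/2

w_full, *_ = adam_step(w, g_full, m, v, t=1)
w_acc, *_ = adam_step(w, g_acc, m, v, t=1)
print(np.allclose(w_full, w_acc))  # True
```

The one thing accumulation does change versus a true large batch is batch-norm statistics, which are still computed per sub-batch of 32.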
Thanks for the quick response despite it being ICCV time. Yes, that matches the numbers, but your paper says "All reported frame numbers and experimental results in the paper are from the 5 fps configuration."
However, the IH error for 5 fps is 20.59mm according to the "testing result" zip file, while the number in the paper is 18.58mm.
I assume v0.0 means 5 fps.
v0.0 is not 5 fps. Let me clarify this.
Full IH2.6M (not released yet because of the data inspection)
IH2.6M v0.0 (released)
All numbers in the paper are from the full IH2.6M, which is not released because of the data inspection. Therefore, I additionally provided the training and testing results on v0.0, which is released.
I see. That makes sense. For the 30 fps version of v0.0, only the annotations are released, right? My current understanding is that the released images are from the 5 fps set.
"For the 30 fps version of v0.0, only the annotations are released, right?"
-> Correct.
I'm going to train InterNet on the whole InterHand2.6M dataset using Colab Pro. Does anyone here have an estimate of how long one epoch would take?
Hi, I followed the instructions in the repo to train InterNet and reproduce its performance. While the reported number for the interacting hand pose validation error is 18.58mm (Table 4), my reproduced number is 20mm. Do you know why there is a discrepancy? I didn't modify anything in the repo to train this.
I saw a bug report earlier that the image sizes were swapped. Would that be the reason? Thanks.
Here is the command I used for training and validation (I guess I should use epoch 19 for testing, as the numbering starts from 0):
I use the default config.py but do a batch size of 32 with gradient accumulation of 2, which should be equivalent to a batch size of 64.