Hi,
One epoch is not enough; usually some improvement starts to appear after 4-5 epochs.
@ducha-aiki Hi, I kept training for 7 epochs, but the losses went even higher and the test on graf1-6 remained poor, so I stopped training and looked into the code.
I have two questions. 1) A code question: here you multiply by rotmat twice; shouldn't one of them be inv_rotmat?
2) I saved the patches generated during training, as shown below, and I don't think out_patches_a_crop and out_patches_p_crop are informative enough for training: they are too blurry, and a big part of the original patches is lost.
[Attached patch images: data_a, data_p, data_a_aff, data_p_aff, data_a_aff_crop, data_p_aff_crop, out_patches_a_crop, out_patches_p_crop]
Thanks for the questions. I don't have time to investigate the code right now, but will come back as fast as I can. The bad test result is a bit weird; I'll check the defaults a bit later.
But I can answer question 2): they are not that informative, but since we are not learning a descriptor, it doesn't matter that much. I think the network is doing well in that particular case: the diagonal line in the top right has a different angle in data_a_aff_crop and data_p_aff_crop, but looks the same in the out_a and out_p patches. You are right, though, that the scale may need to be reconsidered. The problem is that the larger the part of the original patch you crop, the smaller the maximum possible augmentation before you get black borders, which completely kill the training.
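To make the trade-off concrete, here is a minimal toy sketch (my own illustration, not the repo's augmentation code): warp an all-ones patch with an affine grid and measure how much of the output is zero padding, i.e. black border.

# Toy sketch: how much black border does a given augmentation produce?
import math
import torch
import torch.nn.functional as F

patch = torch.ones(1, 1, 32, 32)             # all-ones patch: zeros after warping are border
ang = math.radians(30)                        # hypothetical 30-degree augmentation
theta = torch.tensor([[[math.cos(ang), -math.sin(ang), 0.0],
                       [math.sin(ang),  math.cos(ang), 0.0]]])
grid = F.affine_grid(theta, patch.size())     # sampling grid for the warp
warped = F.grid_sample(patch, grid, padding_mode='zeros')
print('black-border fraction: %.2f' % (warped < 0.5).float().mean().item())

The stronger the warp (or the larger the crop of the original patch), the larger that fraction gets, which is exactly the trade-off described above.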
Hi, thanks for the reply!
I agree that we should avoid black borders, but I don't understand why such blurry patches can lead to a good AffNet. We send these blurry patches into HardNet to get their descriptors and use the descriptors to compute the loss. But can the descriptors of such blurry patches be discriminative enough to distinguish them from each other, when the texture in the patches is almost smoothed away?
Also, it seems that during training only an up-is-up affine matrix is estimated; the orientation is not estimated and is left to the descriptor. Why?
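(Just to check my understanding of up-is-up, here is a small sketch of how I read it, not the repo's code: strip the rotation from a 2x2 affine via polar decomposition and keep only the symmetric shape part, so orientation is left to a separate step.)

# Toy sketch: split a 2x2 affine into rotation x rotation-free shape
import torch

A = torch.tensor([[0.1842, -0.9080],
                  [0.8869,  1.0571]])               # first out_a_aff_back matrix below
U, S, V = torch.svd(A)                               # A = U diag(S) V^T
R = torch.mm(U, V.t())                               # orientation (rotation) part
shape = torch.mm(torch.mm(V, torch.diag(S)), V.t())  # symmetric, rotation-free shape part
print(torch.mm(R, shape))                            # reconstructs A up to numerical error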
I post more examples below, as well as out_a_aff_back and out_p_aff_back.
(Pdb) out_a_aff_back
tensor([[[ 0.1842, -0.9080],
         [ 0.8869,  1.0571]],

        [[-0.6931, -0.3653],
         [ 1.0236, -0.9033]],

        [[ 0.1809, -1.1663],
         [ 1.0450, -1.2088]],

        ...,

        [[ 0.3578, -0.3226],
         [ 1.9768,  1.0126]],

        [[ 0.9428, -1.3962],
         [-0.1599,  1.2975]],

        [[ 0.0073, -0.2650],
         [ 3.7845, -0.4105]]], device='cuda:0', grad_fn=<BmmBackward>)
(Pdb) out_p_aff_back
tensor([[[ 0.1885, -0.9291],
         [ 0.8500,  1.1153]],

        [[-0.2728, -0.1438],
         [ 3.6142, -1.7611]],

        [[ 0.1283, -0.8270],
         [ 1.0229,  1.2003]],

        ...,

        [[ 0.5456, -0.4919],
         [ 0.7592,  1.1485]],

        [[ 0.7269, -1.0765],
         [ 1.9099, -1.4527]],

        [[ 0.0074, -0.2710],
         [ 3.7002, -0.3504]]], device='cuda:0', grad_fn=<BmmBackward>)
Corresponding a / p matrices side by side:

a: [[ 0.1842, -0.9080], [ 0.8869,  1.0571]]    p: [[ 0.1885, -0.9291], [ 0.8500,  1.1153]]
a: [[-0.6931, -0.3653], [ 1.0236, -0.9033]]    p: [[-0.2728, -0.1438], [ 3.6142, -1.7611]]
a: [[ 0.0073, -0.2650], [ 3.7845, -0.4105]]    p: [[ 0.0074, -0.2710], [ 3.7002, -0.3504]]
[Attached patch images for sample 1023: data_a1023, data_p1023, data_a_aff1023, data_p_aff1023, data_a_aff_crop1023, data_p_aff_crop1023, out_patches_a_crop1023, out_patches_p_crop1023]
@IQ17 So far I have cloned this repo on a new machine, run run_me.sh, discovered a couple of errors, and fixed them in the latest commit.
My training from running run_me.sh looks like this, up to epoch 10 so far:
Test on graf1-6, 269 tentatives 50 true matches 0.185 inl.ratio Now native ori
So it definitely learns fine. My suspicion is a somehow different rng or pytorch version... or maybe a wrong config. Have you run run_me.sh, or did you do this manually?
Regarding patch extraction (this and the other issue), I will look into it and tell you the results :)
Wow, this is a wonderful result! Exactly what I saw with your pretrained model.
I just used run_me.sh (and fixed a few little bugs, like you did), but my true matches were never higher than 15... and training was slow: 33 hours per epoch (one 1080Ti GPU and a Xeon E5 CPU).
I will just re-clone the repository and try again. Thanks for your help!
33 hours per epoch is definitely completely broken. Sorry for the stupid question, but do you have mkl and cudnn enabled?
Yes, I have mkl and cudnn:
>>> import mkl
>>> mkl.get_version_string()
u'Intel(R) Math Kernel Library Version 2018.0.3 Product Build 20180406 for Intel(R) 64 architecture applications'
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
--
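For what it's worth, the same can also be checked from inside PyTorch:

import torch
print(torch.cuda.is_available())        # True if a GPU is visible to PyTorch
print(torch.backends.cudnn.enabled)     # True if cuDNN is enabled
print(torch.backends.cudnn.version())   # cuDNN version number, e.g. 7102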
Such a low speed is indeed strange; I am checking whether something is wrong.
It is very strange: during training, only one CPU core goes to 100% while the other cores do almost nothing, and the GPU load is also low, ~2% most of the time. But when I try this, all my CPU cores can reach 100%, and when I use the GPU, it goes to 100% too.
I tried adding mkl.set_num_threads(36) and torch.set_num_threads(36) before the training code, but it was useless...
Sorry to bother you, but do you have any suspicion what it could be, or any env settings for PyTorch?
I don't have any env settings... Do you have only one version of Python and PyTorch? Sometimes they can have weird clashes. You could also try to increase the number of workers: https://github.com/ducha-aiki/affnet/blob/master/train_AffNet_test_on_graffity.py#L56
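For reference, a minimal sketch of what increasing the workers means (with a stand-in dataset, not the repo's actual loader setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(4096, 1, 32, 32))   # stand-in for the patch dataset
train_loader = DataLoader(train_dataset,
                          batch_size=1024,
                          shuffle=True,
                          num_workers=8,      # parallel CPU workers feeding the GPU
                          pin_memory=True)    # speeds up host-to-GPU copies
for (batch,) in train_loader:
    pass                                      # training step would go here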
Thanks for the help! Although I don't know why, I identified that most of the time cost is due to inv_TA: by setting inv_TA=None and hence ignoring geom_dist, I got a >33x speedup (from 11.8 s/it to 3.21 it/s), so now one epoch takes less than 1 hour and training can be done in one day.
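In case it helps anyone else hunting a bottleneck like this, a minimal timing sketch (the wrapped function names are hypothetical placeholders, not the repo's actual functions):

import time
import torch

def timed(label, fn, *args):
    # time one call, synchronizing CUDA so GPU work is actually counted
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.time()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print('%s: %.3f s' % (label, time.time() - t0))
    return out

# usage inside the training step (hypothetical names):
# descriptors = timed('descriptor', hardnet, patches)
# geom_dist   = timed('geom_dist', compute_geom_dist, out_aff, inv_TA)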
Thanks a lot!
Wow. Good to know, thanks!
I got some good test results, thanks!
By the way, for people training this repo for the first time: do NOT judge the training process from the loss, look at the test results instead. See my losses (I set geom_dist to a constant 10.0):
Train Epoch: 0 [0/10000000 (0%)] Loss: 1.1791, 10.0000,1.1822: : 0it [00:02, ?it/s]
Train Epoch: 6 [5355520/10000000 (54%)] Loss: 0.9872, 10.0000,0.9762: : 5232it [26:58, 3.23it/s]
Train Epoch: 10 [4618240/10000000 (46%)] Loss: 1.0242, 10.0000,1.0073: : 4520it [23:29, 3.21it/s]
Test epoch -1 Test on graf1-6, 217 tentatives 11 true matches 0.050 inl.ratio
Test epoch 0 Test on graf1-6, 166 tentatives 12 true matches 0.072 inl.ratio
Test epoch 1 Test on graf1-6, 250 tentatives 39 true matches 0.156 inl.ratio
Test epoch 2 Test on graf1-6, 261 tentatives 48 true matches 0.183 inl.ratio
Test epoch 3 Test on graf1-6, 272 tentatives 62 true matches 0.227 inl.ratio
Test epoch 4 Test on graf1-6, 288 tentatives 73 true matches 0.253 inl.ratio
Test epoch 5 Test on graf1-6, 283 tentatives 44 true matches 0.155 inl.ratio
Test epoch 6 Test on graf1-6, 286 tentatives 67 true matches 0.234 inl.ratio
Test epoch 7 Test on graf1-6, 270 tentatives 56 true matches 0.207 inl.ratio
Test epoch 8 Test on graf1-6, 286 tentatives 69 true matches 0.241 inl.ratio
Test epoch 9 Test on graf1-6, 274 tentatives 54 true matches 0.197 inl.ratio
That's great! Thank you for posting :) P.S. @IQ17 I have commented out the inv_TA part in the main repo, so others won't hit such a terrific slowdown.
Hi, could you show me the details of where you modified the code to get this good result?
@ym547559398 no modification should be needed.
Hi, thanks for the repo!
I want to train the network, so I just call run_me.sh without any change (but with PyTorch 0.4.1), but the process is very slow (and the GPU load is very, very low), and the loss does not seem to change much... the losses are not decreasing, and the test results are getting worse. So I would like to ask whether the training process looks OK?
Below is part of the training and validation logs.
For epoch -1:

train_AffNet_test_on_graffity.py:250: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  var_image = torch.autograd.Variable(torch.from_numpy(img.astype(np.float32)), volatile = True)
SparseImgRepresenter.py:151: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  if (num_features > 0) and (num_survived.data[0] > num_features):
[the two warnings above repeat for every image and are omitted below]
0.641245126724 detection multiscale
affnet_time 0.101545810699 pe_time 0.0597500801086
0.166574954987 affine shape iters
0.0878648757935 detection multiscale
affnet_time 0.0334780216217 pe_time 0.0607059001923
0.108034133911 affine shape iters
Test epoch -1 Test on graf1-6, 217 tentatives 11 true matches 0.050 inl.ratio
Now native ori
0.066300868988 detection multiscale
affnet_time 0.0342180728912 pe_time 0.0577509403229
0.11149096489 affine shape iters
0.101871013641 detection multiscale
affnet_time 0.0336909294128 pe_time 0.0553169250488
0.103847026825 affine shape iters
Test epoch -1 Test on ori graf1-6, 147 tentatives 10 true matches 0.068 inl.ratio

For epoch 0:

Train Epoch: 0 [9984000/10000000 (100%)] Loss: 0.9201, 1.5074,0.9073: : 9760it [32:02:26, 11.82s/it]
...
Train Epoch: 0 [9994240/10000000 (100%)] Loss: 0.9484, 1.5387,0.9369: : 9766it [32:03:32, 11.82s/it]
0.0655670166016 detection multiscale
affnet_time 0.0328919887543 pe_time 0.0553648471832
0.103418111801 affine shape iters
0.0645890235901 detection multiscale
affnet_time 0.0329079627991 pe_time 0.0524799823761
0.100947141647 affine shape iters
Test epoch 0 Test on graf1-6, 183 tentatives 13 true matches 0.071 inl.ratio
Now native ori
0.0709731578827 detection multiscale
affnet_time 0.033175945282 pe_time 0.0535531044006
0.103495836258 affine shape iters
0.100589036942 detection multiscale
affnet_time 0.0331048965454 pe_time 0.0523760318756
0.100074052811 affine shape iters
Test epoch 0 Test on ori graf1-6, 155 tentatives 9 true matches 0.058 inl.ratio

For epoch 1:

Train Epoch: 1 [9984000/10000000 (100%)] Loss: 0.9505, 2.0144,0.9437: : 9759it [33:31:50, 12.37s/it]
...
Train Epoch: 1 [9994240/10000000 (100%)] Loss: 0.9703, 1.9384,0.9606: : 9766it [33:33:11, 12.37s/it]
0.0826170444489 detection multiscale
affnet_time 0.0436849594116 pe_time 0.0609710216522
0.110808134079 affine shape iters
0.0830068588257 detection multiscale
affnet_time 0.0333650112152 pe_time 0.054986000061
0.103302001953 affine shape iters
Test epoch 1 Test on graf1-6, 165 tentatives 6 true matches 0.036 inl.ratio
Now native ori
0.066458940506 detection multiscale
affnet_time 0.0339682102203 pe_time 0.0546021461487
0.103893041611 affine shape iters
0.104158878326 detection multiscale
affnet_time 0.0339379310608 pe_time 0.0541059970856
0.104035139084 affine shape iters
matplotlib/pyplot.py:537: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (matplotlib.pyplot.figure) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam figure.max_open_warning.)
Test epoch 1 Test on ori graf1-6, 148 tentatives 9 true matches 0.060 inl.ratio