ducha-aiki / affnet

Code and weights for local feature affine shape estimation paper "Repeatability Is Not Enough: Learning Discriminative Affine Regions via Discriminability"
MIT License

does my training process look ok?? #10

Closed IQ17 closed 6 years ago

IQ17 commented 6 years ago

Hi, thanks for the repo!

I want to train the network, so I just ran run_me.sh without any changes (but with PyTorch 0.4.1). However, the process is very slow (and the GPU load is very low), the losses are not decreasing, and the test results are getting worse. So I would like to ask: does the training process look OK?

below are part of training and validation logs

For epoch -1:

```
train_AffNet_test_on_graffity.py:250: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  var_image = torch.autograd.Variable(torch.from_numpy(img.astype(np.float32)), volatile = True)
/media/iouiwc/0596f94c-b314-4162-80b4-79b3a602c9a2/iouiwc/github/affnet/SparseImgRepresenter.py:151: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  if (num_features > 0) and (num_survived.data[0] > num_features):
0.641245126724 detection multiscale
affnet_time 0.101545810699 pe_time 0.0597500801086
0.166574954987 affine shape iters
0.0878648757935 detection multiscale
affnet_time 0.0334780216217 pe_time 0.0607059001923
0.108034133911 affine shape iters
Test epoch -1 Test on graf1-6, 217 tentatives 11 true matches 0.050 inl.ratio
Now native ori
0.066300868988 detection multiscale
affnet_time 0.0342180728912 pe_time 0.0577509403229
0.11149096489 affine shape iters
0.101871013641 detection multiscale
affnet_time 0.0336909294128 pe_time 0.0553169250488
0.103847026825 affine shape iters
Test epoch -1 Test on ori graf1-6, 147 tentatives 10 true matches 0.068 inl.ratio
```
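For readers hitting the same warnings: they come from the PyTorch 0.3 → 0.4 API change. On 0.4+ the inference-time wrapper can be written as below (a minimal sketch of the migration, not the repo's actual code):

```python
import numpy as np
import torch

img = np.zeros((32, 32), dtype=np.float32)

# PyTorch <= 0.3 style, which triggers the first warning:
#   var_image = torch.autograd.Variable(torch.from_numpy(img), volatile=True)
# PyTorch >= 0.4 equivalent: wrap inference in torch.no_grad()
with torch.no_grad():
    var_image = torch.from_numpy(img).unsqueeze(0).unsqueeze(0)  # NCHW
    # ... the detector forward pass would go here ...

# Similarly, the second warning means num_survived.data[0]
# should become num_survived.item() on 0.4+.
print(var_image.requires_grad)
```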

For epoch 0:

```
Train Epoch: 0 [9984000/10000000 (100%)] Loss: 0.9201, 1.5074,0.9073: : 9760it [32:02:26, 11.82s/it]
Train Epoch: 0 [9994240/10000000 (100%)] Loss: 0.9484, 1.5387,0.9369: : 9766it [32:03:32, 11.82s/it]
0.0655670166016 detection multiscale
affnet_time 0.0328919887543 pe_time 0.0553648471832
0.103418111801 affine shape iters
0.0645890235901 detection multiscale
affnet_time 0.0329079627991 pe_time 0.0524799823761
0.100947141647 affine shape iters
Test epoch 0 Test on graf1-6, 183 tentatives 13 true matches 0.071 inl.ratio
Now native ori
0.0709731578827 detection multiscale
affnet_time 0.033175945282 pe_time 0.0535531044006
0.103495836258 affine shape iters
0.100589036942 detection multiscale
affnet_time 0.0331048965454 pe_time 0.0523760318756
0.100074052811 affine shape iters
Test epoch 0 Test on ori graf1-6, 155 tentatives 9 true matches 0.058 inl.ratio
```

For epoch 1:

```
Train Epoch: 1 [9984000/10000000 (100%)] Loss: 0.9505, 2.0144,0.9437: : 9759it [33:31:50, 12.37s/it]
Train Epoch: 1 [9994240/10000000 (100%)] Loss: 0.9703, 1.9384,0.9606: : 9766it [33:33:11, 12.37s/it]
0.0826170444489 detection multiscale
affnet_time 0.0436849594116 pe_time 0.0609710216522
0.110808134079 affine shape iters
0.0830068588257 detection multiscale
affnet_time 0.0333650112152 pe_time 0.054986000061
0.103302001953 affine shape iters
Test epoch 1 Test on graf1-6, 165 tentatives 6 true matches 0.036 inl.ratio
Now native ori
0.066458940506 detection multiscale
affnet_time 0.0339682102203 pe_time 0.0546021461487
0.103893041611 affine shape iters
0.104158878326 detection multiscale
affnet_time 0.0339379310608 pe_time 0.0541059970856
0.104035139084 affine shape iters
/home/iouiwc/anaconda2/envs/pytorch/lib/python2.7/site-packages/matplotlib/pyplot.py:537: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (matplotlib.pyplot.figure) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam figure.max_open_warning).
Test epoch 1 Test on ori graf1-6, 148 tentatives 9 true matches 0.060 inl.ratio
```

ducha-aiki commented 6 years ago

Hi,

One epoch is too few; improvements usually start to appear after 4-5 epochs.

IQ17 commented 6 years ago

@ducha-aiki Hi, I kept training for 7 epochs, but the losses went even higher and the test on graf1-6 remained poor, so I stopped training and looked into the code.

I have two questions. 1) A code question: here you multiply by rotmat twice; shouldn't one of them be inv_rotmat?
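As a sanity check on question 1 (my own illustration, not code from the repo): for a pure 2D rotation the inverse is the transpose, so multiplying by rotmat twice rotates by 2*theta, while rotmat times inv_rotmat cancels out:

```python
import numpy as np

theta = 0.7  # arbitrary test angle
c, s = np.cos(theta), np.sin(theta)
rotmat = np.array([[c, -s],
                   [s,  c]])
inv_rotmat = rotmat.T  # for a pure rotation, inverse == transpose

# rotmat @ inv_rotmat cancels to the identity ...
assert np.allclose(rotmat @ inv_rotmat, np.eye(2))

# ... while rotmat @ rotmat rotates by 2*theta instead:
double = np.array([[np.cos(2 * theta), -np.sin(2 * theta)],
                   [np.sin(2 * theta),  np.cos(2 * theta)]])
assert np.allclose(rotmat @ rotmat, double)
```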

2) I saved the patches generated during training (shown below), and I don't think out_patches_a_crop and out_patches_p_crop are informative enough for training: they are too blurry, and a big part of the original patch is lost.

Images: data_a, data_p, data_a_aff, data_p_aff, data_a_aff_crop, data_p_aff_crop, out_patches_a_crop, out_patches_p_crop

ducha-aiki commented 6 years ago

Thanks for the questions. I don't have time to investigate the code right now, but I will come back to it as fast as I can. The bad test result is a bit weird; I'll check the defaults a bit later.

But I can answer question 2): the patches are not that informative, but since we are not learning a descriptor, it doesn't matter that much. I think the network is doing well in this particular case: the diagonal line in the top right has a different angle in data_a_aff_crop and data_p_aff_crop, but looks the same in the out_a and out_p patches. You are right, though, that the scale may need to be reconsidered. The problem is that the larger the part of the original patch you crop, the smaller the maximum possible augmentation without getting black borders, which completely kill the training.
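The crop-size vs. augmentation trade-off can be made concrete with a little geometry (my own back-of-the-envelope sketch, not from the repo): a centered square crop covering fraction f of the patch side survives an in-plane rotation t without black borders only while f * (|cos t| + |sin t|) <= 1, which gives a closed-form bound on the safe rotation:

```python
import math

def max_safe_rotation_deg(crop_frac):
    """Largest in-plane rotation (degrees) under which a centered square
    crop of side crop_frac * patch_side still lies entirely inside the
    patch, i.e. produces no black borders.
    Derived from crop_frac * (cos t + sin t) <= 1;
    valid for 1/sqrt(2) <= crop_frac <= 1."""
    t = math.asin(1.0 / (crop_frac * math.sqrt(2.0))) - math.pi / 4.0
    return math.degrees(t)

# The bigger the crop, the less rotation augmentation is safe:
for f in (0.75, 0.85, 0.95):
    print(round(f, 2), round(max_safe_rotation_deg(f), 1))
```

At crop_frac = 1 (crop the whole patch) no rotation at all is safe, while at crop_frac = 1/sqrt(2) the full 45 degrees is available, which is exactly the tension described above.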

IQ17 commented 6 years ago

Hi thanks for the reply!

I agree that we should avoid black borders, but I don't understand why those blurry patches can lead to a good AffNet. We feed those blurry patches into HardNet for their descriptors and use the descriptors to compute the loss. But can the descriptors of such blurry patches be discriminative enough to distinguish each other, when the texture in the patches is almost smoothed away?

Also, it seems that during training only an up-is-up affine matrix is estimated; the orientation is not estimated and is left to the descriptor. Why?
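For readers unfamiliar with the term: an "up-is-up" affine is commonly parameterized with a zero in the upper-right entry, so the vertical direction stays vertical and in-plane rotation is factored out. A hypothetical sketch of that idea (the function and parameter names are mine, not AffNet's):

```python
import numpy as np

def up_is_up(a11, a21, a22):
    """Sketch of an 'up-is-up' shape matrix: the zero in the upper-right
    slot maps the 'up' direction (0, 1) to a vertical vector, removing
    in-plane rotation; dividing by sqrt(|det|) removes overall scale,
    leaving only the affine *shape* of the region."""
    A = np.array([[a11, 0.0],
                  [a21, a22]])
    return A / np.sqrt(abs(np.linalg.det(A)))

A = up_is_up(1.5, 0.3, 0.8)
assert abs(abs(np.linalg.det(A)) - 1.0) < 1e-9  # scale factored out
assert A[0, 1] == 0.0                           # 'up' stays 'up'
```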

I post more examples below, as well as out_a_aff_back and out_p_aff_back:

```
(Pdb) out_a_aff_back
tensor([[[ 0.1842, -0.9080],
         [ 0.8869,  1.0571]],

        [[-0.6931, -0.3653],
         [ 1.0236, -0.9033]],

        [[ 0.1809, -1.1663],
         [ 1.0450, -1.2088]],

        ...,

        [[ 0.3578, -0.3226],
         [ 1.9768,  1.0126]],

        [[ 0.9428, -1.3962],
         [-0.1599,  1.2975]],

        [[ 0.0073, -0.2650],
         [ 3.7845, -0.4105]]], device='cuda:0', grad_fn=<BmmBackward>)

(Pdb) out_p_aff_back
tensor([[[ 0.1885, -0.9291],
         [ 0.8500,  1.1153]],

        [[-0.2728, -0.1438],
         [ 3.6142, -1.7611]],

        [[ 0.1283, -0.8270],
         [ 1.0229,  1.2003]],

        ...,

        [[ 0.5456, -0.4919],
         [ 0.7592,  1.1485]],

        [[ 0.7269, -1.0765],
         [ 1.9099, -1.4527]],

        [[ 0.0074, -0.2710],
         [ 3.7002, -0.3504]]], device='cuda:0', grad_fn=<BmmBackward>)
```

Patch pair 0 (a vs. p):

```
a: [[ 0.1842, -0.9080], [ 0.8869,  1.0571]]
p: [[ 0.1885, -0.9291], [ 0.8500,  1.1153]]
```

Images: data_a0, data_p0, data_a_aff0, data_p_aff0, data_a_aff_crop0, data_p_aff_crop0, out_patches_a_crop0, out_patches_p_crop0

Patch pair 1 (a vs. p):

```
a: [[-0.6931, -0.3653], [ 1.0236, -0.9033]]
p: [[-0.2728, -0.1438], [ 3.6142, -1.7611]]
```

Images: data_a1, data_p1, data_a_aff1, data_p_aff1, data_a_aff_crop1, data_p_aff_crop1, out_patches_a_crop1, out_patches_p_crop1

Patch pair 1023 (a vs. p):

```
a: [[ 0.0073, -0.2650], [ 3.7845, -0.4105]]
p: [[ 0.0074, -0.2710], [ 3.7002, -0.3504]]
```

Images: data_a1023, data_p1023, data_a_aff1023, data_p_aff1023, data_a_aff_crop1023, data_p_aff_crop1023, out_patches_a_crop1023, out_patches_p_crop1023

ducha-aiki commented 6 years ago

@IQ17 So far I have cloned this repo on a new machine, run run_me.sh, discovered a couple of errors, and fixed them in the latest commit.

My training from running run_me.sh looks like this, up to epoch 10 so far:

```
old-ufo@oldufo-ubuntupc:~/storage/test_affne/affnet$ cat try_run.log | grep "Now nati" -B1
Test on graf1-6, 217 tentatives 11 true matches 0.050 inl.ratio
Now native ori
Test on graf1-6, 188 tentatives 15 true matches 0.079 inl.ratio
Now native ori
Test on graf1-6, 260 tentatives 31 true matches 0.119 inl.ratio
Now native ori
Test on graf1-6, 280 tentatives 59 true matches 0.210 inl.ratio
Now native ori
Test on graf1-6, 281 tentatives 49 true matches 0.174 inl.ratio
Now native ori
Test on graf1-6, 295 tentatives 47 true matches 0.159 inl.ratio
Now native ori
Test on graf1-6, 285 tentatives 50 true matches 0.175 inl.ratio
Now native ori
Test on graf1-6, 308 tentatives 59 true matches 0.191 inl.ratio
Now native ori
Test on graf1-6, 303 tentatives 62 true matches 0.204 inl.ratio
Now native ori
Test on graf1-6, 278 tentatives 58 true matches 0.208 inl.ratio
Now native ori
Test on graf1-6, 269 tentatives 50 true matches 0.185 inl.ratio
Now native ori
```

So it definitely learns fine. My suspicion is a somehow different RNG or PyTorch version... or maybe a wrong config. Did you run run_me.sh, or did you do this manually?

Regarding patch extraction (this and the other issue), I will look into it and tell you the results :)

IQ17 commented 6 years ago

Wow, this is a wonderful result! Exactly what I saw with your pretrained model.

I just used run_me.sh (and fixed a few small bugs, like you did), but my true matches never went above 15... and training was slow: 33 hours per epoch (one 1080 Ti GPU and a Xeon E5 CPU).

I will re-clone the repository and try again. Thanks for your help!

ducha-aiki commented 6 years ago

33 hours per epoch is definitely completely broken. Sorry for the stupid question, but do you have MKL and cuDNN enabled?

IQ17 commented 6 years ago

Yes, I have MKL and cuDNN:

```
>>> import mkl
>>> mkl.get_version_string()
u'Intel(R) Math Kernel Library Version 2018.0.3 Product Build 20180406 for Intel(R) 64 architecture applications'
```

```
$ cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 3
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"
```

Such a low speed is indeed strange; I am checking whether something is wrong.
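The header check above only shows what is installed system-wide; what actually matters for speed is the cuDNN that the installed PyTorch binary was built against, which can be queried at runtime (a quick sanity check of my own, not from the repo):

```python
import torch

# Whether PyTorch will dispatch convolutions to cuDNN at all
print(torch.backends.cudnn.enabled)

# CUDA version the installed torch build was compiled against
# (None on a CPU-only build, which would explain a huge slowdown)
print(torch.version.cuda)

if torch.cuda.is_available():
    # cuDNN version linked into this torch build, e.g. 7102
    print(torch.backends.cudnn.version())
```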

IQ17 commented 6 years ago

It is very strange: during training, only one CPU core goes to 100% while the other cores are doing almost nothing, and the GPU load is also low, ~2% most of the time. But when I try this, all my CPU cores can reach 100%, and when I use the GPU, it goes to 100% too.

I tried adding mkl.set_num_threads(36) and torch.set_num_threads(36) before the training code, but it was useless...

Sorry to bother you, but do you have any suspicion what it could be, or any env settings for PyTorch?

ducha-aiki commented 6 years ago

I don't have any env settings... Do you have only one version of Python and PyTorch? Sometimes they can have weird clashes. You could also try to increase the number of workers: https://github.com/ducha-aiki/affnet/blob/master/train_AffNet_test_on_graffity.py#L56
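For context, the workers setting is the standard DataLoader argument; a minimal sketch with a toy dataset (not the repo's actual loader):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the patch dataset: 64 single-channel 32x32 patches
ds = TensorDataset(torch.randn(64, 1, 32, 32))

# num_workers > 0 spawns that many loader processes, so batch
# preparation overlaps with GPU compute instead of serializing on a
# single CPU core (num_workers=0 does everything in the main process).
loader = DataLoader(ds, batch_size=16, num_workers=4, pin_memory=True)
print(len(loader))  # number of batches per epoch
```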

IQ17 commented 6 years ago

Thanks for the help! Although I don't know why, I found that most of the time cost is due to inv_TA: by setting inv_TA=None and thus skipping geom_dist, I got a >33x speedup (from 11.82 s/it to 3.21 it/s). Now one epoch takes less than an hour, so training can be done in one day.
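For anyone who needs to localize a bottleneck like this themselves, a generic approach (my own sketch, not what was run here) is to wrap the suspect computations in a timing context manager and compare the accumulated totals:

```python
import time
from contextlib import contextmanager

totals = {}

@contextmanager
def timed(name):
    # Accumulate wall-clock time per labelled section
    t0 = time.perf_counter()
    try:
        yield
    finally:
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - t0

# Stand-ins for the real work: a heavy section vs. a light one
# ("geom_dist" is a hypothetical label for the inv_TA path)
with timed("geom_dist"):
    sum(i * i for i in range(200000))
with timed("descriptor_loss"):
    sum(range(1000))

slowest = max(totals, key=totals.get)
print(slowest)
```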

Thanks a lot!

ducha-aiki commented 6 years ago

Wow. Good to know, thanks!

IQ17 commented 6 years ago

I got some good test results, thanks!

By the way, for people training this repo for the first time: do NOT judge the training process from the loss; look at the test results. See my losses (I set geom_dist to a constant 10.0):

Train Epoch: 0 [0/10000000 (0%)]    Loss: 1.1791, 10.0000,1.1822: : 0it [00:02, ?it/s]
Train Epoch: 6 [5355520/10000000 (54%)] Loss: 0.9872, 10.0000,0.9762: : 5232it [26:58,  3.23it/s]
Train Epoch: 10 [4618240/10000000 (46%)]    Loss: 1.0242, 10.0000,1.0073: : 4520it [23:29,  3.21it/s]

Test epoch -1 Test on graf1-6, 217 tentatives 11 true matches 0.050 inl.ratio

Test epoch 0 Test on graf1-6, 166 tentatives 12 true matches 0.072 inl.ratio

Test epoch 1 Test on graf1-6, 250 tentatives 39 true matches 0.156 inl.ratio

Test epoch 2 Test on graf1-6, 261 tentatives 48 true matches 0.183 inl.ratio

Test epoch 3 Test on graf1-6, 272 tentatives 62 true matches 0.227 inl.ratio

Test epoch 4 Test on graf1-6, 288 tentatives 73 true matches 0.253 inl.ratio

Test epoch 5 Test on graf1-6, 283 tentatives 44 true matches 0.155 inl.ratio

Test epoch 6 Test on graf1-6, 286 tentatives 67 true matches 0.234 inl.ratio

Test epoch 7 Test on graf1-6, 270 tentatives 56 true matches 0.207 inl.ratio

Test epoch 8 Test on graf1-6, 286 tentatives 69 true matches 0.241 inl.ratio

Test epoch 9 Test on graf1-6, 274 tentatives 54 true matches 0.197 inl.ratio

ducha-aiki commented 6 years ago

That's great! Thank you for posting :) P.S. @IQ17 I have commented out the inv_TA part in the main repo, so others won't hit such a terrific slowdown.

feymanpriv commented 5 years ago

> I got some good test results, thanks!
>
> By the way, for people training this repo for the first time: do NOT judge the training process from the loss; look at the test results.

Hi, could you show me the details of where you modified the code to get this good result?

ducha-aiki commented 5 years ago

@ym547559398 no modification should be needed.