PruneTruong / DenseMatching

Dense matching library based on PyTorch
GNU Lesser General Public License v2.1
690 stars · 82 forks

Line 442 in PDCNet.py. #5

Closed · zwyking closed this 3 years ago

zwyking commented 3 years ago

Hello, thanks for your excellent work and code. I have a question about line 442 in PDCNet.py. I think the code should be "c_t=c23, cs=c13", which would mean the source is image1 and the target is image2. (Maybe I misunderstand your code; please correct me if so.)

PruneTruong commented 3 years ago

Hi, so im1 is the target image (also referred to as the reference image, for example in PDC-Net paper) and im2 is the source image (also referred to as the query image, for example in PDC-Net paper). Therefore, the provided original version of the code is correct. Here, we estimate the flow field relating the target/reference to the source/query. This flow can be used to align the source/query to the target/reference by warping the source/query according to the flow. Let me know if you have other questions :)
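As a minimal illustration of this convention (a hedged sketch, not the exact code in the repo), warping the source/query image into the target/reference frame with a flow expressed in pixel units could look like this:

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(source_img, flow):
    """Warp the source/query image into the target/reference frame.

    source_img: (B, C, H, W) source/query image.
    flow:       (B, 2, H, W) flow in pixels estimated from target/reference
                to source/query, i.e. it maps target coordinates to source coordinates.
    """
    b, _, h, w = flow.shape
    # Base pixel grid of the target image.
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().unsqueeze(0).to(flow.device)  # (1, 2, H, W)
    # Sampling locations in the source image = target grid + flow.
    coords = grid + flow
    # Normalize to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    # Result: the source/query resampled into alignment with the target/reference.
    return F.grid_sample(source_img, grid_norm, align_corners=True)
```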

zwyking commented 3 years ago

Thanks for your explanation; I understand this part now. I have found another problem: memory consumption gradually grows over the training epochs. Is this normal?

PruneTruong commented 3 years ago

I don't think I had this issue before, but let me double-check the published version; I will get back to you.

zwyking commented 3 years ago

Ok. This phenomenon is obvious in PDCNet_stage2. During my training, memory consumption reaches nearly 200 GB by epoch 25.

PruneTruong commented 3 years ago

You're right, there seems to be a memory leak that I didn't have before. I will investigate. In the meantime, setting the number of workers to 0 considerably reduces the required memory, but it might make training slower. Also, if training crashes, you can restart it and it will resume from where it crashed. Sorry for the inconvenience.
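For reference, this is a generic sketch of what reducing the worker count looks like; the actual setting lives in the repo's train_settings files, and `train_dataset` and the batch size here are placeholders:

```python
from torch.utils.data import DataLoader

# num_workers=0 loads data in the main process, avoiding per-worker copies of
# the dataset state at the cost of slower data loading.
train_loader = DataLoader(train_dataset, batch_size=6, shuffle=True, num_workers=0)
```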

zwyking commented 3 years ago

Ok, thanks for your patient responses. This is great work.

zwyking commented 3 years ago

I think the memory leak occurs in the dataloader, because I see the memory grow when a new epoch starts.

PruneTruong commented 3 years ago

Hi, it should be fixed now. There is an increase within one epoch, but only a very minor increase between epochs (I am also storing some validation logs). I trained 16 epochs of PDCNet_stage2 with about 40 GB of CPU memory without issue. Let me know if you still encounter problems!

zwyking commented 3 years ago

Sorry for the late response. Could you tell me which files you fixed, so I can directly replace them in my code (I have modified the original code) and monitor the memory? Thanks a lot!

PruneTruong commented 3 years ago

Ah sure! The main fix is in https://github.com/PruneTruong/DenseMatching/blob/main/datasets/mixture_of_datasets.py, where I had a list that kept growing at each sampling (therefore at the beginning of each epoch). I also added another sampling for MegaDepth (https://github.com/PruneTruong/DenseMatching/blob/main/datasets/MegaDepth/megadepth.py) that uses less memory, but it is not absolutely essential and requires changing the arguments in the train_settings. I also fixed the COCO loader (https://github.com/PruneTruong/DenseMatching/blob/main/datasets/object_augmented_dataset/coco.py), so the try/except in the object dataset (https://github.com/PruneTruong/DenseMatching/blob/main/datasets/object_augmented_dataset/synthetic_object_augmentation_for_pairs_multiple_ob.py) is not needed anymore. However, I think that with only the first fix to the mixture dataset, you should not get a leak anymore. If you try it, could you please let me know how it goes?
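To illustrate the kind of bug described for the mixture dataset (a hypothetical sketch, not the repo's actual class), the pattern is an index list that is appended to on every resampling instead of being rebuilt:

```python
import random

class MixtureSampler:
    """Hypothetical sketch of the leak pattern: an index list that keeps growing."""

    def __init__(self, dataset_sizes):
        self.dataset_sizes = dataset_sizes
        self.sampled_indices = []

    def resample_epoch_leaky(self):
        # Leaky version: appending every epoch makes the list grow without bound.
        for ds_id, size in enumerate(self.dataset_sizes):
            self.sampled_indices += [(ds_id, random.randrange(size)) for _ in range(size)]
        return self.sampled_indices

    def resample_epoch_fixed(self):
        # Fixed version: rebuild the list from scratch at every epoch.
        self.sampled_indices = [
            (ds_id, random.randrange(size))
            for ds_id, size in enumerate(self.dataset_sizes)
            for _ in range(size)
        ]
        return self.sampled_indices
```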

zwyking commented 3 years ago

Ok, I will give it a try.

zwyking commented 3 years ago

Hi, I have trained your new code without changes. PDCNet_stage1 still has an obvious memory increase; I don't know if it's normal.

zwyking commented 3 years ago

[Screenshot: QQ截图20210828094629] This is a snapshot of PDCNet_stage2 after 6 epochs; a clear memory leak can be seen. Meanwhile, I notice a big memory increase after validation.

PruneTruong commented 3 years ago

Hi, I am really sorry: I was running the code from my main internal repo and only realized now that I hadn't correctly pushed all files to this repo, so I was not seeing the leak that you see. The problem was in https://github.com/PruneTruong/DenseMatching/blob/main/training/losses/multiscale_loss.py, where .item() calls were missing when logging the loss, so the whole computation graph was kept alive throughout the epochs. With this fixed, I trained using exactly this repo.
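As an illustration of this class of bug (a simplified, self-contained sketch with a toy model, not the actual code in multiscale_loss.py), accumulating the loss tensor itself keeps its computation graph alive, while .item() converts it to a plain float:

```python
import torch
import torch.nn as nn

# Toy setup just to illustrate the logging pattern (not the repo's training loop).
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(10)]

running_loss = 0.0
for x, y in data:
    loss = nn.functional.mse_loss(model(x), y)

    # Leaky logging: accumulating the tensor keeps every batch's graph in memory.
    # running_loss += loss

    # Fixed logging: .item() returns a Python float, so the graph can be freed.
    running_loss += loss.item()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("mean loss:", running_loss / len(data))
```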

In stage 1, I see an increase of about 1.8 GB within one epoch (reduced to about 1 GB when using cv2.imread instead of imageio.imread), but the memory is freed at the end of the epoch. Therefore, I only see an absolute increase of about 400 MB at the end of the first epoch (because all the classes for logging metrics are created) and at most 200 MB between epochs after that (due to the in-memory logging).

For stage 2, I see an increase of about 5 GB within one epoch, also released at the end of the epoch, and an absolute increase between epochs of at most 100 MB.

I am very sorry about the inconvenience and thanks for your patience!

zwyking commented 3 years ago

Yeah, thanks for your reply. I had already corrected this problem during my debugging. Thanks for your excellent work!