Ugness / PiCANet-Implementation

Pytorch Implementation of PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
MIT License

Have you met memory leak problem when running model? #9

Closed Sucran closed 4 years ago

Sucran commented 6 years ago

Hi @Ugness, I ran into a RAM memory leak when running network.py and train.py, and this issue has confused me for a few days. Other PyTorch repos I have run are fine. I am running the code on Ubuntu 14.04 with PyTorch 0.4.1, CUDA 8.0, and cuDNN 6.0.

Sucran commented 5 years ago

@Ugness Sorry, the threshold I tested was 0.8. I have not tested your option yet; I need to wait for an available GPU in my lab. I also have a question: can you test the MAE result on all of DUTS-TE without modifying the dataset? The difference in MAE results confuses me.

Ugness commented 5 years ago

What do you mean by "without modifying the dataset"? I am going to upload all the results (MAE, F-measure, threshold) to Google Drive. I will also upload the list of image file names in my DUTS-TE dataset with it.

Sucran commented 5 years ago

@Ugness I mean that there should be 5019 images in DUTS-TE without deleting mismatched files; you should test on all 5019 images.

Ugness commented 5 years ago

But DUTS-TE-Mask has 2 more images than DUTS-TE-Image. My DUTS-TE-Image folder has 5019 images; I deleted 2 images from DUTS-TE-Mask because it had 5021.
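The mask/image mismatch above can be found mechanically by comparing file name stems in the two folders. This is a minimal sketch (not the repo's actual code; `find_unmatched_masks` is a hypothetical helper, and the folder names are only illustrative):

```python
import os

def find_unmatched_masks(image_dir, mask_dir):
    """Return mask base names that have no corresponding image file."""
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(image_dir)}
    mask_stems = {os.path.splitext(f)[0] for f in os.listdir(mask_dir)}
    # Masks whose stem never appears among the images are the extras to delete
    return sorted(mask_stems - image_stems)

# Example: find_unmatched_masks("DUTS-TE/DUTS-TE-Image", "DUTS-TE/DUTS-TE-Mask")
```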

RaoHaobo commented 5 years ago

Hi @Ugness, I integrated your measure.py and train.py files but did not change network.py. I set the batch size to 2. At the first learning-rate drop my training loss decreases, but after that, although the learning rate keeps dropping, the training loss never falls again. I tested my model on PASCAL-S, and the best MAE is 0.1243. Could you help me solve this problem?

Ugness commented 5 years ago

@RaoHaobo Can you give me some captures of your loss graph? You can find it in TensorBoard. I also think it would be better to open a new issue. Thanks.

RaoHaobo commented 5 years ago

I changed the learning rate to decrease every 15000 steps, but the training loss still stops falling, the same as with your 7000. My training loss and learning rate are as follows. Thanks!

[screenshot: training loss and learning rate graphs]

Ugness commented 5 years ago

I think that graph looks fine.

But if you think the loss should be lower, I recommend increasing the lr decay rate and lr decay step. For the hyperparameters in my code, I simply followed the implementation in the PiCANet paper with the DUTS dataset. As for MAE, it may be related to batch size: when I increased the batch size from 1 to 10 (or maybe 4, I do not remember exactly), the performance incrementally improved.

I'll let you know the specific scores when I find the past results, and I'll also upload the loss graph from my experiment. Thanks.
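For reference, the step-decay schedule being discussed (decay every `step_size` steps by a factor `gamma`, as in PyTorch's `StepLR`) shrinks the learning rate geometrically, which is why it can become very small after only a couple of decay boundaries. A minimal sketch of the arithmetic (the function name and values are illustrative, not from the repo):

```python
def decayed_lr(base_lr, gamma, step_size, step):
    """Step decay: the lr is multiplied by gamma once per step_size steps.

    This mirrors torch.optim.lr_scheduler.StepLR semantics:
    lr = base_lr * gamma ** (step // step_size)
    """
    return base_lr * gamma ** (step // step_size)

# With base_lr=1e-3, gamma=0.1, step_size=7000 (values mentioned in this thread),
# the lr is already 1e-5 by step 14000, which may be too small to keep training.
```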

RaoHaobo commented 5 years ago

I changed the lr decay rate from 0.1 to 0.5; the lr decay step is 7000. My loss is as follows: [screenshots: loss graphs]

Why does the training loss stop falling after one epoch? Did you encounter this problem?

ghost commented 5 years ago

Nice work and nice code! When I run `python train.py --dataset ./DUTS-TR`, an error occurs (it seems something is wrong with tensorboardX, but I have no idea what to do): [screenshot: traceback] Thanks for your reply!

RaoHaobo commented 5 years ago

@Dylanqyuan Your tensorboardX version is too high.

ghost commented 5 years ago

> @Dylanqyuan Your tensorboardX version is too high.

It works! Thank you, buddy!

Ugness commented 5 years ago

@RaoHaobo https://github.com/Ugness/PiCANet-Implementation/issues/16#issuecomment-479510008 I've uploaded my graph at that link. I also suggest you follow that link's 3 steps to check whether the model is trained properly.

My graph fluctuates like yours and does not look like it is decreasing either. Regarding your graph, I am concerned about the learning rate: I think it became too small to train the model effectively after 1 epoch. But I have not run an experiment on that, so it is just my personal opinion.

If you want to check your model's performance, I suggest you follow the steps at the link. If you are worried about the non-decreasing training loss, I suggest that you (and I as well) run more experiments with the learning rate and the other hyperparameters.

P.S. Please comment at #17 if you want to discuss this issue further, to make it easy to find!

RaoHaobo commented 5 years ago

@Ugness I tested your '36epo_383000step.ckpt' on PASCAL-S, and the result is [screenshot], but your result is [screenshot]. Why? Another problem: I added some of my own ideas to your code, and my model trains well: [screenshot]. But when I test my model with your measure_test.py, the test result is [screenshot].

RaoHaobo commented 5 years ago

@Ugness The second problem has been solved; the first one isn't solved yet.

Ugness commented 5 years ago

Sorry, I forgot to mention that all of my experiment results are on the DUTS dataset only. I have updated my README file. If you got my result from the README.md: I trained and tested the model ONLY on the DUTS dataset, so the result on the PASCAL-S dataset may differ.

RaoHaobo commented 5 years ago

@Ugness OK. [screenshot] This is your trained model, and I used it to test on PASCAL-S and SOD; the max F-measure is 0.8379. Could you test your model on other datasets?

RaoHaobo commented 5 years ago

@Ugness This code is in your measure_test.py: [screenshot]. But github.com/AceCoooool/DSS-pytorch solver.py has [screenshot]. I think they are very different.

Ugness commented 5 years ago

I used that `.sum(dim=-1)` because my code evaluates several images in parallel. github.com/AceCoooool/DSS-pytorch solver.py calculates precision/recall on a single image at a time, while my code calculates all images at once. The full shape of y_temp and mask is (batch, threshold_values, H, W). If I called a plain `.sum()`, it would sum all values in y_temp, although we should sum only over the H and W axes. As for the 1e-10, I added it to avoid division by zero. If you think my explanation is wrong, please give me your advice. Thanks.
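The batched precision/recall idea described above can be sketched as follows. This is not the repo's actual measure_test.py code; it is a minimal NumPy illustration (the function name is hypothetical) of summing only over the spatial axes so each (image, threshold) pair keeps its own counts, with an epsilon guarding against division by zero:

```python
import numpy as np

def batched_prec_recall(pred, mask, eps=1e-10):
    """pred, mask: binary arrays of shape (batch, thresholds, H, W).

    Summing over only the last two axes (H, W) keeps a separate
    true-positive count per image and per threshold; a plain .sum()
    would collapse everything into one scalar.
    """
    tp = (pred * mask).sum(axis=(-2, -1))                # (batch, thresholds)
    prec = (tp + eps) / (pred.sum(axis=(-2, -1)) + eps)  # eps avoids 0/0
    recall = (tp + eps) / (mask.sum(axis=(-2, -1)) + eps)
    return prec, recall
```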

RaoHaobo commented 5 years ago

@Ugness I mean that in `tp + 1e-10`, the 1e-10 could perhaps be taken out. I tried removing it, but the max F-measure dropped a lot. I also used the DSS code to test your model on DUTS-TE, and the result is bad.

Ugness commented 5 years ago

How much difference does that error cause? Is the difference significant? Let me know. Thanks.

RaoHaobo commented 5 years ago

When the threshold equals 1, the precision must be 0, but your result equals 1.
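A quick arithmetic sketch of this edge case: at threshold 1 no pixel is predicted positive, so both the true-positive count and the predicted-positive count are 0, and an epsilon added to both numerator and denominator makes the ratio exactly 1 instead of 0 (values below are illustrative, not taken from the repo's output):

```python
eps = 1e-10
tp = 0.0                  # no true positives at threshold 1
predicted_positive = 0.0  # nothing is predicted salient at threshold 1

# With eps in the numerator, 0/0 silently becomes eps/eps = 1.0,
# which inflates precision at the highest threshold.
prec = (tp + eps) / (predicted_positive + eps)
print(prec)  # 1.0 rather than the expected 0
```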

RaoHaobo commented 5 years ago

@Ugness The `writer.add_pr_curve()` function in measure.py doesn't work; the curve never shows up in TensorBoard. I think it is caused by the TensorBoard version. [screenshot]

Ugness commented 5 years ago

https://github.com/tensorflow/tensorboard/releases Would you try it with TensorBoard 1.8.0?