YuhuiMa / DFN-tensorflow

This repo is the tensorflow implementation of Discriminative Feature Network (DFN)
Apache License 2.0
100 stars 26 forks

the result of training #5

Open gaowu9595 opened 5 years ago

gaowu9595 commented 5 years ago

Why are focal_loss and mean_iou always zero as training progresses? I didn't change the model; I used PASCAL VOC 2012 for training.

YuhuiMa commented 5 years ago

The code can currently only separate two categories: foreground and background. I apply this project to OCT images for the task of choroidal segmentation, so running my code on PASCAL VOC 2012 will not work. The code will be updated later to accommodate multi-class segmentation tasks.

calicratis19 commented 5 years ago

I'm having the same issue: mean IoU and focal loss are always 0. I am using the MS COCO dataset, and I have modified it so that the segmentation map has only two values, 0 and 1, which is what this repository requires.

[screenshot from 2019-01-03 16-53-05]
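Roughly, the binarization looked like this (an illustrative sketch rather than my exact script; mask.png is a placeholder file name):

```python
import numpy as np
from PIL import Image

# Collapse a COCO-style multi-class mask into the 0/1 foreground mask
# this repo expects (placeholder file names, not my actual paths).
mask = np.array(Image.open("mask.png"))
binary = (mask > 0).astype(np.uint8)            # 0 = background, 1 = foreground
Image.fromarray(binary * 255).save("mask_binary.png")
```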

calicratis19 commented 5 years ago

The IoU is actually going in the opposite direction: it's decreasing. In the first couple of iterations you can see it drop to nearly 0. [screenshot from 2019-01-04 15-16-12]

I have used batches of 1 and 2 images for training. I can't increase the batch size due to an OOM error. Can this very small batch size be a reason for this issue?

YuhuiMa commented 5 years ago

Hello Tahlil Ahmed Chowdhury, it's not clear what is going wrong. Maybe you can save the segmentation maps during training and inspect them together with the loss and IoU. You can also send me your dataset if that's convenient. Good luck! Yuhui Ma



YuhuiMa commented 5 years ago

> The IoU is actually going in the opposite direction: it's decreasing. In the first couple of iterations you can see it drop to nearly 0. [screenshot from 2019-01-04 15-16-12]
>
> I have used batches of 1 and 2 images for training. I can't increase the batch size due to an OOM error. Can this very small batch size be a reason for this issue?

Yes, you need a GPU with more memory, or you should simplify the network structure, to avoid the OOM error.

calicratis19 commented 5 years ago

Hi @YuhuiMa, thanks very much for replying.

To resolve the OOM error I rescaled each input image and segmentation map to 256x256. After that I could train with a batch size of 8, but the issue is the same: the IoU starts at some value, then drops to 0 and stays there.
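For reference, the rescaling was along these lines (a sketch using OpenCV, not my exact change to utils.py; image.jpg and mask.png are placeholder paths):

```python
import cv2

# Bilinear interpolation for the image, nearest-neighbor for the mask
# so the mask stays strictly 0/1 after resizing.
image = cv2.imread("image.jpg")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

image = cv2.resize(image, (256, 256), interpolation=cv2.INTER_LINEAR)
mask = cv2.resize(mask, (256, 256), interpolation=cv2.INTER_NEAREST)
```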

Then, as you suggested, I saved the segmentation map at every step during training. To simplify debugging I trained on a single image, and for the batch input I used duplicates of that same image. I observed that every pixel of the output segmentation map is always 0; it doesn't change at all.

I also noticed that you subtract the mean from each image, which shifts the image values into roughly the range -1 to 1. I omitted that part so the values stay between 0 and 1, but there was still no improvement.

I have forked your repo and put the output segmentation maps in the Result folder. I also modified the code in utils.py to load a single image batch_size times and rescale it to 256x256. There is a Jupyter notebook, Visualize_Result.ipynb, which visualizes the training output of each step.

The input image is in the root folder of the repo, named 000000569046.jpg, and the corresponding segmentation map is named 000000569046.png.

I can provide more data if you need it. Again, many thanks for taking the time to reply to this issue. I really appreciate it.

EDIT: In my forked repo, training can be started and each step's segmentation output saved to the Result folder simply by running the following command:

python3 main.py --batch_size 8

calicratis19 commented 5 years ago

The output segmentation map during training contains negative values, which is not supposed to happen. At each step the max/min pixel values of the segmentation map are in a range like:

max: 0.044997886, min: -0.047073435
max: 0.03952158, min: -0.040719625
etc.

Why are the values in the training output segmentation map negative? Any idea?

calicratis19 commented 5 years ago

On each iteration, the absolute values of the max and min pixels of the output segmentation map are almost equal. In the image below, the two values after the line Training for iter 823/402312 are the max and min pixel values of the segmentation map. [screenshot from 2019-01-09 18-07-10]

YuhuiMa commented 5 years ago

Those are the raw network outputs (logits). You should compute the softmax of these values; then you will get pixel values from 0 to 1.
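For example (a minimal NumPy sketch, assuming each saved map holds the raw two-channel logits of shape (H, W, 2)):

```python
import numpy as np

# Stand-in for one saved map: raw two-channel logits, one channel per class.
logits = np.random.randn(256, 256, 2).astype(np.float32)

def softmax_map(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax_map(logits)        # per-pixel probabilities in [0, 1]
seg = probs.argmax(axis=-1)        # 0 = background, 1 = foreground
```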

calicratis19 commented 5 years ago

After calculating the softmax, here are the results of the first couple of steps: [screenshot from 2019-01-12 13-16-30]

calicratis19 commented 5 years ago

The above images are generated from the prediction and ground_truth values computed by the evaluation function in the dfn_model.py file.

YuhuiMa commented 5 years ago

It's strange that no foreground is segmented while the loss decreases. Let me check pw_softmaxwithloss_2d and focal_loss; they shouldn't have any problems. It's confusing. Maybe you could observe the values before the softmax and how they change during training.

shupinghu commented 4 years ago

I have met the same problem as you. Did you solve it later?

shupinghu commented 4 years ago

> The above images are generated from the prediction and ground_truth values computed by the evaluation function in the dfn_model.py file.

I have met the same problem as you. Did you solve it later? I find that the mean IoU does not decrease to 0 if I set the labels in utils.py like this:

labels[:, :, 0] = label
labels[:, :, 1] = ~label

The original setting is:

labels[:, :, 0] = ~label
labels[:, :, 1] = label

Do you know why this happened?
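For context, a minimal sketch of the label construction being discussed (assuming label is a boolean H x W foreground mask, as the use of ~label suggests):

```python
import numpy as np

# Hypothetical boolean foreground mask.
label = np.zeros((256, 256), dtype=bool)
label[64:192, 64:192] = True

labels = np.zeros(label.shape + (2,), dtype=np.float32)
labels[:, :, 0] = ~label   # original order: channel 0 = background
labels[:, :, 1] = label    # channel 1 = foreground

# Swapping the two assignments flips which channel the loss and mean IoU
# treat as foreground, which is why the metric behaves differently.
```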

mhkoosheshi commented 3 years ago

@calicratis19 Hello, regarding the OOM error: I've changed the code to avoid it, but now I get the following error. Do you know why, and could you help me?

ValueError: Cannot feed value of shape (8, 256, 256, 3) for Tensor 'Placeholder:0', which has shape '(8, 512, 512, 3)'

thank you

YuhuiMa commented 3 years ago

@mhkoosheshi Please make sure you have changed every 512 to 256 in dfn_model.py.
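The idea, as a minimal sketch (TF 1.x-style placeholders, as this repo uses; the shapes here are assumptions, not the exact code in dfn_model.py):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Keep the spatial size in one constant so the placeholders, the network,
# and the feed_dict all agree.
IMG_SIZE = 256      # change this instead of editing every literal 512
BATCH_SIZE = 8

images = tf.placeholder(tf.float32, [BATCH_SIZE, IMG_SIZE, IMG_SIZE, 3], name="images")
labels = tf.placeholder(tf.float32, [BATCH_SIZE, IMG_SIZE, IMG_SIZE, 2], name="labels")
```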

mhkoosheshi commented 3 years ago

Thanks @YuhuiMa, you are right, I had missed some occurrences in that file. I have three more questions, if you don't mind.

  1. What does "iter" mean? We are used to "epochs" for training steps, and epochs consist of iterations. Why are there only iters here?
  2. When I train with batch_size 1, there are 4050 iters to complete. With batch_size 2 it's 2020, and with 8 it's 5080. What is happening here? It doesn't make training faster, since there are more iters, and each iter seems to take the same time regardless of the batch_size.
  3. Your code saves the trained model every 3 iters, which is very time-consuming. Is that necessary for the code to work? If I change it to, for example, saving every 20 iters, would the code as a whole and its training be hurt?

Thanks for your time, I really appreciate it.

YuhuiMa commented 2 years ago

@mhkoosheshi

  1. One iter means feeding one batch of samples into the network during the training phase, while one epoch means feeding the whole training set through the network once. So the maximum number of iters can be calculated as: num_iters = num_trainsamples * num_epochs // batch_size (see the worked example after this list).
  2. In this project, the total amount of training depends on num_trainsamples and num_epochs (set in config.py) rather than on batch_size.
  3. You can reduce the frequency of saving the model to cut the training time.
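A quick worked example of the formula in point 1 (illustrative numbers only, not the real config.py values):

```python
# One epoch still covers the whole training set; a larger batch size just
# means fewer (larger) iterations for the same total number of samples.
num_trainsamples = 1010
num_epochs = 4

for batch_size in (1, 2, 8):
    num_iters = num_trainsamples * num_epochs // batch_size
    print(batch_size, num_iters)   # 1 -> 4040, 2 -> 2020, 8 -> 505
```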