choosehappy / QuickAnnotator

An open-source digital pathology based rapid image annotation tool
BSD 3-Clause Clear License

Model not learning #36

Open · naguileraleal opened this issue 1 year ago

naguileraleal commented 1 year ago

Hello! First of all, thank you for this awesome project!

I'm trying to use QA to annotate fibrosis on my images and I'm not having good results. This is what my images look like (attached: `016-22_1_4000_20000`, `016-22_1_4000_20000_mask`, `016-22_1_4000_20000_mask_overlay`).

Looking at the project's TensorBoard, I'm seeing that the validation loss diverges in some epochs and then, after a few epochs, drops back to normal values. I also see a lot of epochs with very large loss values. I tried training the model with several patch sizes (256, 512, 1024), and the loss behaves the same in every case.
These losses were obtained after training with 306 training ROIs and 181 validation ROIs, with a patch size of 256x256. Each ROI is 512x512 pixels. The negative class predominates over the positive class. (Attached: `train-loss` and `test-loss` curves.)

To find out why the model is not learning, I made some modifications to the train_model.py script: disabling the noise/blur/scaling data augmentations, disabling the test-time data augmentations, and logging each layer's gradients to TensorBoard (roughly as in the sketch below).
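For reference, this is roughly how per-layer gradient histograms can be logged (a minimal, self-contained sketch with a stand-in model, not the exact code in my modified script):

```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

model = nn.Conv2d(3, 2, kernel_size=3, padding=1)  # stand-in for the UNet
writer = SummaryWriter(log_dir="runs/gradient-debug")

out = model(torch.randn(1, 3, 8, 8))
out.sum().backward()  # stand-in for the real loss

# record a histogram of each layer's gradients after backward()
for name, param in model.named_parameters():
    if param.grad is not None:
        writer.add_histogram(f"gradients/{name}", param.grad, global_step=0)
```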

After disabling the noise/blur/scaling data augmentations, the loss values were smaller, but the behaviour was the same as before.

This is a histogram of the gradients of the model's last layer (attached: `last-layer-gradients`). I don't have a QA project that produces good segmentations, so I don't know what these gradients should look like. If someone has this information, it would be valuable to me.

The loss function is CrossEntropyLoss https://github.com/choosehappy/QuickAnnotator/blob/57a13580a40ea10fe47637b7718bbe7c1051a424/train_model.py#L259, the last layer is a Conv2d https://github.com/choosehappy/QuickAnnotator/blob/57a13580a40ea10fe47637b7718bbe7c1051a424/unet.py#L53, and there is no activation function after it https://github.com/choosehappy/QuickAnnotator/blob/57a13580a40ea10fe47637b7718bbe7c1051a424/unet.py#L66, so the output's values are unbounded. If the output assigns a probability of 0 (or very near 0) to the true class, as it could, the loss values should be very large, like the ones I'm seeing. Why not include a Softmax activation on the last layer's output? That's my next debugging move.
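To check this, here is a standalone sketch (not QA code) of how PyTorch's CrossEntropyLoss treats raw logits; per the PyTorch docs it applies a log-softmax internally, which is why there is no activation after the last Conv2d:

```python
import torch
import torch.nn.functional as F

# CrossEntropyLoss expects raw (unbounded) logits and applies
# log-softmax internally, i.e. it equals NLLLoss on log-softmax output
logits = torch.randn(4, 2, 8, 8)           # (batch, classes, H, W)
target = torch.randint(0, 2, (4, 8, 8))    # per-pixel class indices

ce = F.cross_entropy(logits, target)
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)
assert torch.allclose(ce, nll)  # adding a Softmax before this loss would normalize twice
```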

Pixels that belong to the "Unknown" class are assigned the -1 label in the ground truth https://github.com/choosehappy/QuickAnnotator/blob/57a13580a40ea10fe47637b7718bbe7c1051a424/train_model.py#L90 and later on this mask is passed to the CrossEntropyLoss function. What is the result of applying this function to negative target values? Does it ignore those pixels? I would like these pixels not to be considered during training. Is that the effect of passing a negative value to this loss?
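(Checking the PyTorch docs: targets are skipped only when they equal the loss's `ignore_index` argument, which defaults to -100, so -1 pixels are excluded only if the criterion is built with `ignore_index=-1`. A standalone sketch:)

```python
import torch
import torch.nn as nn

# with ignore_index=-1, pixels labeled -1 contribute neither loss nor gradient
criterion = nn.CrossEntropyLoss(ignore_index=-1)

logits = torch.randn(1, 2, 4, 4, requires_grad=True)
target = torch.full((1, 4, 4), -1, dtype=torch.long)  # everything "Unknown"
target[0, 0, 0] = 1                                   # a single labeled pixel

loss = criterion(logits, target)  # mean is taken over labeled pixels only
```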

Any help is much appreciated.

jacksonjacobs1 commented 1 year ago

Hi naguileraleal,

Thank you for the question! First of all, have you made any progress with this issue since you raised it?

I'll need a little more information about your training and validation sets. Can you tell me the labeling distribution (positive present vs. all negative) for each set?

naguileraleal commented 1 year ago

Hello! Sorry for the delay. I've spent a long time trying to debug the training, without success.

About the proportion of positive vs. negative pixels: the value of the pclassweight parameter used for training the UNet is 0.939710278448174, which I know reflects a heavily imbalanced dataset. The ratio of positive to total pixels is 0.0543 for the training set and 0.0535 for the validation set. I have no pixels in the 'Unknown' category.
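(For context, a weight like this is usually handed to CrossEntropyLoss as a per-class weight vector. A minimal sketch of one common wiring, assuming pclassweight is used as the positive-class weight; the exact code in QA's train_model.py may differ:)

```python
import torch
import torch.nn as nn

pclassweight = 0.939710278448174  # positive-class weight from my project

# up-weight the rare positive class, down-weight the dominant negative class
class_weights = torch.tensor([1.0 - pclassweight, pclassweight])
criterion = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-1)
```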

jacksonjacobs1 commented 1 year ago

Thanks for the information. I would recommend taking some measures to make your classes more balanced. One strategy would be to select and annotate only patches in which fibrosis regions are present (see the sketch below). Does that make sense?
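A minimal sketch of that kind of filtering, assuming masks are stored as grayscale PNGs in which nonzero pixels mark fibrosis (the directory layout and filenames here are hypothetical):

```python
from pathlib import Path

import numpy as np
from PIL import Image

def has_positive_pixels(mask_path: Path, min_fraction: float = 0.0) -> bool:
    """Return True if more than min_fraction of the mask's pixels are positive."""
    mask = np.array(Image.open(mask_path))
    return (mask > 0).mean() > min_fraction

# keep only ROIs whose masks actually contain fibrosis
mask_dir = Path("masks")  # hypothetical directory of 512x512 ROI masks
selected = [p for p in mask_dir.glob("*_mask.png") if has_positive_pixels(p)]
```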

naguileraleal commented 1 year ago

Removing all of the all-background patches solved the issue. There were a lot of them (~60% of the dataset). This should be a consideration when using QuickAnnotator to annotate anomalies.

Thanks for your help!