baidu-research / NCRF

Cancer metastasis detection with neural conditional random field (NCRF)
Apache License 2.0

CRF implementation in Keras is not giving good results #34

Closed OpenCv30 closed 5 years ago

OpenCv30 commented 5 years ago

Hi, I am a student working on Camelyon'16 as my Master's project. I was going through your very impressive paper (Yi Li and Wei Ping, "Cancer Metastasis Detection With Neural Conditional Random Field", Medical Imaging with Deep Learning (MIDL), 2018) and found that your code implements a CRF on top of ResNet-18. So far I have been using ResNet-50, but my FROC score is not going above 0.55.

So, I decided to adopt your approach and re-implemented your code in Keras (TensorFlow backend). But the performance of the trained model is not even close to your results: the best FROC of my ResNet-18+CRF model is 0.55, and a lot of false positives appear. My ResNet-18 is taken from https://github.com/raghakot/keras-resnet

My queries:

Can you help me understand why the loss does not go further down? It plateaus after a certain number of epochs, and after that nothing changes, even with a cyclic learning rate [1e-4, 1e-5, 1e-7]. This behavior is common across many models (ResNet-50/101/18 and Inception V3). Please help me solve this problem; I shall be thankful to you.
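For context, this is roughly how I cycle the learning rate in Keras (a minimal sketch; the 10-epoch stage length is illustrative, not my exact schedule):

```python
from keras.callbacks import LearningRateScheduler

# Step through the LRs mentioned above; the 10-epoch stage length
# is an illustrative assumption, not the exact setting used.
cycle_lrs = [1e-4, 1e-5, 1e-7]

def cyclic_lr(epoch):
    stage = (epoch // 10) % len(cycle_lrs)
    return cycle_lrs[stage]

lr_callback = LearningRateScheduler(cyclic_lr, verbose=1)
# model.fit(..., callbacks=[lr_callback])
```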

Training configuration: [screenshot attached]

TensorBoard ACC/LOSS plots: [images attached]

Validation loss and training loss (after 16 epochs); orange is val loss, blue is training BCE loss: [plot attached]

CRF weight plots across epochs: [plot attached]

Weight map from epoch 16, from which the heatmap is generated: [image attached]

Heatmap for Test_001.tiff (Cam'16 test data set), result from my trained model at level 8: [image attached]

Result from your model at level 6: [image attached]

Clearly, your model performs far better than my trained model.

I have checked my Keras CRF implementation against yours in PyTorch: for the same input, both models produce the same output.
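For reference, the parity check was along these lines (a sketch; I am assuming the repo's `CRF` module lives in `wsi.model.layers`, and `keras_crf_model` is a hypothetical name for a model wrapping my layer with the same `W` weights loaded):

```python
import numpy as np
import torch
from wsi.model.layers import CRF  # NCRF's PyTorch CRF module (path as I understand the repo)

np.random.seed(0)
batch, num_nodes, feat_dim = 2, 9, 512  # 3x3 grid of patches, as in the paper

feats = np.random.randn(batch, num_nodes, feat_dim).astype("float32")
logits = np.random.randn(batch, num_nodes, 1).astype("float32")

# Reference output from the PyTorch CRF. W is initialized to zeros in
# the repo, so randomize it to make the comparison non-trivial.
torch_crf = CRF(num_nodes=num_nodes, iteration=10)
torch_crf.W.data.normal_(0, 0.1)
ref = torch_crf(torch.from_numpy(feats), torch.from_numpy(logits)).detach().numpy()

# Output from the Keras layer, after copying the same W into it:
# out = keras_crf_model.predict([feats, logits])
# assert np.allclose(out, ref, atol=1e-5)
```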

Please help me reproduce your results in Keras + TF.

yil8 commented 5 years ago

@OpenCv30 Hi, so sorry for the late reply. First of all, did you only try resnet50 through my repo? Other users ran the default resnet18 config and were able to reproduce my results. I would definitely recommend reproducing my results through my repo first before trying to re-implement it yourself.

OpenCv30 commented 5 years ago

Thanks for your reply. It is my pleasure to interact with you.

I am using the ResNet-50 architecture from Keras without ImageNet weights, as I am not good at PyTorch. I have reproduced the heatmap of Test026.tif, and it matches the one you shared, so there is no problem with the model checkpoints you provided. By the way, is it normal training behavior that the loss initially falls and later stays the same without any change? Please suggest what results you would like to see, or what I should try in order to improve my model's performance. If you want, I can share my implementation of the CRF layers in Keras; a condensed sketch is below.
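This paraphrases the mean-field updates of your PyTorch `CRF` module from memory, so it is a sketch rather than the exact code I train with:

```python
import tensorflow as tf
from keras.layers import Layer

class KerasCRF(Layer):
    """Mean-field CRF over a grid of patches, as in the NCRF paper.

    Inputs: [feats, logits]
      feats  -- (batch, num_nodes, feat_dim) patch embeddings
      logits -- (batch, num_nodes, 1) unary logits from the CNN
    """
    def __init__(self, num_nodes, iterations=10, **kwargs):
        self.num_nodes = num_nodes
        self.iterations = iterations
        super(KerasCRF, self).__init__(**kwargs)

    def build(self, input_shape):
        # Learnable pairwise weights between patch positions.
        self.W = self.add_weight(name='W',
                                 shape=(1, self.num_nodes, self.num_nodes),
                                 initializer='zeros',
                                 trainable=True)
        super(KerasCRF, self).build(input_shape)

    def call(self, inputs):
        feats, logits = inputs
        # Cosine similarity between patch embeddings.
        norm = tf.norm(feats, axis=2, keepdims=True)               # (b, n, 1)
        pairwise_sim = (tf.matmul(feats, feats, transpose_b=True) /
                        tf.matmul(norm, norm, transpose_b=True))   # (b, n, n)
        # Symmetrized pairwise potential.
        W_sym = (self.W + tf.transpose(self.W, [0, 2, 1])) / 2.0
        pairwise_potential = pairwise_sim * W_sym
        unary = logits
        for _ in range(self.iterations):
            probs = tf.transpose(tf.sigmoid(logits), [0, 2, 1])    # (b, 1, n)
            # Expectation of the pairwise potential under current marginals.
            pairwise_E = tf.reduce_sum(
                probs * pairwise_potential - (1.0 - probs) * pairwise_potential,
                axis=2, keepdims=True)                             # (b, n, 1)
            logits = unary + pairwise_E
        return logits

    def compute_output_shape(self, input_shape):
        return input_shape[1]
```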

yil8 commented 5 years ago

@OpenCv30 Typically, deep neural networks will essentially overfit the training data, so I would expect the training loss to approach 0 if the dataset is not gigantic. Sorry, but I don't have enough time to look into your code/implementation of the CRF and check whether it's correctly implemented. On the other hand, I thought PyTorch was pretty easy to pick up...
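A quick sanity check along these lines (a sketch; `model`, `x_small`, `y_small` are placeholders for your Keras model and a few hundred training patches):

```python
# If the pipeline is healthy, BCE on a tiny memorized subset should
# approach 0; if it stalls, the bug is not about dataset size.
history = model.fit(x_small, y_small, epochs=200, batch_size=32, verbose=0)
print("final BCE on memorized subset:", history.history["loss"][-1])
```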

OpenCv30 commented 5 years ago

@yil8 As you suggested, I will first learn PyTorch so that I can train a model from your code directly and generate heatmaps for the Cam'16 test data to get the desired FROC. But as such, I have never seen a BCE loss come close to zero on the Camelyon dataset.
I understand that you might not have time to look at my code. But can you please tell me, from my training loss/acc profiles and weight maps, whether the training is correct or not? My understanding is that the model was training initially, since the CRF weights were changing, but later the CRF weights became stagnant; it might be a vanishing-gradient issue. Does it make sense to use this model to generate hard negatives (sketched below) and train again, even though your 200,000 points already include hard negatives?
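By generating hard negatives I mean something like this sketch (`model` and `normal_patches` are placeholders for my trained Keras model and patches sampled from normal slide regions):

```python
import numpy as np

# Score candidate normal patches with the current model and keep the
# highest-scoring false positives as extra negatives for retraining.
probs = model.predict(normal_patches).ravel()
hard_idx = np.argsort(probs)[-5000:]   # 5000 is an arbitrary budget
hard_negatives = normal_patches[hard_idx]
# retrain with the original patches plus hard_negatives (all labeled 0)
```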

I will get back to you soon; please don't close this issue.

yil8 commented 5 years ago

@OpenCv30 It's indeed weird that you can't make your BCE loss go to zero on the training data. Can you turn off the CRF, train a pure CNN, and see whether the BCE loss goes to zero then? That way you will know whether the bug is in the CNN part or the CRF part.
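In Keras terms, the ablation is just a plain per-patch head with no CRF on top; a minimal sketch (`backbone` is a placeholder for the ResNet-18 feature extractor):

```python
from keras.layers import Input, Dense
from keras.models import Model

def build_baseline(backbone):
    # backbone: any Model mapping a 224x224x3 patch to a feature vector
    # (placeholder for the user's ResNet-18).
    x_in = Input(shape=(224, 224, 3))
    feats = backbone(x_in)
    prob = Dense(1, activation="sigmoid")(feats)  # unary prediction only, no CRF
    return Model(x_in, prob)

# model = build_baseline(my_resnet18)
# model.compile(optimizer="sgd", loss="binary_crossentropy")
# If this pure CNN's training BCE also stalls, the bug is in the
# CNN/data pipeline rather than the CRF.
```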

OpenCv30 commented 5 years ago

@yil8 - As you suggested, I started training only ResNet-18, with the patch locations you shared. I am again seeing the loss plateau, this time near 0.88; it does not go down even after 40 epochs. Blue line: training loss. Orange line: val loss.

[Loss plot for ResNet-18-only training attached]

The highest validation accuracy is 0.930. Please suggest what could be wrong. I am using the same training framework in which I previously saw the loss reach around 0.10-0.15 before plateauing.

yil8 commented 5 years ago

@OpenCv30 Are you training with my code or with your own code? It's also strange that your val loss is significantly lower than your train loss, which has almost never happened in my experience...