duducheng / DenseSharp

[Cancer Research] 3D Deep Learning from CT Scans Predicts Tumor Invasiveness of Subcentimeter Pulmonary Adenocarcinomas
Apache License 2.0

Question about the evaluation metrics. #1

Closed EricKani closed 5 years ago

EricKani commented 5 years ago

Hi~

I am a student at the Beijing Institute of Technology, and we have some samples from a hospital. I am very excited about your work on pathological subtype classification.

I want to reproduce this work in PyTorch, but my results are not good. I want to figure out where the problem arises: the data quality, or a bug in my PyTorch code.

When I ran your code, I found that your metrics (f-measure, precision, and recall) are all calculated for binary classification. There is no multi-class metric in your code. Is there something wrong with my understanding? Thank you very much!!

Eric Kani

duducheng commented 5 years ago

Thanks for your interest in our work.

First, I would like to emphasize that the metrics in metrics.py are not the exact metrics we use in our paper. They are just some code from my previous projects. Please note that these metrics do not contribute to the loss; they only monitor the training. If you are interested in the metrics we used, please refer to our paper. We used MCC, macro-average F-measure, etc.
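
For reference, a minimal sketch of how those two paper metrics can be computed with sklearn (the label arrays here are purely hypothetical, and 0=AA, 1=MIA, 2=IAC is only an assumed ordering):

from sklearn.metrics import f1_score, matthews_corrcoef

# hypothetical ground-truth and predicted class indices
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

macro_f = f1_score(y_true, y_pred, average='macro')  # macro-average F-measure
mcc = matthews_corrcoef(y_true, y_pred)              # multi-class MCC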

Besides, I recommend using the Keras code, as Keras is well suited to typical MIC tasks, unless you need to operate on gradients or perform other complicated operations. We have a PyTorch version of DenseSharp as well, but we plan to release that code after our MICCAI submission.

DenseSharp is simply a multi-task network for segmentation and classification. Personally, I think the problem may come from the data quality, since cross-modal pathological prediction is a genuinely difficult task. You may also refer to our model results and the observer study.
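
Schematically, the multi-task setup looks like the toy model below (layer sizes, output names, and loss weights are illustrative only, not the actual DenseSharp configuration):

from keras.layers import Input, Conv3D, GlobalAveragePooling3D, Dense
from keras.models import Model

# toy multi-task model: one shared 3D backbone, a segmentation head and a classification head
inputs = Input(shape=(32, 32, 32, 1))
features = Conv3D(16, 3, padding='same', activation='relu')(inputs)
seg = Conv3D(1, 1, activation='sigmoid', name='seg')(features)
clf = Dense(3, activation='softmax', name='clf')(GlobalAveragePooling3D()(features))

model = Model(inputs=inputs, outputs=[clf, seg])
model.compile(optimizer='adam',
              loss={'clf': 'categorical_crossentropy', 'seg': 'binary_crossentropy'},
              loss_weights={'clf': 1.0, 'seg': 0.2})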

Good luck.

Jiancheng.

EricKani commented 5 years ago

Thank you,

I found that you use three class labels in the dataset, but you used the 'binary_crossentropy' loss when you compiled the model...

Does this mean the released code cannot reproduce the performance in your paper? I just want to verify my guess. If this repository is only a draft, I will not try to verify my implementation against this code.

emmmm... No offense intended; maybe my English expression is not very clear. Thank you!!

Eric Kani

Edit: Sorry, my fault... It is 'categorical_crossentropy' in your code. I don't know what was wrong with me... I just need to change your metrics...

duducheng commented 5 years ago

I found no typos in my training script train.py.

We did not provide evaluation scripts, since we are not able to open-source the data, which would make evaluation scripts meaningless. However, we use standard evaluation metrics; indeed, the metrics package in sklearn is used.

The repository contains an official reference implementation (not a draft) of DenseSharp networks. Feel free to use the reference training script, and explore.ipynb for understanding how the code is used.

EricKani commented 5 years ago

Thank you, and sorry for my carelessness.

I have reproduced the f-measure metrics for three-class classification, like below:

import tensorflow as tf
from keras import backend as K

# precision, recall and fmeasure are assumed to be the binary metrics
# defined in this repository's metrics.py
from metrics import precision, recall, fmeasure


# one-vs-rest metrics for each class, taken from the corresponding one-hot column
def AA_precision(y_true, y_pred):
    binary_truth = y_true[:, 0]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 0]
    return precision(binary_truth, binary_pred)

def AA_recall(y_true, y_pred):
    binary_truth = y_true[:, 0]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 0]
    return recall(binary_truth, binary_pred)

def AA_fmeasure(y_true, y_pred):
    binary_truth = y_true[:, 0]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 0]
    return fmeasure(binary_truth, binary_pred)

def MIA_precision(y_true, y_pred):
    binary_truth = y_true[:, 1]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 1]
    return precision(binary_truth, binary_pred)

def MIA_recall(y_true, y_pred):
    binary_truth = y_true[:, 1]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 1]
    return recall(binary_truth, binary_pred)

def MIA_fmeasure(y_true, y_pred):
    binary_truth = y_true[:, 1]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 1]
    return fmeasure(binary_truth, binary_pred)

def IAC_precision(y_true, y_pred):
    binary_truth = y_true[:, 2]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 2]
    return precision(binary_truth, binary_pred)

def IAC_recall(y_true, y_pred):
    binary_truth = y_true[:, 2]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 2]
    return recall(binary_truth, binary_pred)

def IAC_fmeasure(y_true, y_pred):
    binary_truth = y_true[:, 2]
    y_pred = K.argmax(y_pred, 1)
    y_pred = tf.one_hot(y_pred, 3)
    binary_pred = y_pred[:, 2]
    return fmeasure(binary_truth, binary_pred)

# F-measure weighted by the per-class support in the batch
def weighted_fmeasure(y_true, y_pred):
    AA_f = AA_fmeasure(y_true, y_pred)
    MIA_f = MIA_fmeasure(y_true, y_pred)
    IAC_f = IAC_fmeasure(y_true, y_pred)
    aa = K.cast(K.sum(y_true[:, 0]), tf.float32)
    mia = K.cast(K.sum(y_true[:, 1]), tf.float32)
    iac = K.cast(K.sum(y_true[:, 2]), tf.float32)

    return (aa * AA_f + mia * MIA_f + iac * IAC_f) / (aa + mia + iac)
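
As a quick sanity check on a tiny hand-made batch (hypothetical values, evaluated in the same TF 1.x / Keras style as above):

# hypothetical one-hot labels and softmax outputs for four nodules
y_true = K.constant([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
y_pred = K.constant([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.3, 0.4, 0.3]])
print(K.eval(weighted_fmeasure(y_true, y_pred)))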

But I don't know how to reproduce the multi-class MCC in Keras with sklearn.metrics.matthews_corrcoef...

import numpy as np
from sklearn import metrics

def mcc(y_true, y_pred):
    print('type: ', type(y_true), type(y_pred))
    print('shape: ', K.shape(y_true), K.shape(y_pred))
    # attempt to pull the symbolic Keras tensors into numpy and call sklearn
    with tf.Session():
        y_true = y_true.eval()
        y_pred = y_pred.eval()
        print(y_true)
        MCC = metrics.matthews_corrcoef(np.argmax(y_true, 1), np.argmax(y_pred, 1))
    return MCC

Could you please tell me what is wrong with my code? Thank you very much! Looking forward to your reply.

Eric Kani

duducheng commented 5 years ago

I did not use MCC for monitoring the training process, and I do not think I need to. I use MCC only once, in the evaluation.
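
For offline evaluation, something along these lines is enough (a sketch with hypothetical arrays, not our exact evaluation script; in practice the softmax outputs come from model.predict on the held-out fold):

import numpy as np
from sklearn.metrics import matthews_corrcoef

# hypothetical one-hot ground truth and softmax predictions for a held-out fold
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.3, 0.5, 0.2]])

print(matthews_corrcoef(np.argmax(y_true, 1), np.argmax(probs, 1)))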

EricKani commented 5 years ago

OK, I see. You only calculate MCC offline, for the test fold mentioned in your paper.

By the way, how many epochs are needed, in your experience with your code? I encounter severe overfitting in my training phase. I have 500 nodules, and after two epochs the performance begins to degrade...

Best

EricKani commented 5 years ago

And could you please tell me the performance of a single model in the evaluation phase (not the ensemble, and not on the test set)?

Thanks~

duducheng commented 5 years ago

Overfitting is easily encountered. We use cross-validation to search for the stopping epoch, which results in about 12 epochs. But the results vary between training runs, so we use an ensemble. For a single model, the metric is usually 1~2 points lower than the ensemble.
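
One simple form of prediction ensembling (a sketch with hypothetical softmax outputs, not necessarily our exact scheme) is to average the per-model probabilities:

import numpy as np

# hypothetical softmax outputs of three independently trained models on the same
# four test nodules (in practice these come from model.predict on the held-out fold)
fold_probs = [
    np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.4, 0.4, 0.2]]),
    np.array([[0.6, 0.3, 0.1], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6], [0.5, 0.3, 0.2]]),
    np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.3, 0.5, 0.2]]),
]

ensemble_probs = np.mean(fold_probs, axis=0)       # average the softmax outputs
ensemble_pred = np.argmax(ensemble_probs, axis=1)  # final class per nodule
print(ensemble_pred)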

EricKani commented 5 years ago

Thanks for your reply. My overfitting is very obvious: after only 180 iterations, my cross-entropy loss begins to degrade (it increases, while the segmentation loss stays normal). Each iteration uses a batch of 24 nodules. I am a little confused... This task is hard. Do you have any suggestions for me?