Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler
MIT License

New metric generation and reporting #40

Closed by dbuscombe-usgs 1 year ago

dbuscombe-usgs commented 2 years ago

First, I think adopting a similar approach to this, i.e. generating all stats from the confusion matrix (CM), and also reporting the CM itself, makes much more sense.

Also, explore the Matthews correlation coefficient,

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7, generalized to multiclass
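
For reference, the multiclass generalization (often called the R_K statistic) can be written entirely in terms of confusion-matrix quantities, with s the total sample count, c the number of correct predictions (the CM trace), and t_k, p_k the true and predicted counts for class k; this is the form implemented later in this thread:

$$\mathrm{MCC} = \frac{c\,s - \sum_k t_k\,p_k}{\sqrt{\left(s^2 - \sum_k p_k^2\right)\left(s^2 - \sum_k t_k^2\right)}}$$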

dbuscombe-usgs commented 2 years ago

A function that will take a validation subset, and model predictions on that subset, create a confusion matrix, then report an array of stats from the confusion matrix,

as well as the following stats not generated (to my knowledge) from the CM,

to be added to the end of the do_train script.
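
For illustration, a minimal sketch of the CM-to-stats idea (the function name and exact stat set here are hypothetical, not the shipped implementation):

import numpy as np
import tensorflow as tf

def stats_from_confusion_matrix(true_labels, pred_labels, num_classes):
    # build a single confusion matrix from flattened integer label images
    cm = tf.math.confusion_matrix(
        true_labels.flatten(), pred_labels.flatten(), num_classes=num_classes
    ).numpy().astype('float')
    tp = np.diag(cm)                  # true positives per class
    fp = cm.sum(axis=0) - tp          # false positives per class
    fn = cm.sum(axis=1) - tp          # false negatives per class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)         # per-class intersection-over-union
    freq = cm.sum(axis=1) / cm.sum()  # class frequencies, for weighting
    return {
        'OverallAccuracy': tp.sum() / cm.sum(),
        'MeanIntersectionOverUnion': np.nanmean(iou),
        'Frequency_Weighted_Intersection_over_Union': np.nansum(freq * iou),
        'F1Score': f1,
        'Recall': recall,
        'Precision': precision,
    }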

dbuscombe-usgs commented 2 years ago

That way, re-generating the stats would only require running train_model.py again with do_train=False in the config file.

dbuscombe-usgs commented 2 years ago

Lots of good ideas here: https://nowosad.github.io/post/motif-bp2/ (this link is now dead)

dbuscombe-usgs commented 2 years ago

This is still important to look at. On multiclass problems, mean IoU is often pegged at 1.0 from the start of model training, which is not correct. Also, multiple metrics would be useful for some downstream model applications.

dbuscombe-usgs commented 2 years ago

A new issue is that KLD appears always to be infinity for multiclass models in the evaluation step at the end of train_model.py. @CameronBodine are you seeing the same thing?

CameronBodine commented 2 years ago

Yes, KLD on train and validation subsets is always infinity. I don't recall it being an issue when doing my binary depth model.

dbuscombe-usgs commented 2 years ago

KLD = infinity was caused by non-normalized, non-integer model outputs.

The solution is to compute the argmax, then one-hot encode:

import numpy as np
import tensorflow as tf

kl = tf.keras.losses.KLDivergence()  # instantiate the metric object

# collapse the per-class scores to integer class labels
est_label = np.argmax(est_label.squeeze(), axis=-1)

# one-hot encode the integer labels into an (nx, ny, NCLASSES) stack
nx, ny = est_label.shape
lstack = np.zeros((nx, ny, NCLASSES))
lstack[:, :, :NCLASSES] = (np.arange(NCLASSES) == est_label[..., None]).astype(int)

# compute KLD on the one-hot encoded integer tensors
kld = kl(tf.expand_dims(tf.squeeze(lbl), 0), lstack).numpy()

dbuscombe-usgs commented 2 years ago

I have implemented all of the metrics based on this codebase, i.e. generating all stats from the CM.

Need to update docs to credit the above

For example, this is a random model output (bad). The CM-based mean IoU much more accurately reflects the situation (generally bad), as does mean KLD, which is now fixed:

Mean of mean IoUs (validation subset)=1.000
Mean of mean IoUs, confusion matrix (validation subset)=0.130
Mean of mean frequency weighted IoUs, confusion matrix (validation subset)=0.238
Mean of mean Dice scores (validation subset)=0.873
Mean of mean KLD scores (validation subset)=1.329

These will appear in train_model.py as part of the final validation step, computed on 10 batches of validation samples using a modified plotcomp_n_metrics.

This function also creates two files of per-image model metrics. Example files computed on a small number of validation samples are below (just for illustration):

noaa_spring2022_resunet_model1_model_metrics_per_sample.csv
noaa_spring2022_resunet_model1_model_metrics_per_sample_per_class.csv

These mods require a mod to doodleverse_utils that collects all the model metrics in its own .py script, useful also for Zoo and other extensibility.

dbuscombe-usgs commented 2 years ago

Started a new metrics branch with these changes implemented. One could use this branch to run train_model.py to generate new sets of metrics for already-trained models, using do_train: false in the config file.

Next is the Keras implementation of the Matthews correlation coefficient. Looks straightforward:

>>> metric = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=2)
>>> metric.update_state(y_true, y_pred)
>>> result = metric.result()
>>> result.numpy()
-0.33333334

Requires pip install tensorflow_addons in the conda env

dbuscombe-usgs commented 2 years ago

.... was not straightforward, but a solution has been obtained

First, this is easily tested and works on arbitrary matrices of zeros and ones:

import numpy as np
import tensorflow_addons as tfa

size = (768, 768)
y_true = np.random.randint(0, 1, size=size)
y_pred = np.random.randint(0, 1, size=size)
metric = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=2)
metric.update_state(y_true, y_pred)
metric.result()

However, I have not been able to determine how to adapt this to multiclass. For example,

size = (768, 768)
y_true = np.random.randint(0, 2, size=size)
y_pred = np.random.randint(0, 2, size=size)
metric = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=3)
metric.update_state(y_true, y_pred)
metric.result()

yields

InvalidArgumentError: in user code:

    File "/home/marda/anaconda3/envs/gym/lib/python3.10/site-packages/tensorflow_addons/metrics/matthews_correlation_coefficient.py", line 85, in update_state  *
        new_conf_mtx = tf.math.confusion_matrix(

    InvalidArgumentError: `labels` out of bound
    Condition x < y did not hold.
    First 3 elements of x:
    [0 0 0]
    First 1 elements of y:
    [3]

There seems to be no documentation on what arguments update_state takes: should the values be a certain dtype or shape? Arrays or tensors?

If I one-hot encode

y_true_1h = np.zeros((size[0], size[1], 3))
y_true_1h[:, :, :3] = (np.arange(3) == y_true[..., None]).astype(int)

y_pred_1h = np.zeros((size[0], size[1], 3))
y_pred_1h[:, :, :3] = (np.arange(3) == y_pred[..., None]).astype(int)

metric = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=3)
metric.update_state(y_true_1h, y_pred_1h)
metric.result()

I get the same error, which I do not understand. I can't seem to find further examples or docs, and further trials yielded nothing more of note.
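
With hindsight, a possible explanation: the tfa source at the line shown in the traceback appears to build its confusion matrix from the argmax of the inputs over axis 1, which suggests update_state expects 2-D (num_samples, num_classes) arrays; on a (768, 768) image that argmax yields column indices up to 767, hence the out-of-bound labels. (Note also that np.random.randint(0, 1, ...) yields all zeros because the upper bound is exclusive, so every argmax is zero and in bounds, which may be why the binary case above appears to work.) A speculative, untested workaround under that assumption:

# speculative: flatten the one-hot images to (num_pixels, num_classes) so the
# argmax inside update_state recovers per-pixel class labels
metric = tfa.metrics.MatthewsCorrelationCoefficient(num_classes=3)
metric.update_state(y_true_1h.reshape(-1, 3), y_pred_1h.reshape(-1, 3))
metric.result()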

Next, sklearn (not a current gym dependency) has a similar implementation that might work, e.g.

from sklearn.metrics import matthews_corrcoef
matthews_corrcoef(y_true.flatten(), y_pred.flatten()) 

However, I found an implementation here that I modified to create my own implementation (without any additional dependencies), consistent with the other metrics:

import tensorflow as tf

def MatthewsCorrelationCoefficient(confusionMatrix):
    # row and column marginals: per-class true and predicted counts
    t_sum = tf.reduce_sum(confusionMatrix, axis=1)
    p_sum = tf.reduce_sum(confusionMatrix, axis=0)

    n_correct = tf.linalg.trace(confusionMatrix)
    n_samples = tf.reduce_sum(p_sum)

    # covariance terms of the multiclass (R_K) generalization of MCC
    cov_ytyp = n_correct * n_samples - tf.tensordot(t_sum, p_sum, axes=1)
    cov_ypyp = n_samples ** 2 - tf.tensordot(p_sum, p_sum, axes=1)
    cov_ytyt = n_samples ** 2 - tf.tensordot(t_sum, t_sum, axes=1)

    cov_ytyp = tf.cast(cov_ytyp, 'float')
    cov_ytyt = tf.cast(cov_ytyt, 'float')
    cov_ypyp = tf.cast(cov_ypyp, 'float')

    mcc = cov_ytyp / tf.math.sqrt(cov_ytyt * cov_ypyp)
    # a degenerate confusion matrix (e.g. all one class) yields 0/0; return 0
    if tf.math.is_nan(mcc):
        mcc = tf.constant(0, dtype='float')
    return mcc.numpy()
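
For reference, a minimal (hypothetical) usage sketch on the earlier random arrays, building the confusion matrix with tf.math.confusion_matrix from flattened integer labels:

cm = tf.math.confusion_matrix(y_true.flatten(), y_pred.flatten(), num_classes=3)
MatthewsCorrelationCoefficient(cm)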

This seems to work. An example output of the new metric-generating function:

{'OverallAccuracy': 0.6961568196614584,
 'Frequency_Weighted_Intersection_over_Union': 0.5368461005780909,
 'MeanIntersectionOverUnion': 0.3696620352309895,
 'F1Score': array([       nan, 0.67122082, 0.43898921, 0.81816669]),
 'Recall': array([0.        , 0.62919626, 0.29834483, 0.79084303]),
 'Precision': array([0.        , 0.71926089, 0.83049958, 0.84744599]),
 'MatthewsCorrelationCoefficient': 0.43273914}

Some plots of the relationship between metrics with a sample dataset; MCC tracks with mean IoU:

[figure: metric relationship plots]

dbuscombe-usgs commented 2 years ago

New doodleverse_utils version 0.0.4 is posted that contains new model metrics. For now, this is only required for users of the new metrics branch of segmentation gym.

pip install doodleverse-utils -U to upgrade, from within the activated gym conda environment.

Further, I have now tested the code using an already-trained model, for both multiclass and binary problems.

Here is a plot of metrics for a binary (water/no water) model:

[figure: metrics for the binary model]

dbuscombe-usgs commented 2 years ago

I believe I can close this issue, but first will do some tests with the new functions. In due course, @CameronBodine, it would be helpful if you could check out the new_metrics branch, update doodleverse_utils, and trial it on your greyscale multiclass model - ta!