Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

Inconsistent metrics result between eval and denovo mode, and possible reason #122

Closed irleader closed 2 weeks ago

irleader commented 1 year ago

This is a follow-up of issue #111, thanks to all the information provided, I am able to find a possible bug and its reason.

I did experiment with weights "pretrained_excl_yeast.ckpt", deepnovo original raw yeast dataset, and Casanovo version 3.1.0.

In eval mode (results displayed on console): aa_precision: 0.7057459339992785 aa_recall: 0.7071716086435299 pep_recall: 0.5046985050978404

In denovo mode (results manually calculated from output mztab file): aa_precision: 0.6827751198617065 aa_recall: 0.6843725335842246 pep_recall: 0.5046985051027288

There is a difference between the results from eval and denovo mode (aa_precision, aa_recall), so I studied model.py and its PyTorch Lightning implementation.

I first considered whether this could be caused by predictions without '$' or with multiple '$$', since this line of code removes the first position of the prediction without checking whether it is '$'; I excluded this possibility (it turns out there are no such predictions). I am referring to v3.2.0; the code in v3.1.0 is similar.
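For illustration, a defensive version of that stripping step could look like the sketch below (names are illustrative, not Casanovo's actual code): only drop the leading token when it really is the stop token, rather than unconditionally removing the first position.

```python
# Hypothetical sketch of a defensive stop-token check, not the actual
# Casanovo implementation.
STOP_TOKEN = "$"

def strip_stop_token(tokens: list) -> list:
    """Remove a leading stop token from a decoded peptide, if present."""
    if tokens and tokens[0] == STOP_TOKEN:
        tokens = tokens[1:]
    # Also guard against duplicated stop tokens anywhere in the prediction.
    return [t for t in tokens if t != STOP_TOKEN]
```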

Then I checked the validation part of the code and found that the inconsistency might be caused by the way PyTorch Lightning accumulates metrics from each step and calculates per-epoch metrics. So I manually accumulated the prediction data from each step and calculated the metrics for one epoch, and found them to be the same as the results from denovo mode, which verifies my guess. For the manual calculation, I used the methods from evaluate.py.

Manual calculation from validation_step() output: aa_precision: 0.6827751198617065 aa_recall: 0.6843725335842246 pep_recall: 0.5046985051027288

To verify this further, I tried the deepnovo raw ricebean dataset and saw the same pattern.

In eval mode (results displayed on console): aa_precision: 0.5835929214378351 aa_recall: 0.583695772587527 pep_recall: 0.4223428193208169

In denovo mode (results manually calculated from output mztab file): aa_precision: 0.576437557764845 aa_recall: 0.5758105380102586 pep_recall: 0.4223428193248386

Manual calculation from validation_step() output: aa_precision: 0.576437557764845 aa_recall: 0.5758105380102586 pep_recall: 0.4223428193248386

So the conclusion is: PyTorch Lightning accumulates pep_recall from each step and calculates it correctly per epoch, but it may fail at the same job for aa_precision and aa_recall.
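A toy illustration (made-up numbers, not the actual Casanovo data) of how per-epoch aggregation can go wrong: an unweighted mean of per-batch precision values ignores each batch's denominator and generally differs from the precision computed over the pooled epoch counts.

```python
# (correct AAs, total AAs) per validation batch; batch 2 is smaller and worse.
batches = [
    (90, 100),
    (4, 10),
]

# Per-batch precisions averaged uniformly - what naive per-step logging
# followed by an unweighted epoch-level mean effectively computes.
mean_of_ratios = sum(c / t for c, t in batches) / len(batches)  # ≈ 0.65

# Precision over the pooled epoch counts - the correct aggregate.
global_ratio = sum(c for c, _ in batches) / sum(t for _, t in batches)  # ≈ 0.855

print(mean_of_ratios, global_ratio)
```

With equal batch sizes the two agree, which is why a metric like pep_recall (one value per spectrum, uniform batches) can come out right while AA-level metrics (a variable number of amino acids per batch) drift.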

P.S. I would again like to request the numerical values of Casanovo's AA/peptide precision/recall for the updated 9-species benchmark (Figure S5 from the preprint). Thanks a lot.

melihyilmaz commented 1 year ago

Hi,

Thanks for pointing out the PyTorch Lightning bug relating to metric accumulation over steps; it looks like that's the reason behind the discrepancy in results. Thanks also for checking for duplicate stop tokens - this shouldn't be happening in the current implementation, but we still added a check in the recent PR #123 to catch such cases.

Also sorry for the delay but you can find the Casanovo peptide/AA precision values at coverage=1 for the updated 9-species benchmark below (corresponding to Figure S2 and S5 from preprint):

mouse peptide precision=0.546022191276223 AA precision=0.8246832939458176

ricebean peptide precision=0.762752583426578 AA precision=0.8981027063629017

human peptide precision=0.7320067005545287 AA precision=0.8993333233520767

tomato peptide precision=0.7904856253285351 AA precision=0.909782514214078

clambacteria peptide precision=0.6535406820543996 AA precision=0.868507208789983

honeybee peptide precision=0.6471156883806154 AA precision=0.8149989493510443

mmazei peptide precision=0.7260779989677485 AA precision=0.8701884249037223

bacillus peptide precision=0.7817952157663147 AA precision=0.8807972308692891

yeast peptide precision=0.8383814663340254 AA precision=0.915618641976876

irleader commented 1 year ago

Thank you very much for the values!

For preprint Figure S2 and S5 only:

  1. I assume the results are from Casanovo v3.2.0, trained with the MASSIVEKB dataset and tested on each of the revised 9 species, with beam=10 and precursor window size=30 ppm, as mentioned in the preprint?

  2. For Casanovo bm (Casanovo v2.1.1) in Figure S2 and S5, is it also trained with the MASSIVEKB dataset, or is it trained with the other 8 species in the revised benchmark?

  3. The MASSIVEKB dataset has all 7 modifications, while the current release of the revised 9-species benchmark doesn't contain the 4 N-terminal modifications yet (#125). For these benchmark results, did you use exactly the same data as in this link, without N-terminal modifications?

Thanks!

melihyilmaz commented 1 year ago
  1. Yes
  2. Casanovo_bm was trained on the older version of the benchmark, i.e. trained on 8 species from the older benchmark and tested on the remaining 9th species from the original, similar to DeepNovo.
  3. The benchmark dataset these results were obtained from includes N-terminal modifications (see Figure S4 for results when a subset of PTMs that DeepNovo and Novor couldn't predict were excluded from the benchmark). See the update on #125 and check out the regenerated benchmark once it's available.
irleader commented 1 year ago

A similar situation has been discovered in train mode: the AA precision, AA recall, and peptide recall of the validation dataset displayed on the console are also much higher than the actual accuracy, which might likewise be caused by the PyTorch Lightning code in model.py. I would suggest not using these values. I have compared them with manually calculated AA precision and peptide recall values from the output mztab files, and the displayed values are consistently higher. They can only be used as an indicator of when to stop training, not as an accurate measure of performance on a benchmark dataset.

melihyilmaz commented 1 year ago

Thanks for pointing this out - we're also relying on manual calculation of precision values when reporting results.

wsnoble commented 1 year ago

Notes from our discussion just now:

One hypothesis (from Wout) is that this may be due to a mismatch in the batch size that throws the calculation off.

Another is that it has to do with how PyTorch Lightning aggregates results at the epoch level. But if that were the case, others would have run into this as well; googling did not turn anything up.

irleader commented 1 year ago

Thanks for letting me know!

wsnoble commented 1 year ago

I am reopening this because we are still hoping to track down this discrepancy. It might not be immediate, though.

bittremieux commented 1 year ago

Although the values seem different, this could also have been impacted by #252. We should check again after merging #253.

Lilferrit commented 4 months ago

I wrote a script to calculate the AA and peptide precision of the Casanovo PSMs, and I am seeing a minor discrepancy between the output of my script and Casanovo's evaluate mode. I ran both the script and evaluate mode on the sample data, which had no skipped spectra during sequencing, so the error is unlikely to originate from that. From what I can tell, the discrepancy comes either from an error in my script or from batch effects still present in eval mode. Here's the output from my script and from eval mode:

(casanovo_env) C:\Users\gavin\OneDrive\Documents\casanovo>casanovo evaluate sample_data\sample_preprocessed_spectra.mgf -c data/casanovo.yaml
WARNING: Dataloader multiprocessing is currently not supported on Windows or MacOS; using only a single thread.
Seed set to 454
INFO: Model weights file C:\Users\gavin\AppData\Local\casanovo\casanovo_v420_v4_2_0.ckpt retrieved from local cache
INFO: Casanovo version 4.1.1.dev15+gcedfaa7.d20240625
INFO: Sequencing and evaluating peptides from:
INFO:   sample_data\sample_preprocessed_spectra.mgf
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
sample_data\sample_preprocessed_spectra.mgf: 128spectra [00:00, 3208.59spectra/s]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Validation DataLoader 0:   0%|                                                                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]WARNING: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
WARNING: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
WARNING: D:\anaconda3\envs\casanovo_env\lib\site-packages\lightning\pytorch\utilities\data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 128. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

WARNING: D:\anaconda3\envs\casanovo_env\lib\site-packages\lightning\pytorch\utilities\data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 128. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.

Validation DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  0.27it/s]INFO: Step       Train loss      Valid loss      Peptide precision       AA precision
INFO: 0 nan     0.376451        0.492188        0.745707
Validation DataLoader 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  0.26it/s] 
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃         Validate metric         ┃          DataLoader 0           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│   AA precision at coverage=1    │       0.7457072734832764        │
│ Peptide precision at coverage=1 │            0.4921875            │
│          valid_CELoss           │        0.376451313495636        │
└─────────────────────────────────┴─────────────────────────────────┘
INFO: ======= End of Run Report =======
INFO: Run Start Time: 24/07/11 10:45:17
INFO: Run End Time: 24/07/11 10:45:37
INFO: Time Elapsed: 0:00:19.993431
INFO: Executed Command: D:\anaconda3\envs\casanovo_env\Scripts\casanovo evaluate sample_data\sample_preprocessed_spectra.mgf -c data/casanovo.yaml
INFO: Executed on Host Machine: DESKTOP-P03U1SR
INFO: Max GPU Memory Utilization: 381 MiB

(casanovo_env) C:\Users\gavin\OneDrive\Documents\casanovo>python val_bug_test.py
AA Precision: 0.7432542927167355
Pep Precision: 0.48437499996215816

Also, here is the test script:

import casanovo.denovo.evaluate
import depthcharge.masses
import pyteomics.mztab

def process_mgf_file(mgf_path):
    """Read the ground-truth peptide sequences from the SEQ= lines of an MGF file."""
    with open(mgf_path) as f:
        return [line[4:].strip() for line in f if line.startswith("SEQ=")]

def main():
    mz_tab = pyteomics.mztab.MzTab("foo.mzTab")
    output = mz_tab.spectrum_match_table["sequence"].tolist()
    # Raw string so the Windows path separator is not treated as an escape.
    ground_truth = process_mgf_file(r"sample_data\sample_preprocessed_spectra.mgf")
    peptide_masses = depthcharge.masses.PeptideMass()

    aa_precision, _, pep_precision = casanovo.denovo.evaluate.aa_match_metrics(
        *casanovo.denovo.evaluate.aa_match_batch(
            ground_truth,
            output,
            peptide_masses.masses,
        )
    )

    print(f"AA Precision: {aa_precision}")
    print(f"Pep Precision: {pep_precision}")

if __name__ == "__main__":
    main()
bittremieux commented 3 months ago

Your script is correct, so I think it's still an issue with aggregating the results across multiple batches.

Maybe this can be checked relatively easily by running the evaluation with a single batch only (by increasing the batch size and/or reducing the number of spectra). Although, come to think of it, the batch size for inference should be 1024 by default, so maybe that's already the case. Could you double-check this?

wsnoble commented 3 months ago

I haven't looked at the code, but doesn't it seem a bit crazy to try to compute these metrics in a batchwise fashion? Why not just aggregate a score vector and a label vector and then do a single call to the metric calculation at the end?

bittremieux commented 3 months ago

Currently the metrics are calculated at the end of each validation batch, where the loss is also calculated. Otherwise you have to keep all of the predictions and true labels in memory across the whole run, which can easily be on the order of millions. So in terms of code it's not really crazy but actually a natural way to do it - though it may lead to slightly incorrect results.

Lilferrit commented 3 months ago

I checked the default config, and the batch size for inference does indeed default to 1024. As for the batch effects: I may be misunderstanding the amino acid and peptide precision scores, but wouldn't it be possible to keep a count of the number of correct amino acids and peptides, as well as the total number of amino acids and peptides, and use these values to calculate the precision scores at the end of an evaluation run?
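The count-based aggregation described above could be sketched as follows (illustrative names, not Casanovo's actual code): accumulate raw counts per validation step and divide only once per epoch, so the result is independent of how spectra are split into batches.

```python
# Hypothetical count-based metric accumulator; a sketch, not the actual
# Casanovo implementation.
class PrecisionAccumulator:
    """Accumulate correct/total counts per step; compute the ratio per epoch."""

    def __init__(self):
        self.n_correct = 0
        self.n_total = 0

    def update(self, n_correct: int, n_total: int) -> None:
        """Add the counts from one validation step."""
        self.n_correct += n_correct
        self.n_total += n_total

    def compute(self) -> float:
        """Return the precision over all accumulated steps."""
        return self.n_correct / self.n_total if self.n_total else 0.0
```

Because only two integers are stored, this also avoids keeping all predictions and labels in memory across the run.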

bittremieux commented 3 months ago

We should first verify what the actual problem is. This batched data aggregation was only my hypothesis; we haven't isolated the issue yet. You could debug the code I referred to a bit more, see what the values there are, and try to find the point where they deviate.

Lilferrit commented 3 months ago

After some further investigation, it turns out that this issue persists even when there is only one validation step. I'm still trying to track down the exact discrepancy, though.

Lilferrit commented 2 months ago

The tentative implementation of #355 calculates the evaluation metrics in a way that is roughly equivalent to the test script. I'll reexamine this once #355 has been merged in.