Usage of score calibration

apcamargo / genomad

geNomad: Identification of mobile genetic elements

https://portal.nersc.gov/genomad/


Usage of score calibration #126

Open zehanna opened 2 months ago

zehanna commented 2 months ago

Hi, I have a question regarding the --enable-score-calibration flag. I understand that it takes the composition of the sample into account to output a 'real' probability, e.g. a plasmid score of 0.4 would mean that there is a 40% chance that the sequence is a plasmid, and that without score calibration the scores are not actual probabilities. What I'm wondering is: if I run genomad on sequences that I have already preselected on some criteria, e.g. a collection of short, circular contigs extracted from multiple metagenomes, is it recommended to use score calibration or not? In that collection I probably have a higher chance of finding plasmids than in a 'natural' assembly of an environmental metagenome. So, is score calibration recommended only for 'natural' samples, or also for samples where sequences that are more likely to be viral or plasmid have already been pre-selected?

apcamargo commented 1 month ago

Score calibration should work fine in cases like this. I recommend using it.

The performance of the calibration drops a bit when the sample composition is extreme (e.g., 99% plasmids), but geNomad automatically detects cases like this and deals with them properly (see these lines, in case you're curious).
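To make the idea of composition-aware scores concrete, here is a minimal Python sketch of a generic Bayes prior-shift adjustment. It is not geNomad's implementation: the function name, the uniform training prior, and all numbers are hypothetical and only illustrate why the same raw score can map to different probabilities in plasmid-poor and plasmid-rich samples.

```python
def reweight_by_composition(scores, training_prior, sample_composition):
    """Rescale class scores for a new class composition (generic prior shift).

    Illustration only; this is NOT how geNomad computes calibrated scores.
    """
    weighted = {
        label: scores[label] * sample_composition[label] / training_prior[label]
        for label in scores
    }
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}


# Hypothetical values, chosen only for the illustration.
uniform = {"chromosome": 1 / 3, "plasmid": 1 / 3, "virus": 1 / 3}
raw = {"chromosome": 0.5, "plasmid": 0.4, "virus": 0.1}
plasmid_poor = {"chromosome": 0.90, "plasmid": 0.03, "virus": 0.07}
plasmid_rich = {"chromosome": 0.30, "plasmid": 0.60, "virus": 0.10}

for name, composition in [("poor", plasmid_poor), ("rich", plasmid_rich)]:
    adjusted = reweight_by_composition(raw, uniform, composition)
    print(name, round(adjusted["plasmid"], 3))
```

In this toy setting the same raw plasmid score of 0.4 drops sharply in the plasmid-poor sample and rises in the plasmid-rich one, which is the intuition behind calibrating on a preselected collection.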

zehanna commented 1 month ago

Hi @apcamargo, thanks a lot for your answer. After running genomad with score calibration, I'm a little confused about some aspects of the output. For example, in the file calibrated_aggregated_classification.tsv, some contigs have plasmid scores very close to 1 (which, if I'm correct, means their probability of being a plasmid is ~100%), but they are not in the file plasmid_summary.tsv. How can this happen?

apcamargo commented 1 month ago

My guess is that the empirical plasmid fraction within the sample was very low, but the algorithm should be robust to cases like this. Can you share the contents of <prefix>_score_calibration/<prefix>_compositions.tsv?

These contigs were excluded from <prefix>_plasmid_summary.tsv because their calibrated scores were lower than 0.7 (default cutoff). You can try to use --relaxed to get more contigs in the summary file (see here).

If you want, you can see the calibrated scores of every single sequence (not just the ones classified as plasmid) in <prefix>_score_calibration/<prefix>_calibrated_aggregated_classification.tsv. This way you can compare the pre- and post-calibration scores for all contigs.
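A pre- versus post-calibration comparison could be sketched in Python as follows. The column names `seq_name` and `plasmid_score`, and the assumption that the uncalibrated and calibrated tables share the same layout, are not confirmed in this thread; the contig names and scores are made up for the demonstration.

```python
import csv
import io


def load_scores(handle, score_column="plasmid_score"):
    """Map sequence name to score from a tab-separated table.

    Assumes a 'seq_name' column plus the given score column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["seq_name"]: float(row[score_column]) for row in reader}


def score_shift(raw_handle, calibrated_handle, score_column="plasmid_score"):
    """Return {seq_name: (raw, calibrated, difference)} for shared sequences."""
    raw = load_scores(raw_handle, score_column)
    calibrated = load_scores(calibrated_handle, score_column)
    return {
        name: (raw[name], calibrated[name], calibrated[name] - raw[name])
        for name in raw
        if name in calibrated
    }


# Hypothetical tables standing in for the real geNomad output files.
raw_tsv = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "contig_1\t0.20\t0.70\t0.10\n"
    "contig_2\t0.85\t0.10\t0.05\n"
)
calibrated_tsv = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "contig_1\t0.03\t0.95\t0.02\n"
    "contig_2\t0.97\t0.02\t0.01\n"
)

shifts = score_shift(io.StringIO(raw_tsv), io.StringIO(calibrated_tsv))
for name, (before, after, delta) in sorted(shifts.items()):
    print(f"{name}\t{before:.2f}\t{after:.2f}\t{delta:+.2f}")
```

With real output, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.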

zehanna commented 1 month ago

Hi, this is the output of *compositions.tsv:

model       chromosome  plasmid  virus
marker      0.9112      0.0145   0.0743
nn          0.5764      0.3175   0.1061
aggregated  0.8863      0.0394   0.0743

My confusion is because I was already looking at the file calibrated_aggregated_classification.tsv, and there these contigs had plasmid scores of almost 1, but they weren't included in plasmid_summary.tsv in the summary directory. So the scores must already have been calibrated at this point, and something must have excluded them from being identified as plasmids despite an almost 100% probability of being a plasmid (according to *calibrated_aggregated_classification.tsv).
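For reference, the compositions table above can be read with a few lines of Python. This is only a sketch: the values are copied from the comment, the function name is hypothetical, and whitespace splitting is used so that it works whether the columns are separated by tabs or spaces.

```python
compositions_text = """\
model   chromosome      plasmid virus
marker  0.9112  0.0145  0.0743
nn      0.5764  0.3175  0.1061
aggregated      0.8863  0.0394  0.0743
"""


def parse_compositions(text):
    """Return {model: {class_name: fraction}} from a whitespace-separated table."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return {
        row[0]: dict(zip(header[1:], (float(value) for value in row[1:])))
        for row in rows
    }


compositions = parse_compositions(compositions_text)
for model, fractions in compositions.items():
    # Each row is expected to sum to roughly 1, since it describes a composition.
    print(model, fractions["plasmid"], round(sum(fractions.values()), 4))
```

Read this way, the aggregated plasmid fraction is 0.0394, i.e. plasmids are estimated to be a small minority of this sample.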

apcamargo commented 3 weeks ago

Ohh, that's probably because they have negative marker enrichment (which is one of the post-classification filters). Just use --relaxed and these plasmids should appear in your summary file.
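To see which contigs are affected by the post-classification filters, one option is to list the sequences whose calibrated plasmid score passes the cutoff but which are missing from the summary file. The Python sketch below assumes both tables are tab-separated with `seq_name` and `plasmid_score` columns, which is not confirmed in this thread; the function name and the demo data are hypothetical.

```python
import csv
import io


def read_rows(handle):
    """Read a tab-separated table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def high_score_but_unlisted(calibrated_handle, summary_handle, cutoff=0.7):
    """Names of sequences scoring >= cutoff that are absent from the summary."""
    listed = {row["seq_name"] for row in read_rows(summary_handle)}
    return [
        row["seq_name"]
        for row in read_rows(calibrated_handle)
        if float(row["plasmid_score"]) >= cutoff and row["seq_name"] not in listed
    ]


# Hypothetical tables standing in for the calibrated scores and plasmid summary.
calibrated_tsv = (
    "seq_name\tplasmid_score\n"
    "contig_1\t0.98\n"
    "contig_2\t0.12\n"
    "contig_3\t0.99\n"
)
summary_tsv = "seq_name\tplasmid_score\ncontig_1\t0.98\n"

unlisted = high_score_but_unlisted(
    io.StringIO(calibrated_tsv), io.StringIO(summary_tsv)
)
print(unlisted)
```

Contigs reported by such a check are candidates for having been removed by a post-classification filter such as marker enrichment, and would be expected to reappear when running with --relaxed.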