Open zehanna opened 2 months ago
Score calibration should work fine in cases like this. I recommend using it.
The performance of the calibration drops a bit when the sample composition is extreme (e.g., 99% plasmids), but geNomad automatically detects cases like this and deals with them properly (see these lines, in case you're curious).
Hi @apcamargo, thanks a lot for your answer. After running genomad with score calibration, I'm a little confused about some aspects of the output. E.g. in the file calibrated_aggregated_classification.tsv, some contigs have plasmid scores very close to 1 (which, if I understand correctly, means their probability of being a plasmid is ~100%), but they are not in the file plasmid_summary.tsv. How can this happen?
My guess is that the empirical plasmid fraction within the sample was very low, but the algorithm should be robust to cases like this. Can you share the contents of <prefix>_score_calibration/<prefix>_compositions.tsv?
These contigs were excluded from <prefix>_plasmid_summary.tsv because their calibrated scores were lower than 0.7 (default cutoff). You can try to use --relaxed to get more contigs in the summary file (see here).
If you want, you can see the calibrated scores of every single sequence (not just the ones classified as plasmid) in <prefix>_score_calibration/<prefix>_calibrated_aggregated_classification.tsv. This way you can compare the pre- and post-calibration scores for all contigs.
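For anyone who wants to do that comparison programmatically, here is a minimal sketch using only the Python standard library. It assumes both TSV files have a `seq_name` column and a `plasmid_score` column; those column names, the helper names `load_scores` and `score_shift`, and the path to the uncalibrated file are assumptions for illustration, so check the headers of your own output first.

```python
import csv


def load_scores(path, column="plasmid_score", key="seq_name"):
    """Read one score column from a tab-separated file into {seq_name: score}.

    The column names are assumptions; adjust them to match your file headers.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[key]: float(row[column]) for row in reader}


def score_shift(raw_path, calibrated_path, column="plasmid_score"):
    """Return (seq_name, raw, calibrated, calibrated - raw) tuples.

    Only sequences present in both files are reported, sorted so that the
    largest absolute change comes first.
    """
    raw = load_scores(raw_path, column)
    calibrated = load_scores(calibrated_path, column)
    shared = raw.keys() & calibrated.keys()
    rows = [
        (name, raw[name], calibrated[name], calibrated[name] - raw[name])
        for name in shared
    ]
    return sorted(rows, key=lambda row: abs(row[3]), reverse=True)
```

Calling `score_shift(raw_tsv, calibrated_tsv)` with the uncalibrated and calibrated aggregated classification files lists the contigs whose plasmid score moved the most after calibration.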
Hi, this is the output of *compositions.tsv:
model chromosome plasmid virus
marker 0.9112 0.0145 0.0743
nn 0.5764 0.3175 0.1061
aggregated 0.8863 0.0394 0.0743
My confusion is because I was already looking into the file calibrated_aggregated_classification.tsv, and there these contigs had plasmid scores of almost 1, but then they weren't included in the plasmid_summary.tsv in the summary directory. So the scores must have been already calibrated at this point, and something must have excluded them from being identified as plasmid despite an almost 100% probability of being a plasmid (according to *calibrated_aggregated_classification.tsv)
Ohh, that's probably because they have negative marker enrichment (which is one of the post-classification filters). Just use --relaxed and these plasmids should appear in your summary file.
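To see which contigs are affected before rerunning, a rough sketch like the one below lists sequences that pass the 0.7 score cutoff in the calibrated file but are absent from the plasmid summary. It assumes both files are tab-separated with a `seq_name` column and that the calibrated file has a `plasmid_score` column; the function name `missing_from_summary` and those column names are assumptions, not something confirmed in this thread.

```python
import csv


def missing_from_summary(calibrated_path, summary_path, min_score=0.7):
    """List (seq_name, plasmid_score) pairs that pass the score cutoff
    but do not appear in the plasmid summary file.

    Column names ("seq_name", "plasmid_score") are assumptions; check
    the headers of your own files.
    """
    with open(summary_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        in_summary = {row["seq_name"] for row in reader}

    missing = []
    with open(calibrated_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            score = float(row["plasmid_score"])
            if score >= min_score and row["seq_name"] not in in_summary:
                missing.append((row["seq_name"], score))
    return missing
```

Contigs returned here were dropped by something other than the score cutoff, such as the marker enrichment filter mentioned above.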
Hi, I have a question regarding the --enable-score-calibration flag. I understand that it considers the composition of the sample to output a 'real' probability, e.g. plasmid score of 0.4 would mean that there is a 40% chance that this sequence is a plasmid. And that without using score calibration, the scores are not actual probabilities. What I'm wondering is, if I run genomad on sequences which I already preselected on some criteria, e.g. a collection of short, circular contigs extracted from multiple metagenomes, would it be recommended to use the score calibration or not? Because in that collection I probably have a higher chance of finding plasmids than in a 'natural' assembly of an environmental metagenome. So what I'm wondering is, is the score calibration recommended only for 'natural' samples, or also samples where already a pre-selection of sequences (that are more likely viral or plasmid) has taken place?