Fewer viral contigs identified from genomad vs virsorter2

lingrongjin commented 3 months ago

Hi Antonio,

Thanks for developing the genomad pipeline. I found it very easy to use and the outputs were straightforward. However, I do have a few questions that I would like to hear your suggestions. I used genomad to predict viral contigs from single cell amplied genomes, but I found that it identified much fewer viral contigs compared to virsorter2 (8083 vs 18821). I did a little inspection and found that the overlap between the two tools was about 5000 seqs and there were ~13000 seqs (~400 were > 5kb with >= 1 hallmark) predicted by virsorter2 as viral which was not included in genomad's prediction. This was a little unexpected to me given that from my understanding, geNomad seems to include a more comprehensive viral marker database and the classification benchmark results were also better according to the paper. I'm wondering if you have any idea what may be causing the prediction differences (I understand that different tools can have different results, but the difference seems large in my case) and if you would suggest using geNomad in combination with other tools to maximize prediction accuracy. For your reference, I used genomad v1.8.0 with default threshold. I filtered my contigs to only keep those > 1500 bp (as I used --min-length 1500 in virsorter2) and split the contigs into ~75k per batch and run genomad with --enable-score-calibration on.

Thanks for your help!

Karen Jin

apcamargo commented 3 months ago

By default, geNomad will apply some additional filters to the classification results. So only sequences with a high score are flagged as virus. The filters are more aggressive for short sequences (less than 2.5kb), as geNomad requires them to encode at least one virus hallmark gene. You can read more about these filter here.

These filter were implemented after the paper was published, since I noticed that most people preferred to have less sequences that are more reliably classified. This means that, by default, precision is preferred over recall. If you want to maximize discovery, use the --relaxed parameter (which is explained in detail in the link above). This will disable all filters and provide you with all the sequences that geNomad classified as virus, regardless of their score or amount of virus hallmarks. If you don't change any other parameter, geNomad will skip all the slow steps (since you already finished computing them) and will only redo the filtering, which is very fast.

Let me know your results after using --relaxed!

apcamargo commented 3 months ago

you would suggest using geNomad in combination with other tools to maximize prediction accuracy

This is a tricky question. In principle, I'm not a big fan of doing that because you won't have an estimate of the false discovery rate of the discovery process. As you add more tools, the complexity of the methodology increases and, ideally, you'd have to do some additional tests to have an idea of the expected false discovery rate of the process.

That said, if you can find sequences that are clearly viral and that geNomad can't identify, for whatever reason, I won't tell you that you shouldn't use them. Different methodologies will always have cases where they work better than others. You just need to be careful to integrate them in a way that allows you to have an idea of the expected performance (otherwise, you'd just take the union of a dozen of different tools and call it a day). It might be possible to combine geNomad and VirSorter2 in a way that will provide results that are better than the individual runs, but I've never tried this and I can tell you it's not trivial to benchmark in a robust manner. Taking the union will for sure decrease precision substantially, taking the intersection will reduce recall by a lot.

Of course, this is my opinion. Benchmarking in bioinformatics is a complex topic and different people will have different opinions.

lingrongjin commented 3 months ago

Hi Antonio,

Thanks for your quick response! I checked genomad's results using --relaxed mode and it indeed boost the number of viral contigs predicted from ~8k to ~16k. I re-checked the overlap with Virsorter2 and the intersect also increased by about 1k+ seq, but at the expense of increasing amount of unique seqs predicted by the two tools. I agree that combining different tools may not be straightforward and I just wanted to make sure that the differences are a result of different computational approaches/thresholds but not some systematic bias like different database representation resulting in one type of virus easier to be detected by one tool but not the other.

apcamargo commented 3 months ago

Glad to hear that worked out :) I'll close the issue for now, but please let me know if you have any other questions

apcamargo / genomad

Fewer viral contigs identified from genomad vs virsorter2 #110