apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Discrepancy in Output Counts in Genomad #119

Open F4NG666 opened 2 months ago

F4NG666 commented 2 months ago

Hi,

I hope this message finds you well.

I am currently using geNomad to analyze a dataset of 39,910 sequences. However, the row counts in the output files do not match the number of input sequences, and I would like some clarification:

- The summary file contains only 38,449 rows.
- The taxonomy file generated by the annotation module contains 39,888 rows.

Could you please help me understand why there is a difference in the number of rows between the input sequences and these output files? Specifically, I would like to know where and why the sequences might have been removed or filtered out.

Thank you for your assistance!

Best regards, Fang

apcamargo commented 2 months ago

The summary files should only include sequences classified as viruses (<prefix>_virus_summary.tsv) or plasmids (<prefix>_plasmid_summary.tsv). Sequences not present in the summary were either not classified as viruses or plasmids, or they were classified but didn't pass the post-classification filters. These filters can be disabled by using the --relaxed flag.

The taxonomy file only contains sequences that were assigned to a taxon. Sequences missing from this file did not match any taxonomically-informative markers. If you expected all sequences to match a marker, you can try increasing the search sensitivity (e.g., -s 7), but this will increase execution time and memory usage.
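If it helps to see exactly which sequences are absent from each file, something like the following Python sketch could be used. It is not part of geNomad. It assumes that the first column of each TSV holds the sequence name, that the file has a header row, and that provirus rows are named `<contig>|provirus_<start>_<end>`, so check these against your own output. The helper names (`read_fasta_ids`, `read_first_column`, `missing_from`) are made up for illustration.

```python
import csv
from pathlib import Path


def read_fasta_ids(fasta_path):
    """Return the set of sequence IDs (first word of each FASTA header)."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def read_first_column(tsv_path):
    """Return the set of names in the first column of a TSV, skipping the header row."""
    names = set()
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header row (assumed to be present)
        for row in reader:
            if row:
                # Assumption: provirus rows look like "<contig>|provirus_<start>_<end>",
                # so keep only the part before "|" to recover the input contig name.
                names.add(row[0].split("|")[0])
    return names


def missing_from(input_ids, *tsv_paths):
    """Return the input IDs that appear in none of the given TSV files."""
    present = set()
    for path in tsv_paths:
        if Path(path).exists():  # e.g. a run may lack one of the summary files
            present |= read_first_column(path)
    return input_ids - present
```

For example, `missing_from(read_fasta_ids("input.fna"), "<prefix>_virus_summary.tsv", "<prefix>_plasmid_summary.tsv")` would list the sequences that were not reported as viruses or plasmids, and passing the taxonomy file on its own would list the sequences that did not receive a taxonomic assignment. The file names here are placeholders for your actual paths.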