Recommendations for subsampling BIG input file

apcamargo / genomad

geNomad: Identification of mobile genetic elements

Greetings!

Many thanks to genomad for its user-friendly command line interface and its sophisticated handling of MGE datasets.

I am currently planning to analyze a 100 GB genome file, aiming to identify the proportion of false chromosome segments within bins. Based on my estimates, this analysis could take at least a week on a server with 40 threads.

To speed up this process, I am considering subsampling the genome data. I've noticed that genomad employs an ultra-fast DIAMOND run for protein searches, so I wonder if I could run DIAMOND blastx to prefilter and subsample my input.
Also, I am curious whether removing excessively long contigs could further streamline the MGE analysis.
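
To make the contig-length idea concrete, here is a rough sketch of the kind of filter I have in mind. It is plain Python and not part of genomad; the function names (`read_fasta`, `filter_by_length`, `write_fasta`) and the thresholds are placeholders I made up for illustration, and whether dropping long contigs is appropriate at all is exactly what I am asking about.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=0, max_len=None):
    """Keep records whose sequence length lies in [min_len, max_len].

    Both thresholds are hypothetical placeholders, not recommended values.
    """
    for header, seq in records:
        if len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        yield header, seq


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass open("contigs.fna") instead.
    demo = io.StringIO(">c1\nACGT\nACGT\n>c2\nAC\n>c3\n" + "A" * 20 + "\n")
    out = io.StringIO()
    write_fasta(filter_by_length(read_fasta(demo), min_len=4, max_len=10), out)
    print(out.getvalue(), end="")
```

Because the records are streamed one at a time, memory use stays bounded by the longest single contig rather than by the whole 100 GB file.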

Any guidance or shared experiences with similar tasks would be greatly appreciated :)

Recommendations for subsampling BIG input file #91