apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Recommendations for subsampling BIG input file #91

Closed wyanren closed 1 month ago

wyanren commented 2 months ago

Greetings!

Many thanks to genomad for its user-friendly command line interface and its sophisticated handling of MGE datasets.

I am currently planning to analyze a 100 GB genome file, aiming to identify the proportion of false chromosome segments within bins. Based on my estimates, this analysis could take at least a week on a server with 40 threads.

Any guidance or shared experiences with similar tasks would be greatly appreciated :)

apcamargo commented 2 months ago

You can find some tips on how to reduce execution time here. In summary, you can disable the neural network module and reduce the sensitivity of the marker annotation. Both will hurt classification performance a bit, but it's a compromise that makes it feasible to run a large dataset.

That said, I've run geNomad before on larger inputs. You can try disabling only the neural network and leaving the search sensitivity at its default value. You may also experiment with splitting the input using seqkit split2.
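
A minimal sketch of how the suggestions above could be combined: split the input with `seqkit split2`, then run geNomad on each chunk with the neural network disabled. The helper names (`seqkit_split_cmd`, `genomad_cmd`, `run_chunks`) and all paths are hypothetical. The flags `--by-part`, `--out-dir`, `--disable-nn-classification`, and `--sensitivity` are assumed to match your installed versions, so check them against `seqkit split2 --help` and `genomad end-to-end --help` before relying on this.

```python
import subprocess
from pathlib import Path


def seqkit_split_cmd(input_fasta, chunk_dir, n_parts):
    """Build the command that splits a FASTA file into n_parts files."""
    return [
        "seqkit", "split2",
        "--by-part", str(n_parts),    # assumed flag: number of output parts
        "--out-dir", str(chunk_dir),  # assumed flag: directory for the chunks
        str(input_fasta),
    ]


def genomad_cmd(chunk, out_dir, db_dir, disable_nn=True, sensitivity=None):
    """Build a geNomad end-to-end command for one chunk."""
    cmd = ["genomad", "end-to-end"]
    if disable_nn:
        # Skip the neural network module to save time (assumed flag name).
        cmd.append("--disable-nn-classification")
    if sensitivity is not None:
        # Optionally lower the marker search sensitivity (assumed flag name).
        cmd += ["--sensitivity", str(sensitivity)]
    cmd += [str(chunk), str(out_dir), str(db_dir)]
    return cmd


def run_chunks(input_fasta, work_dir, db_dir, n_parts):
    """Split the input, then run geNomad on every chunk, one after another."""
    work_dir = Path(work_dir)
    chunk_dir = work_dir / "chunks"
    subprocess.run(seqkit_split_cmd(input_fasta, chunk_dir, n_parts), check=True)
    # Pick up whatever files seqkit wrote instead of guessing their names.
    for chunk in sorted(p for p in chunk_dir.iterdir() if p.is_file()):
        out_dir = work_dir / chunk.stem  # one output directory per chunk
        subprocess.run(genomad_cmd(chunk, out_dir, db_dir), check=True)
```

Calling `run_chunks("bins.fna", "genomad_work", "genomad_db", 20)` would process 20 chunks sequentially; the file and directory names here are placeholders. Because the commands are built as plain lists, they can be printed and inspected first, or handed to a cluster scheduler instead of being run in a loop.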