TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

The progress appears quite slow for large genome #144

Closed joweihsieh closed 2 weeks ago

joweihsieh commented 3 weeks ago

Hi,

Thank you for developing this tool! I've been using EarlGrey for some plant genomes and encountered an issue.

I successfully ran it on a smaller genome (~500MB), but on September 19th, I started a run for a larger genome (~17GB). As of today (September 30th), the process still seems to be running, but the progress appears quite slow.

The end of the log file shows the following:

sh: /dev/tty: No such device or address
^M# 43547 sec rnd-1_family-228.fasta
^M10^MESC[7m10% 60:5ESC[0m40=12h05m47s rnd-1_family-228.fasta ESC[0msh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
^M# 43547 sec rnd-1_family-228.fasta
^M10^MESC[7m10% 60:5ESC[0m40=12h05m47s rnd-1_family-228.fasta ESC[0msh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
^M# 43547 sec rnd-1_family-228.fasta
^M10^MESC[7m10% 60:5ESC[0m40=12h05m47s rnd-1_family-228.fasta ESC[0m

I’m wondering if this is expected behavior or if something has gone wrong? Should I restart the process with some adjustments, as this seems unusual? Also, how long should I expect this to take for such a large genome?

Thanks for your help!

Jo-Wei

TobyBaril commented 2 weeks ago

Hi Jo-Wei,

This behaviour is likely normal for a genome of this size, as it is very likely extremely repeat-rich. For reference, we tested the tuatara genome (~25GB) a while ago with 128 cores and 1TB RAM, and it took over a month. The human genome also takes around a week (depending, of course, on the number of cores).

Unfortunately, repeat detection remains very computationally intensive at the moment, so it is just a waiting game for large genomes.
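
While waiting, a log like the excerpt quoted above is easier to read once the terminal control characters are stripped. The following is only a minimal sketch, not part of Earl Grey: it assumes the log file contains raw carriage returns and ANSI escape sequences (shown as ^M and ESC in the excerpt), and the function name `clean_progress_log` is hypothetical.

```python
import re

# Matches ANSI CSI sequences such as ESC[7m and ESC[0m, where ESC is the raw 0x1b byte.
# Assumption: the log stores these as raw control characters, not as literal "ESC" text.
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def clean_progress_log(text: str) -> list[str]:
    """Return readable lines from a progress-bar style log (hypothetical helper)."""
    # Drop colour/inverse-video escape sequences.
    text = ANSI_CSI.sub("", text)
    # Carriage returns redraw the same terminal line; treat them like newlines.
    pieces = re.split(r"[\r\n]+", text)
    cleaned: list[str] = []
    for piece in pieces:
        piece = piece.strip()
        # Skip empty fragments and consecutive repeats of the same message.
        if piece and (not cleaned or cleaned[-1] != piece):
            cleaned.append(piece)
    return cleaned
```

To use it, read the log with `open(path, errors="replace").read()` and print the returned lines, where `path` is wherever your run writes its log. If the last family name and elapsed time keep changing between checks, the run is still progressing.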

Best wishes,

Toby