Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Huge cd-hit-stdout with basically no content - should I kill these jobs? #152

Open KatharinaHoff opened 3 years ago

KatharinaHoff commented 3 years ago

Are terabytes of cd-hit-stdout "normal"?

I recently ran RepeatModeler2 on about 240 mammalian genomes. In the first attempt, I used nodes with 72 threads and 189 GB of RAM. That worked for 235 of the genomes (although RM2 does not scale well to that level of parallelization... it often looks like nothing is happening in terms of load on the nodes, particularly in the first round... but there's an open issue about that already, never mind, the jobs finished ok).

For the other five, the job was killed, either because it ran out of RAM or because it exceeded the 3-day runtime limit. I therefore moved the five jobs to different nodes. They are now running with 8 threads & 500 GB RAM, and without a runtime limit. I have only 2 of these nodes... and three jobs are still in the queue because the first two have been running for >7 days. For at least two days, these jobs have been in the stage of "-- Clustering results with previous rounds..." (which is close to finishing, yeah...)

The cd-hit-stdout of these jobs is huge: we are talking about terabytes. For Ammotragus lervia, it's currently at 6.8T; for the other job it's even bigger. The content of that file is pretty much like this:

^Mcomparing sequences from          0  to          0
^Mcomparing sequences from          0  to          0

It contains this line over and over and over again, and there is nothing else (the file header is alright, I didn't copy it because it looks totally fine, like with jobs that worked well).
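For anyone who wants to check the same thing on their own run without reading terabytes, here is a minimal Python sketch that looks only at the last chunk of the file and counts the distinct records in it. The function name `summarize_tail` and the 1 MB default are made up for illustration, and the script assumes records are separated by carriage returns and/or newlines, as in the excerpt above.

```python
import os
import re
import sys
from collections import Counter


def summarize_tail(path, nbytes=1_000_000):
    """Count distinct records in the last `nbytes` bytes of a file.

    Only the tail is read, so this stays cheap even for a multi-terabyte file.
    The first record of the chunk may be cut off, because we seek to an
    arbitrary byte offset.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        fh.seek(max(0, size - nbytes))
        chunk = fh.read()
    # Records may be separated by carriage returns, newlines, or both.
    records = re.split(rb"[\r\n]+", chunk)
    counts = Counter()
    for rec in records:
        rec = rec.strip()
        if not rec:
            continue
        # Collapse the padding so differently aligned numbers compare equal.
        counts[re.sub(rb"\s+", b" ", rec).decode("ascii", "replace")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_tail.py /path/to/cd-hit-stdout
    for record, n in summarize_tail(sys.argv[1]).most_common(5):
        print(n, record)
```

If the only record that shows up is "comparing sequences from 0 to 0", the tail of the file matches what I describe here; if the numbers change between records, the job is at least reporting different ranges.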

My question is therefore: is it normal to have those HUGE cd-hit-stdout files? Can I expect these jobs to ever finish ok, or should I kill them and use a different sampling parameter? (I would highly prefer to keep it consistent across all 240 genomes but maybe that's impossible with my hardware configuration, not sure...)

If it is "normal" I suggest that you maybe redirect that output to /dev/null in the future, since it seems pretty pointless...
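A gentler alternative to /dev/null would be to drop only the progress lines and keep everything else. The sketch below shows the idea in Python; it is not how RepeatModeler handles this output, the name `drop_progress` is hypothetical, and the pattern is guessed from the two lines quoted above, so it may need adjusting for other cd-hit messages.

```python
import re

# Pattern guessed from the excerpt in this post: an optional carriage return,
# then "comparing sequences from <int> to <int>" with variable padding.
PROGRESS = re.compile(r"^\r?comparing sequences from\s+\d+\s+to\s+\d+\s*$")


def drop_progress(lines):
    """Yield every line except cd-hit style progress lines."""
    for line in lines:
        if not PROGRESS.match(line.rstrip("\n")):
            yield line
```

When reading from a file, opening it with `newline="\n"` keeps Python from translating the carriage returns, so each progress record arrives as one line.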

I definitely observe this for Ammotragus lervia & Bos mutus, I cannot say for sure but expect it for Capra aegagrus, Ovis aries & Pantholops hodgsonii as well.

Here's the software configuration from log:

RepeatModeler Version 2.0.2
===========================
Search Engine = rmblast 2.11.0+
Dependencies: TRF 4.09, RECON , RepeatScout 1.0.6, RepeatMasker 4.1.2
LTR Structural Analysis: Enabled ( GenomeTools 1.6.2, LTR_Retriever v2.9.0,
                                   Ninja 0.95-cluster_only, MAFFT 7.475,
                                   CD-HIT  )

Please let me know whether you think these jobs will ever end ;-)

Best regards,

Katharina