Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Exceeding cluster time limit & understanding # of rounds #69

Open sjfleck opened 4 years ago

sjfleck commented 4 years ago

Thank you for developing such a useful program. This is my first time running RepeatModeler and I'm still working out how to use it to the fullest extent. I'm really glad that you have the "-recoverDir" option, and it's working perfectly for me. All three of my genome assemblies made it to round 6 before hitting the time limit of 72 hours on my university's cluster. Here's the issue: after restarting round 6 last night, I checked just now and it's only at 5%, with an estimated 130 hours to go. This round will never finish with my current available resources.

I think that I'm running the job at its full capacity. I have a maximum of 40 threads/cores available per job submission, so I ran all three genomes with the "-pa 10" option.

Here are the lines in my script for the job I've submitted (I'm not including the variable paths):

```bash
$RepeatModeler/BuildDatabase -engine rmblast -name My_species_01 -dir $pilon

$RepeatModeler/RepeatModeler -pa 10 -LTRStruct -database My_species_01

$RepeatModeler/RepeatModeler -pa 10 -recoverDir $recoverDir -srand 1584548300 -LTRStruct -database My_species_01
```

I think I followed the manual and usage properly, but I wanted to post what I was submitting just in case. I also want to add that the genome sizes for my three assemblies are between 540 Mbp and 707 Mbp.

I'm not finding this specifically, but how many rounds are there supposed to be? Is it possible to stop after 5 rounds? I plan on using RepeatMasker next, followed by PurgeHaplotigs (which over-purges all three of these assemblies, which is why I'm running RepeatModeler/RepeatMasker first). Any insight into these issues would be appreciated. Thank you!

jebrosen commented 4 years ago

That time limit is unfortunate!

> I'm not finding this specifically, but how many rounds are there supposed to be? Is it possible to stop after 5 rounds?

The default is 6 rounds, with 243Mbp of the genome processed in round 6. You could shorten the total number of rounds by using the -genomeSampleSizeMax parameter: RepeatModeler ... -genomeSampleSizeMax 81000000 (81Mbp is the default size of round 5). This adds up to a total of 160Mbp sampled from the genome instead of 403Mbp, so you may miss some repetitive elements depending on the abundance and diversity of repetitive DNA in that particular genome.
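Concretely, reusing the options from your script (a sketch, not something I've run on your data):

```bash
# Cap the largest round's sample at 81 Mbp (round 5's default), which drops
# round 6 from the schedule entirely.
$RepeatModeler/RepeatModeler -pa 10 -LTRStruct -database My_species_01 \
    -genomeSampleSizeMax 81000000
```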

In case those three commands are all in the same job, it would also be a good idea to split up BuildDatabase from RepeatModeler. If there is no way to get a longer time limit temporarily, -genomeSampleSizeMax is probably your best bet.
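For example, split into two job submissions along these lines (same paths as in your script; a sketch):

```bash
# job 1: build the BLAST database (quick)
$RepeatModeler/BuildDatabase -engine rmblast -name My_species_01 -dir $pilon

# job 2: the long-running modeling step, submitted once job 1 finishes
$RepeatModeler/RepeatModeler -pa 10 -LTRStruct -database My_species_01
```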

sjfleck commented 4 years ago

Unfortunately, I'm already running those three commands in different scripts so round six is getting the full 72 hours to itself. Based on your response, it seems like my run time for round 6 isn't abnormal for plant genomes of these sizes (540 - 707 Mbp). I know RepeatModeler works best with assembled genomes and they all have been assembled and polished. All three have BUSCO scores around 93% and two have N50s over 2.2 Mbp.

I read through the description on your website (http://www.repeatmasker.org/RepeatModeler/) as well as the usage from the help menu, and I don't see a breakdown of how to use -genomeSampleSizeMax effectively. For example, my job will time out in ~12 hours and this is the last line in my .out file:

```
55% completed, 46:7:34 (hh:mm:ss) est. time remaining.
```

Does this mean that ~134Mbp (55% of 243Mbp) of the genome has been processed? If so, I should be able to find out what percentage it times out at and set it a tiny bit lower than that. Let me know if that sounds feasible. Because I haven't run RepeatModeler to completion, I don't know whether round 6 is the last step in the pipeline, and I want to leave time for whatever needs to run next.

Final question: Here are the sample stats printed right before round 6 started its "all-by-other comparisons":

```
-- Sample Stats:
       Sample Size 243001461 bp
       Num Contigs Represented = 752
       Non ambiguous bp:
             Initial: 243001461 bp
             After Masking: 88070861 bp
             Masked: 63.76 %
-- Input Database Coverage: 403133730 bp out of 667808631 bp ( 60.37 % )
```

Thanks for pointing out the sample size from round 5; now I see the sample sizes:

- round 1 = 40 Mbp
- round 2 = 3 Mbp
- round 3 = 9 Mbp
- round 4 = 27 Mbp
- round 5 = 81 Mbp
- round 6 = 243 Mbp
- total = 403 Mbp
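As a sanity check on that arithmetic, here's a quick sketch of the schedule as I now understand it (round 1 is a fixed 40 Mbp RepeatScout sample; each RECON round then triples, starting from 3 Mbp in round 2):

```bash
# Reproduce the per-round sample sizes listed above (values in Mbp).
total=40                      # round 1: fixed 40 Mbp RepeatScout sample
echo "round 1: 40"
size=3                        # RECON rounds triple, starting at 3 Mbp
for round in 2 3 4 5 6; do
  echo "round $round: $size"
  total=$((total + size))
  size=$((size * 3))
done
echo "total: $total"          # 40+3+9+27+81+243 = 403
```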

I did not understand that before, so thank you for pointing it out. It looks like the 6 rounds together cover 60.37% of my genome. Does this mean that any genome larger than 403 Mbp will only have a fraction of its genome processed? And if so, has anyone thought of workarounds, or is it not really an issue? I've never done it before, but I'm wondering if I can split my assembly fasta files in half and run them separately. Any help would be greatly appreciated. Thank you!

jebrosen commented 4 years ago

> Does this mean that ~134Mbp (55% of 243Mbp) of the genome has been processed?

That statistic is based on the current round, so it means 134Mbp of round 6 have been processed. Rounds 1-5, totaling 160Mbp, were already finished by that point. Actually, that figure is not quite accurate: the first 40Mbp sampled for RepeatScout can overlap the samples used for RECON analysis in rounds 2+, so the total unique sequence processed may be a bit less than 160Mbp.

> If so, I should be able to find out what percentage it times out at and set it a tiny bit lower than that. Let me know if that sounds feasible. Because I haven't run RepeatModeler to completion, I don't know whether round 6 is the last step in the pipeline, and I want to leave time for whatever needs to run next.

Good catch. -LTRStruct adds an additional analysis (over the whole genome!) that runs after all the rounds of RepeatScout/RECON. It can be run separately to get a time estimate, but the results can't currently be combined "after the fact" and RepeatModeler also can't currently be resumed if that step fails (issue #65).
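If you want a rough standalone timing: the RepeatModeler distribution includes an LTRPipeline script that this option invokes, so something along these lines could gauge that step by itself (treat the script's arguments and the fasta path as assumptions to verify against your install):

```bash
# Time the structural LTR analysis on its own (sketch). Assumptions: the
# LTRPipeline helper sits at the top of the RepeatModeler install and takes
# a genome fasta as its argument; verify both before relying on this.
/usr/bin/time -v $RepeatModeler/LTRPipeline $pilon/My_species_01.fasta \
    > ltrpipeline_test.log 2>&1
```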

> Does this mean that any genome larger than 403 Mbp will only have a fraction of its genome processed?

It does. The underlying assumption is that repetitive elements in the genome will be widespread and frequent enough to show up even in a sample. We do suggest increasing -genomeSampleSizeMax if more coverage is desired.
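For instance (a sketch; exactly how the round schedule behaves once the cap exceeds the default 403Mbp total is an assumption worth checking against the documentation):

```bash
# Raise the per-round sample cap toward the full ~668 Mbp assembly so later
# rounds can sample more of the genome. Assumption: sampling keeps growing
# until it hits this cap; expect substantially longer runtimes.
$RepeatModeler/RepeatModeler -pa 10 -LTRStruct -database My_species_01 \
    -genomeSampleSizeMax 668000000
```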

> I've never done it before, but I'm wondering if I can split my assembly fasta files in half and run them separately.

This will not give you the best results. Information from previous rounds informs later rounds, so running two batches of the genome will give you a very redundant pair of libraries, as each run usually independently discovers approximately the same elements. This might be okay if all you are doing is masking, but it will cause issues down the road if you do any work with the library itself or if you do annotation instead of masking.

sjfleck commented 4 years ago

Thank you for the quick reply; I read issue #65. Based on what you've told me, I'm unfortunately going to have to lower -genomeSampleSizeMax in order to get this job completed within my university's limitations. I'm still a little confused about how I'm going to pull this off using -recoverDir.

I want the 6th round to finish, but it seems like I can't risk the LTRPipeline + clustering steps starting and not finishing. Can I run two separate -recoverDir jobs to get everything completed?

Job #1 (recoverDir without -LTRStruct, to finish round 6):

```bash
$RepeatModeler/RepeatModeler -pa 10 -recoverDir $recoverDir -srand 1584548300 -database My_species_01
```

Job #2 (recoverDir with -LTRStruct):

```bash
$RepeatModeler/RepeatModeler -pa 10 -recoverDir $recoverDir -srand 1584548300 -LTRStruct -database My_species_01
```

Can I use -recoverDir on a completed job, or is it only for the rounds of RepeatScout + RECON steps? I suspect I can't yet, based on issue #65, but I also don't think I can fit round 6, the LTRPipeline, and the clustering steps all in the same run...

jebrosen commented 4 years ago

I don't think you can finish round 6 at all if it would take 130 hours: -recoverDir only works on entire rounds and can't restart partway through an unfinished round. Depending on the timing, you might be able to do this:

```bash
# let it fail during round 6
$RepeatModeler/RepeatModeler -pa 10 -srand 1584548300 -database My_species_01 -genomeSampleSizeMax xyz
# recover the unfinished run
$RepeatModeler/RepeatModeler -pa 10 -recoverDir $recoverDir -srand 1584548300 -database My_species_01 -genomeSampleSizeMax xyz
```

Where xyz is small enough to complete round 6 + the LTR pipeline in the second run, but large enough that it fails in the first run so that there is something to recover in the second.
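As a rough way to pick xyz from the numbers earlier in the thread (a sketch; the throughput figure is a crude estimate and the LTR pipeline time is unknown):

```bash
# Observed: ~55% of a 243 Mbp round finished in ~60 h (the 72 h limit minus
# the ~12 h that remained), i.e. roughly 0.55 * 243 / 60 ≈ 2.2 Mbp/h.
# The recovery run must redo the whole capped final round plus the LTR
# pipeline inside one 72 h job, so choose xyz such that
#   xyz / 2.2 Mbp/h + (LTR pipeline time) < 72 h
# With the LTR time unknown, one hypothetical conservative choice:
xyz=100000000   # 100 Mbp ≈ 45 h for the final round, ~27 h of headroom
```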

sjfleck commented 4 years ago

OK, I understood that the round starts over when you use -recoverDir, and that I also need to lower -genomeSampleSizeMax because of my 72-hour limit. I apologize for not being clear that I understood those parts.

Because this job takes so much time on my cluster, I want to ask one final question before submitting. In your experience, how does the LTR pipeline compare with round 6 in terms of time? I'm just trying to estimate how much I should lower the -genomeSampleSizeMax option. Thank you again for all your help.

jebrosen commented 4 years ago

> how does the LTR pipeline compare with round 6 in terms of time?

I'm sorry to say that I have no data for this on hand. I can look around for a few previous run results, but I believe it depends on genome composition rather than just size, so it will be hard to extrapolate.