Round-5 issue in small genome

AlexdeMendoza commented 3 years ago

Dear RepeatModeler developers,

I have been using RepeatModeler for a while now, and I can get it to run on many organisms / genomes. I have also updated to RepeatModeler 2. One thing I would recommend is stating in your installation instructions that TRF has to be in your PATH under the name "trf", since this is how LTR_retriever calls it, no matter what is specified in the RepeatModeler 2 configuration (my copy was named trf409.linux64).

In any case, I've come across this genome where the software always breaks at the same step:

========================
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 81315710 bp
FastaDB::compact - Error could not locate file /mnt/eql/stiletto-scratch/amendoza/RepeatModeler2_test/RM_82846.ThuFeb40147522021/round-5/sampleDB-5.fa!
 at /home/amendoza/working_data/RepeatModeler-2.0.1/RepeatModeler line 837.

The round-4 folder is filled with results, but round-5 only has two empty files: "sampleDB-5.fa" and "sampleDB-5.fa.entry_batch". I have run the pipeline on the same computer many times, for genomes of similar size (both smaller and larger), and I have tried this genome several times, so I am not sure what is going on. I do get a "consensi.fa" out of this process; however, I would like to see it finish and see how the new LTR module works on this genome. Any tips on what might be going wrong?

Thanks in advance for your help.

Alex

jebrosen commented 3 years ago

> One thing I would recommend is stating in your installation instructions that TRF has to be in your PATH under the name "trf", since this is how LTR_retriever calls it, no matter what is specified in the RepeatModeler 2 configuration (my copy was named trf409.linux64).

Hm. Yeah, we ought to fix this. LTR_retriever used to always bundle and run its own copy of TRF, and the new version doesn't try different filenames the way RepeatModeler does. It looks like the latest version of LTR_retriever supports a command-line override, so we should be able to use that to point to the TRF program configured for RepeatModeler instead of whichever one is on PATH.
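
Until that is fixed, a quick way to see whether the PATH requirement described above is met is to look up the name "trf" the same way a shell would. The following is only a minimal sketch; `find_tool` is a helper name made up for this example and is not part of RepeatModeler or LTR_retriever.

```python
import shutil


def find_tool(name, search_path=None):
    """Return the full path of an executable called `name`, or None.

    search_path=None means "use the current PATH", as shutil.which does.
    """
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_tool("trf")
    if location is None:
        # Per this thread, LTR_retriever looks for the literal name "trf".
        print("No executable named 'trf' on PATH; a versioned binary such as "
              "trf409.linux64 may need a symlink or copy named 'trf'.")
    else:
        print(f"trf resolves to {location}")
```

If the lookup returns nothing, placing a symlink or copy named trf in a directory that is on PATH should satisfy the lookup by name.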

> In any case, I've come across this genome where the software always breaks at the same step (...) I have run the pipeline on the same computer many times, for genomes of similar size (both smaller and larger), and I have tried this genome several times, so I am not sure what is going on.

To be sure, every other genome has worked but this particular genome always fails? Is this a genome that is publicly available or that could be shared with us for troubleshooting purposes?

AlexdeMendoza commented 3 years ago

Yes, it worked on many other genomes, and I have tried this particular genome several times and it always dies at the same step, which is why I am surprised. It is public; you can download it from here: ftp://ftp.ensemblgenomes.org/pub/protists/release-49/fasta/protists_heterolobosea1_collection/naegleria_gruberi_gca_000004985/dna/Naegleria_gruberi_gca_000004985.V1.0.dna.toplevel.fa.gz

Thanks for having a look!

A.

jebrosen commented 3 years ago

@AlexdeMendoza Thanks, I have been able to reproduce this issue!

RepeatModeler uses a sampling approach with a larger sample each round, never re-using sequence that was already sampled. This genome's size or structure hits an awkward edge case: enough un-sampled sequence remained after round 4 that RepeatModeler went on to round 5, but at the start of round 5 it turned out that too few sufficiently long sequences (>40 kbp) were left after all. That is probably because most of the contigs are very small. Either way, this is a bug: RepeatModeler ought to stop after round 4 here instead of failing.
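
To get a rough idea of whether an assembly is fragmented enough to run into this, one can count how many sequences exceed the 40 kbp length mentioned above. This is a minimal sketch and not RepeatModeler's own sampling code; the helper names are made up for the example, and the only number taken from this thread is the 40 kbp threshold.

```python
import gzip
import sys


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def sequence_lengths(lines):
    """Yield the length of each record in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths, min_len=40_000):
    """Return (count, total bases) of sequences longer than min_len."""
    long_ones = [n for n in lengths if n > min_len]
    return len(long_ones), sum(long_ones)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open_fasta(sys.argv[1]) as handle:
        count, bases = summarize(sequence_lengths(handle))
    print(f"{count} sequences longer than 40 kbp, {bases} bp in total")
```

If the total comes out small compared with the amount of sequence the later rounds try to gather, the genome may be at risk of the failure described here.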

One workaround for this genome is to add a parameter to your command line: -genomeSampleSizeMax 27000000. This causes RepeatModeler to stop after round 4, avoiding the error. It should still work in combination with the -LTRStruct option.

AlexdeMendoza commented 3 years ago

Thanks for having a look! I will give it a shot. I was playing with the -genomeSampleSizeMax parameter, but without really knowing what I was doing. Is there a way to compute this value beforehand?

Another thing I have noticed is that every time I try -recoverDir RMxxxxxx/, I get an error saying that the run has not gone past round-1 and that I had better start from scratch. But that is never the case (I checked, and there are several round-n folders in the directory).

Also, it is quite unclear how to run the LTRStruct pipeline on its own; there is no example code. Running it standalone would help to troubleshoot installation issues with software like Ninja or LTR_retriever in some instances. (I know that running it alone won't do the merging with the rest of the pipeline.)

jebrosen commented 3 years ago

> I was playing with the -genomeSampleSizeMax parameter, but without really knowing what I was doing. Is there a way to compute this value beforehand?

At this time the default round sizes are 40,000,000 bp for round 1 (with RepeatScout), then 3,000,000, 9,000,000, 27,000,000, 81,000,000 and 243,000,000 bp (multiplying by 3 each time) for rounds 2-6, with the RECON program. So I picked the size corresponding to round 4. RepeatModeler does try to stop early if necessary, but for this particular file that wasn't detected correctly.
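
The arithmetic behind those round sizes can be written down as a small function. This is just a sketch of the numbers quoted above, not code taken from RepeatModeler, and the names are made up for the example. The log earlier in this thread reports "Gathering up to 81315710 bp" for round 5, so the amount actually gathered can differ slightly from the nominal size.

```python
ROUND1_SIZE = 40_000_000  # round 1 (RepeatScout), as stated above
ROUND2_SIZE = 3_000_000   # round 2 (RECON), as stated above
GROWTH = 3                # each later RECON round is 3x the previous one


def round_sample_size(round_number):
    """Nominal sample size in bp for a given round (1-based)."""
    if round_number < 1:
        raise ValueError("round numbers start at 1")
    if round_number == 1:
        return ROUND1_SIZE
    return ROUND2_SIZE * GROWTH ** (round_number - 2)


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"round {n}: {round_sample_size(n)} bp")
```

For this genome, round_sample_size(4) gives the 27000000 used with -genomeSampleSizeMax in the workaround.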

> Another thing I have noticed is that every time I try -recoverDir RMxxxxxx/, I get an error saying that the run has not gone past round-1 and that I had better start from scratch. But that is never the case (I checked, and there are several round-n folders in the directory).

I think I have seen this before, but I can't find the issue about it. In your case round 2 did not find any elements, but round 3 did, and the run could have been recovered. Round 2 is checked before round 3, and RepeatModeler interprets the empty round 2 as a failure too quickly.
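
The difference between the two behaviours can be illustrated with a toy model. This is not RepeatModeler's recovery code: both functions and the input format (a dict mapping round number to whether that round produced results) are made up for the example, and the assumption that rounds 1, 3 and 4 had results while round 2 was empty is only a guess at what the run directory looked like.

```python
def naive_last_round(has_results):
    """Stop at the first round without results (the behaviour described above)."""
    last = 0
    for n in sorted(has_results):
        if not has_results[n]:
            break
        last = n
    return last


def tolerant_last_round(has_results):
    """Return the highest round with results, ignoring empty rounds before it."""
    good = [n for n, ok in has_results.items() if ok]
    return max(good, default=0)


if __name__ == "__main__":
    # Hypothetical run: round 2 found nothing, rounds 3 and 4 did.
    rounds = {1: True, 2: False, 3: True, 4: True}
    print("naive check stops at round", naive_last_round(rounds))
    print("tolerant check resumes after round", tolerant_last_round(rounds))
```

In this toy model the naive check reports that the run never got past round 1, which matches the message you are seeing, while the tolerant check would pick up after round 4.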

> Also, it is quite unclear how to run the LTRStruct pipeline on its own; there is no example code. Running it standalone would help to troubleshoot installation issues with software like Ninja or LTR_retriever in some instances.

This one is actually pretty straightforward to run standalone to test that it works:

LTRPipeline genome.fa

The current version of LTRPipeline produces the two files genome-ltrs.fa and genome-ltrs.stk and an LTR_<date> intermediate output directory.
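
For an installation test, the run and the check for those two files can be wrapped in a short script. This is a sketch under a few assumptions that are not confirmed in this thread: that LTRPipeline is on PATH, that the output names are derived from the input file name the way genome.fa gives genome-ltrs.fa, and that the files land in the directory the command is run from. The function names are made up for the example.

```python
import subprocess
from pathlib import Path


def expected_outputs(genome_path):
    """Output file names expected for a genome file (naming is an assumption)."""
    stem = Path(genome_path).stem
    return [f"{stem}-ltrs.fa", f"{stem}-ltrs.stk"]


def run_ltr_pipeline(genome_path, workdir="."):
    """Run LTRPipeline on genome_path and return any expected outputs that are missing."""
    genome = Path(genome_path).resolve()
    # check=True raises CalledProcessError if LTRPipeline exits with an error.
    subprocess.run(["LTRPipeline", str(genome)], cwd=workdir, check=True)
    return [name for name in expected_outputs(genome)
            if not (Path(workdir) / name).is_file()]
```

An empty list from run_ltr_pipeline("genome.fa") would mean both files were produced; a CalledProcessError or a non-empty list points to a problem with one of the dependencies.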

AlexdeMendoza commented 3 years ago

Thanks again for all the details.

Regarding LTRPipeline, it would be good if it could be run in a way that merges its results with prior RepeatModeler runs. For some large genomes, this modularity could save a lot of time: recomputing everything from scratch can take ages, or one process might break due to a time-out, a RAM peak, etc. (gt is pretty RAM-intensive, for example).

Dfam-consortium / RepeatModeler

Round-5 issue in small genome #118