Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Long Comparison Time #243

Open wwyyxx517 opened 1 month ago

wwyyxx517 commented 1 month ago

Hi,

I am running manually installed RepeatModeler on an 800 Mb spider genome.

BuildDatabase -name temo -engine ncbi TEMO.fna
RepeatModeler -threads 64 -database temo -engine ncbi

But the job appears to have been killed on the cluster because of its long runtime. The output reports a comparison time of 84 h in round 4 and more than 600 h in round 5 (log attached: job.21576920.out.txt).

I've tried setting the environment variable BLAST_USAGE_REPORT=false. When I tested it on a 50 Mb fungal genome, it reduced the runtime from 45 h to 11 h, but it doesn't help on the spider genome.
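For anyone reproducing this, here is a minimal sketch of how the variable might be set so that the BLAST processes started by RepeatModeler inherit it. It assumes a bash-like shell and reuses the database name and input file from the commands above. The `command -v` guard is only there so the snippet is harmless on a machine where RepeatModeler is not on the PATH.

```shell
# Disable NCBI BLAST+ usage reporting for this shell session and every
# child process started from it.
export BLAST_USAGE_REPORT=false

# Same commands as in the post above; they pick up the variable from the
# environment. Skipped if RepeatModeler is not installed on this machine.
if command -v RepeatModeler >/dev/null 2>&1; then
    BuildDatabase -name temo -engine ncbi TEMO.fna
    RepeatModeler -threads 64 -database temo -engine ncbi
fi
```

On a cluster, the `export` line would normally go inside the job script itself, so that it applies on the compute node and not only in the login shell.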

I am also running another spider genome of similar size, and the same problem appears. I don't know how to solve this.

Thanks in advance.

rmhubley commented 2 weeks ago

Wow... this is interesting. You are doing everything right as far as I can see in your output. The problem appears to be the lack of repetitive sequence in the assembly you have. In round 1, no families were identified with RepeatScout. That's really surprising, as RepeatScout should catch the most abundant elements and allow RepeatModeler to mask them out in the progressively larger samples it analyzes in rounds 2-5. Without that masking, your runtime goes through the roof. In this case it should probably have given up after round 4.

What is really intriguing is that by round 5 you have developed a TE library that masks 40% of the next round's genome sample. Either there are a huge number of low-abundance TE families in the genome, or there is extreme heterogeneity in TE density across the genome and an unfortunate random sample was picked in the early rounds.

There is no harm in ending a run early. You can always run the RepeatClassifier tool by hand on the generated library (normally the last step after the final round). That's an interesting critter... let me know what you find.
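As a rough sketch of the manual classification step mentioned above, something like the following could be tried on an interrupted run. The directory name `RM_working_dir` is a hypothetical placeholder for the run's actual working directory. The file names `consensi.fa` and `families.stk`, and the `-consensi` / `-stockholm` options, are assumptions based on typical RepeatModeler 2 layouts, so check `RepeatClassifier -h` on your installation before relying on them.

```shell
# Hypothetical placeholder: point RM_DIR at the working directory that the
# interrupted RepeatModeler run created.
RM_DIR="${RM_DIR:-RM_working_dir}"

if command -v RepeatClassifier >/dev/null 2>&1 && [ -s "$RM_DIR/consensi.fa" ]; then
    # Assumed inputs: consensus sequences (FASTA) and the matching seed
    # alignments (Stockholm) accumulated by the rounds that completed.
    RepeatClassifier -consensi "$RM_DIR/consensi.fa" -stockholm "$RM_DIR/families.stk"
else
    echo "RepeatClassifier or $RM_DIR/consensi.fa not found; nothing to classify" >&2
fi
```

If the Stockholm file is missing from a partial run, it may be possible to classify the FASTA file on its own, but that is untested here.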