Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

Segmentation Fault and Splitting #68

Open BioFalcon opened 4 years ago

BioFalcon commented 4 years ago

Hi,

I have been having problems running RepeatModeler, especially when it reaches the LTR pipeline. The gt ltrharvest step seems to die due to a segmentation fault (I'm guessing because of the size of the genome I'm working with), so I'm running tests to pinpoint the upper limits that LtrHarvest can handle. So far the problem doesn't seem to be the length of individual sequences, but rather the total volume of sequence (a run with only the longest fragment completed successfully). The question is: is it possible to split the genome and merge the results somehow?
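The splitting half of that question can be sketched. The helper names below (`parse_fasta`, `split_fasta`) are hypothetical and not part of RepeatModeler or GenomeTools; this is just one simple way to break a multi-FASTA genome into chunks capped by total bases:

```python
def parse_fasta(text):
    """Minimal FASTA parser: yields (header, sequence) pairs."""
    name, seq = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:], []
        elif line:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

def split_fasta(records, max_bases):
    """Group (name, seq) records into chunks whose combined length
    stays at or below max_bases; a single oversized record still
    gets its own chunk rather than being dropped."""
    chunks, current, size = [], [], 0
    for name, seq in records:
        if current and size + len(seq) > max_bases:
            chunks.append(current)
            current, size = [], 0
        current.append((name, seq))
        size += len(seq)
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be written back out as its own FASTA file and fed to the LTR pipeline separately. Note this splits between sequences only; splitting within a long sequence would additionally need overlap handling so elements spanning a cut point aren't lost.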

jebrosen commented 4 years ago

Is the genome you are working with public, or do you know of any similarly large genomes we could try and reproduce this with?

As for splitting: the RepeatScout+RECON steps of RepeatModeler already use a sampling approach that observes up to around 400Mbp of the input genome overall, but the LTR discovery pipeline operates on the entire input. In principle the results of running RepeatModeler on a 5Gbp sample may be usable. Whether it will be a representative library depends on the copy number and evenness of spread of repeats in the specific genome you are working with.
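To illustrate what "observes up to around 400Mbp" means, here is an assumed toy model of budget-limited window sampling. RepeatModeler's real sampler lives in its Perl code and is more careful (round-based, avoiding rediscovery of families); nothing below reflects its actual implementation:

```python
import random

def sample_windows(seq_lengths, window, budget, rng=None):
    """Toy sketch: pick random (seq_index, start, end) windows from
    sequences of the given lengths until roughly `budget` bases
    have been sampled in total."""
    rng = rng or random.Random(0)
    # Only sequences long enough to hold a full window are eligible.
    eligible = [i for i, n in enumerate(seq_lengths) if n >= window]
    samples, sampled = [], 0
    while sampled < budget and eligible:
        i = rng.choice(eligible)
        start = rng.randrange(seq_lengths[i] - window + 1)
        samples.append((i, start, start + window))
        sampled += window
    return samples
```

With a fixed budget, the sampled fraction shrinks as the genome grows, which is why representativeness then depends on copy number and how evenly repeats are spread, as noted above.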

BioFalcon commented 4 years ago

Hi, sorry for the very late response, I had to meet other obligations but now I'm coming back to the project and picking up where I left off. It seems that the problems with the pipeline come from LtrHarvest, which I think you guys are not in charge of maintaining, as it stops running due to a segmentation fault. A workaround I found was to split the genome into chunks and run the pipeline on each, but I'm wondering if there is a way to merge the results from different runs into a single set of repeats.

jebrosen commented 4 years ago

It seems that the problems with the pipeline come from LtrHarvest, which I think you guys are not in charge of maintaining, as it stops running due to a segmentation fault. A workaround I found was to split the genome into chunks and run the pipeline on each

Yeah, we have seen this once or twice before (cc genometools/genometools#940). Maybe your information or data files could help the folks at genometools with troubleshooting or solving it.

I'm wondering if there is a way to merge the results from different runs into a single set of repeats.

Usually merging is the wrong approach, partly because of the sampling approach used for RepeatScout+RECON:

It is not recommended that a genome be run in a batched fashion nor the results of multiple RepeatModeler runs on the same genome be naively combined. Doing so will generate a combined library that is largely redundant. The -genomeSampleSizeMax parameter is provided for the purpose of increasing the amount of the genome sampled while avoiding rediscovery of families.

BioFalcon commented 3 years ago

Hi, I'm just revisiting this issue after almost a year of avoiding it haha. I asked in genometools/genometools#940, and basically their answer was to split the job into several parts. I was thinking that a solution would be to split the genome into several sections, just as RepeatModeler does for the masking in each round. I'm working on a script to implement this but I'm still debugging and testing it. Do you think this could be implemented in RepeatModeler (given that most big genomes have a huge number of LTRs, so this element type is quite important)? Also, how do you feel about parallelisation inside Perl?
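On the parallelisation question: in Perl one would typically reach for a forking module such as Parallel::ForkManager, but the driver shape is easy to show in Python. This is a hedged sketch only; `run_chunk` is a hypothetical placeholder where real code would invoke the external tool (e.g. via `subprocess`) on one genome chunk:

```python
from concurrent.futures import ThreadPoolExecutor

def run_chunk(chunk_path):
    """Placeholder worker: real code would call the LTR pipeline on
    this chunk (an external process, so threads suffice here).
    Returns a label so the driver logic can be demonstrated."""
    return f"done {chunk_path}"

def run_all(chunk_paths, workers=4):
    """Process up to `workers` genome chunks concurrently,
    preserving the input order in the results."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(run_chunk, chunk_paths))
```

Threads (rather than processes) are a reasonable choice when each worker just waits on an external command; the heavy lifting happens outside the interpreter. Merging the per-chunk libraries afterwards still has the redundancy problem described above, so deduplication would be needed either way.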