TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Limiting Repeatmodeler rounds to 4/5 ?? #96

Closed DhakadPankaj closed 7 months ago

DhakadPankaj commented 7 months ago

Hi,

I'm running Earl Grey on a Chymomyza amoena assembly (~380 Mb), but RepeatModeler is taking too long to run. Is there a way to limit the number of RepeatModeler rounds in this pipeline? I have around 300 genomes of about the same size, so I was hoping to reduce the runtime for each genome without compromising too much on the TE annotations.

Thanks!

TobyBaril commented 7 months ago

Hi,

The default is set in RepeatModeler as 6 rounds. There are some triggers in Earl Grey to run fewer rounds if the full run fails. Generally I wouldn't recommend running fewer rounds, as each ReMo round samples progressively more of the input genome, often detecting new TE families, particularly those of low copy number. Reducing the rounds does run the risk of compromising the quality of your TE library, but this depends on your overall question: if you are only interested in highly abundant TE families then this should be fine, but if you care about the full TE repertoire you could potentially be missing a lot.

One way to check the impact would be to run RepeatModeler with 6, 5, and 4 rounds on the same assembly and then compare the number of consensi, their lengths, and their classifications, to see whether the reduced sampling has an acceptable impact on your annotation results.
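As a rough illustration of that comparison, here is a minimal Python sketch that summarises one consensus library at a time. It assumes each library is a plain FASTA file whose headers look like `>family_name#Class/Superfamily ...` (the layout RepeatModeler typically writes); the function names are made up for this example and are not part of Earl Grey or RepeatModeler.

```python
from collections import Counter
from statistics import median


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarise_library(path):
    """Count consensus sequences, their lengths, and their top-level classes."""
    lengths, classes = [], Counter()
    for header, seq in read_fasta(path):
        name = header.split()[0]
        # Assumed header layout: "family_name#Class/Superfamily"; anything
        # without a "#" is counted as "Unknown".
        label = name.split("#", 1)[1] if "#" in name else "Unknown"
        classes[label.split("/")[0]] += 1
        lengths.append(len(seq))
    return {
        "n_consensi": len(lengths),
        "median_length": median(lengths) if lengths else 0,
        "total_bp": sum(lengths),
        "by_class": dict(classes),
    }
```

Calling `summarise_library` on the library from each of the 6-, 5-, and 4-round runs and putting the three dictionaries side by side gives a quick first look at what is lost. It only counts sequences, so it will not tell you whether a family in one run corresponds to a family in another.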

If you want to modify Earl Grey to use fewer ReMo rounds, add -genomeSampleSizeMax 81000000 to line 143 of the main script to limit to 5 rounds, or -genomeSampleSizeMax 27000000 to limit to 4 rounds. However, I would not recommend this without first benchmarking the impact of reducing the number of rounds on your genomes of interest.
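To see where those two values come from, here is a small Python sketch of the sampling arithmetic. It assumes RepeatModeler's usual schedule: round 1 samples about 40 Mb, round 2 samples 3 Mb, and each later round triples the sample until it would exceed -genomeSampleSizeMax. Those constants are assumptions from memory rather than values read from Earl Grey, so check them against your RepeatModeler version.

```python
ROUND1_SAMPLE_BP = 40_000_000      # assumed round 1 sample size
FIRST_RECON_SAMPLE_BP = 3_000_000  # assumed round 2 sample size
GROWTH_FACTOR = 3                  # assumed per-round increase


def round_samples(genome_sample_size_max):
    """Return the per-round sample sizes (bp) allowed under the given cap."""
    samples = [ROUND1_SAMPLE_BP]
    size = FIRST_RECON_SAMPLE_BP
    while size <= genome_sample_size_max:
        samples.append(size)
        size *= GROWTH_FACTOR
    return samples


for cap in (243_000_000, 81_000_000, 27_000_000):
    samples = round_samples(cap)
    total_mb = sum(samples) / 1e6
    print(f"cap={cap:>11,} rounds={len(samples)} total sampled={total_mb:.0f} Mb")
```

Under these assumptions a cap of 81000000 stops after 5 rounds and a cap of 27000000 after 4. The totals are upper bounds: a genome smaller than the cumulative sample is simply sampled in full.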

DhakadPankaj commented 7 months ago

Hi, thanks for the suggestion! I'll check the impact of reducing rounds on the TE repertoire in a few bigger genomes. Also, I think a maximum of 5 rounds will be enough for most of my genomes, as the assembly size is ~150 Mb?

TobyBaril commented 7 months ago

Great! At 150 Mb you'll be sampling the whole genome with 5 rounds (~160 Mb limit).