TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

runtime in TEstrainer ultralong for *some* genomes #135

Open estolle opened 2 months ago

estolle commented 2 months ago

Hi,

After running EG v4.2.4 for a while without major problems, we recently upgraded to 4.4.0 and have a couple of issues.

One issue seems to be with TEstrainer, where the run time explodes into weeks or months: it always proceeds to the next round, and each round takes longer than the last. This happens in only a few genomes, and so far we could not pinpoint the underlying issue. In the log it hangs at 99% on some rnd-5 repeat. In different runs of the same genome it is sometimes a Tc4 Mariner and sometimes a LINE/R1. I can't see an obvious reason for this.

Our genome is small (>400Mb) and not really overloaded with repeats. Other genomes run normally. We have 3 publicly available bumblebee genomes which seem to have the same issue.

RepeatMasker is v.4.15 (shipped with conda), configured with a repeatlib from dfam3.7 and RepBase. Earl Grey (v4.4.0) is invoked like this:

earlGrey -g $REF -s $SPECIES -o $OUTPUT -t $CPUs -c yes -m yes -d yes


To align:
Run 1: 1071
Run 2: 768
Run 3: 655
Run 4: 590
Run 5: 532
Run 6: 494
Run 7: 462
Run 8: 422
Run 9: 381
Run 10: 340

We are now at run 10 in ./Osmia.cornuta_strainer/TS_Osmia.cornuta-families.fa_5884/. From one run to the next, the number of repeats to align decreases, but the time each run takes to complete increases, possibly exponentially.
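To make the trend in the "to align" counts easier to see, here is a minimal Python sketch that tabulates the drop from one run to the next. The counts are copied from the list above; the function name `per_run_change` is just a hypothetical helper for illustration, not part of Earl Grey or TEstrainer.

```python
# Sequences left "to align" at the start of each TEstrainer run,
# copied from the counts reported in this comment (run 1 .. run 10).
to_align = [1071, 768, 655, 590, 532, 494, 462, 422, 381, 340]


def per_run_change(counts):
    """Return (run, count, absolute drop, percent drop) for runs 2..n."""
    rows = []
    for run, (prev, cur) in enumerate(zip(counts, counts[1:]), start=2):
        drop = prev - cur
        rows.append((run, cur, drop, 100.0 * drop / prev))
    return rows


if __name__ == "__main__":
    for run, cur, drop, pct in per_run_change(to_align):
        print(f"run {run:2d}: {cur:4d} to align ({drop:3d} fewer, {pct:4.1f}% drop)")
```

The count shrinks every run but never reaches zero within ten runs, so a few hundred families are still being carried into each further round.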

For a few genomes we also see RepeatModeler failing (after round 5); in Earl Grey 4.2.4 it seems to have worked fine. RepeatModeler failed for these (all 3 genomes are high quality/highly contiguous and should have 15-30% repeats):
Bombus affinis GCF_024516045.1
Bombus dahlbomii GCA_037178635.1
Bombus hortorum GCA_905332935.1

RepeatMasker failed for Bombus bicoloratus, a MaSuRCA assembly based on Illumina reads from PRJNA508540 (here, assembly quality and a lack of repeats may explain the issue).

Any pointers as to what the issue may be? Thanks a lot!

estolle commented 2 months ago

One of the species which took extremely long but eventually finished after 10 days @ 25 threads: Bombus hypnorum: GCA_911387925.2

with EG v4.2.4 with RM 4.16 it took 33 hours @ 40 CPUs
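For a rough sense of scale, the two runs can be compared in both wall-clock time and thread-hours. This is only a back-of-the-envelope sketch: it assumes the 10-day run is the one done after the upgrade, and thread-hours assume perfect scaling across threads, which real pipelines rarely achieve.

```python
def thread_hours(wall_hours, threads):
    """Wall-clock hours multiplied by the number of threads used."""
    return wall_hours * threads


# Figures reported in this comment.
slow_wall = 10 * 24   # 10 days at 25 threads (the extremely long run)
fast_wall = 33        # 33 hours at 40 CPUs (EG v4.2.4)

slow_total = thread_hours(slow_wall, 25)
fast_total = thread_hours(fast_wall, 40)

wall_ratio = slow_wall / fast_wall
thread_ratio = slow_total / fast_total

if __name__ == "__main__":
    print(f"wall-clock: {slow_wall} h vs {fast_wall} h ({wall_ratio:.1f}x)")
    print(f"thread-hours: {slow_total} vs {fast_total} ({thread_ratio:.1f}x)")
```

Even after normalising for the different thread counts, the slow run used several times more compute than the v4.2.4 run.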

From the log I gathered the steps that were run; this is what it looks like when it gets stuck at 99%:

       <<< Straining TEs and Refining de novo Consensus Sequences >>>
Splitting run 1
Initial trf check for 1
Initial blast and preparation for MSA 1

0% 0:1229=0s rnd-1_family-5.fasta                                               sh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
0% 0:1229=0s rnd-1_family-5.fasta                                               Hold your horses, rnd-1_family-0#RC/Helitron is likely a tandem repeat
/home/estolle/progz/conda_envs/earlgrey/lib/python3.9/site-packages/pyranges/methods/init.py:45: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  return {k: v for k, v in df.groupby(grpby_key)}
0% 1:1228=0s rnd-1_family-2.fasta                                               sh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
0% 1:1228=20m28s rnd-1_family-2.fasta                                           Hold your horses, rnd-1_family-16#Unknown is likely a tandem repeat
/home/estolle/progz/conda_envs/earlgrey/lib/python3.9/site-packages/pyranges/methods/init.py:45: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  return {k: v for k, v in df.groupby(grpby_key)}

99% 1228:1=0s rnd-5_family-975.fasta                                            sh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
99% 1228:1=0s rnd-5_family-975.fasta
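To find out which family a run is stuck on without scrolling through the warnings, the progress lines can be pulled out of the log. The sketch below assumes each progress line has the layout `<percent>% <done>:<remaining>=<eta> <job>`, which is inferred only from the excerpt above; the helper name `last_progress` is hypothetical.

```python
import re

# Assumed layout of a progress line, inferred from the log excerpt:
#   "<percent>% <done>:<remaining>=<eta> <job>"
PROGRESS = re.compile(r"^(\d+)% (\d+):(\d+)=(\S+) (\S+)")


def last_progress(lines):
    """Return the last progress record found in the log, or None."""
    last = None
    for line in lines:
        match = PROGRESS.match(line)
        if match:
            last = {
                "percent": int(match.group(1)),
                "done": int(match.group(2)),
                "remaining": int(match.group(3)),
                "eta": match.group(4),
                "job": match.group(5),
            }
    return last


# Lines taken from the log above, with runs of spaces collapsed.
sample = [
    "0% 1:1228=0s rnd-1_family-2.fasta sh: /dev/tty: No such device or address",
    "sh: /dev/tty: No such device or address",
    "99% 1228:1=0s rnd-5_family-975.fasta",
]

if __name__ == "__main__":
    print(last_progress(sample))
```

On a real log, the lines could be read with `open(path).read().splitlines()` and passed to `last_progress` in the same way.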

# at another run it was this repeat (incl sequence):
stalling at 99% rnd-5_family-978
>rnd-5_family-978#DNA/TcMar-Tc4 ( Recon Family Size = 21, Final Multiple Alignment Size = 21 )
>rnd-5_family-978
CCTGGTCCCAGAAAGTCCGTAATAATATTAATGATATTAATTTACTTCTT
CAATACGTTAGACACATGAATGGCGCACCTGAAGCTTCATATTCTTCACG
ATCCCAATGAAAATGTACATGTAACATACTTATAGTGGTCAAAATATGTA
ATATAGTGGAATTCGGTTATTATAATCACTTTGGAACTTGTAGGAGAGGG
GAAAAAAAGGCACTTCGTCTCGGGATCGTTTGTCAGATTCGTCAGTGGGG
CTGACAACAGACGATCCCTTTCTTTTTTTGTTAAATCGTTTTACTGTTTT
GTGTTATGCAATGGTACTTTATCGGTACTTAGCCTATGTGAATACAAAAA
CTAACTTAAAACAAATTCAACTCTCTCTCTTTGGTAATTCACAGTCAGCT
TACACGTTACTGGACGCGACTCGAAGGCTAAATCTATTTAGGGTTCTACC
GCATTGAACAACTTATAGACTAACACATTGTGCAGTGGGCTCCGCTCTGG
GGCAAGGGGAAAGAAATAAAAATTTAAGGGACAAGTAAAAACCCCGCGGC
TCACAAGTAAAGGTCTCAAATGGACTAATGCAGTTGCAGACTGGGACATA
TACCATTCGAAAGCAGTAAGTGTAACTAAAATTTGAGCAAAATTTTAGGT
GCAATGGGCGAAGGGTTTAGAAATGATGGGCCTTTGAAGTTAACATGGGC
TTCTATGGCAAATGTCACAAAACGTTTATTTTATCTGTTTTTTCATGCTC
AACGATTTAAAAACATGGCAGACGAGTAAAAAATCTTCAAGTTGCAATGC
AAATTGTTATTTTGAAACTGATGCACTACTTATTACAATCAATGTGTAGT
TCAGAAAACAGGCAGTGTTTCATACAGATACTCATGTCACAATGAGCGCA
CCTGACGAAAGCAATAGAATTGCAATCAGCTTCGAAGCACTTGTGGTTCA
CAATGCCATCGAAACAGAAGTCGTACGGCGCGATCCATGTTCCGGGTTCC
TCATCCAGGTATCCTGATTTGTGGAATGCAAATCGAATCATGTTCCTGTC
AGTCTGCAACTGCATTAGTCCATTTGAGACCTTTACTTGTCAGTCTTATA
AATTTGTTTTTCTTAATCTAACTTATACTAACTTATTGCTAGGTGGGTGG
GGTTTGTACAGGGGGAAGCTTATGAACTAACAGTTCTATGTACAATGGGG
ATCAGCAATTGGGCTATCTCTATTTATGTACTATGTGCGATTCTTACGCA
TAATTAGAGCGGGTAGGTGGACGCTATTCATGGATACCCACGAGTATTTT
TTAAAATTGGTCCGTGAGAAATCAGTTTTGTCCCTAGGAGGGATAGCGGG
TGAGAATTTAAAATTAATGTGGCCTGGTCGAGTATTCGCGGCGCTAAGTG
CGAGTCGCTTGTTGGCTTTGTTACGTCGCCAATGATAGATGATAGGTATG
TTTAGTGCCGCAATAGGTGAAGGCTTGTGGGGTAACGTAACCTGTAGCAA
TTTGTCGCAGTGCGGAAGGAGGGTCAGTGTCAGCGAGTTGCTGAATGATT
GGGTTCGGAATCTTGGGAAGTTAACAAAAATAGTTCCTGATGAGCATTAG
CATGAAATTATCTATTCTTGGGATATTGACTTTATTGAAGATGGTTGAGT
TGTTGACATATCGTTGGTGTATGCGANTTGCGTGGTCTGCGAGGTCNGAT
AAAACGATACGCCGTTACGTAATACAACAGTTTATTTACAACACAGAAAA
CTATTTACAAG

TobyBaril commented 2 months ago

hmm, this is a weird one, but there have been some interesting edge cases with TEstrainer... @jamesdgalbraith might have some better insight...

TobyBaril commented 2 months ago

Regarding the failures with RepeatModeler and RepeatMasker - can you successfully run these programs on these genomes in isolation? If so, it could be an issue with the random seed that was used in the Earl Grey run picking up some weird features in the input genome. Regarding RepeatMasker failing, does this also run in isolation successfully?

TEstrainer is likely hanging due to some strange genomic features or weird repeats. Do the failing genomes group together phylogenetically? If so there is potentially something strange restricted to these ones. This could be a biological issue causing a computational one...

estolle commented 2 months ago

Hi,

We have not yet run these separately in RM to test this but will try now.

For the elongated TEstrainer runtimes, it seems a bit random: the affected genomes are not really phylogenetically related, and thus far I have no hint as to what the issue is. These all seem to be fairly new genomes, i.e. quite contiguous, perhaps with longer stretches of tandem repeats or something similar. Some genomes are public (see above) and some we generated ourselves (Nanopore). If it would be helpful, we are happy to share them for troubleshooting.

TobyBaril commented 1 month ago

Okay great, it would be good to know if the RM runs complete successfully outside of the pipeline. An alternative might also be to increase the minimum # of sequences required to generate a new consensus. The default is 3 and can sometimes pick up segmental duplications leading to long extensions. This can be changed with the -a flag in the latest versions (I think this was added in 4.4.4 maybe...).

TobyBaril commented 1 month ago

Hi @estolle, some of the solutions suggested in #145 might help for these strange genomes. A combination of reducing the # of iterations (for most genomes, 5 rounds with -i should be fine, as that equates to extending a consensus by a max of 10 kb) and also changing the minimum number of sequences might help in this scenario!
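As a rough sketch of what that combination could look like, the snippet below assembles an earlGrey command line from the flags mentioned in this thread (`-g -s -o -t -c -m -d` from the original invocation, plus `-i` and `-a`). The file names, species label, and the `-a` value of 5 are placeholders chosen for illustration, and `-a` is only available in versions that include it, so check `earlGrey -h` for your install first.

```python
import shlex


def earlgrey_command(genome, species, outdir, threads, iterations=5, min_seqs=None):
    """Build an earlGrey argument list using only flags named in this thread."""
    cmd = [
        "earlGrey",
        "-g", genome,
        "-s", species,
        "-o", outdir,
        "-t", str(threads),
        "-c", "yes", "-m", "yes", "-d", "yes",
        "-i", str(iterations),  # number of TEstrainer iterations
    ]
    if min_seqs is not None:
        # Minimum number of sequences needed for a new consensus (default 3).
        cmd += ["-a", str(min_seqs)]
    return cmd


# Placeholder inputs; min_seqs=5 is an arbitrary example value above the default.
cmd = earlgrey_command("genome.fa", "exampleSpecies", "eg_out", 25,
                       iterations=5, min_seqs=5)

if __name__ == "__main__":
    print(shlex.join(cmd))
```

Building the command as a list keeps each flag next to its value, and leaving `min_seqs` unset omits `-a` entirely for versions that do not support it.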