TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

runtime in TEstrainer ultralong for *some* genomes #135

Open estolle opened 1 week ago

estolle commented 1 week ago

Hi,

After running EG v4.2.4 for a while without major problems, we recently upgraded to 4.4.0 and have a couple of issues.

One issue seems to be with TEstrainer, where the run time explodes into weeks/months: it always goes into the next round, and the runtime increases with each round. This happens in only a very few genomes, and so far we could not pinpoint the underlying issue. In the log it hangs at 99% on some rnd-5 repeat. In different runs of the same genome it is sometimes a Tc4 Mariner, sometimes a LINE/R1. I can't see an obvious reason for this.

Our genome is small (>400Mb) and not really overloaded with repeats. Other genomes run normally. We have 3 publicly available bumblebee genomes which seem to have the same issue.

RepeatMasker is v.4.15 (shipped with conda), configured with a repeat library from dfam3.7 and RepBase. Earl Grey (v4.4.0) is invoked like this:

earlGrey -g $REF -s $SPECIES -o $OUTPUT -t $CPUs -c yes -m yes -d yes


To align:
Run1 1071
Run2 768
Run3 655
Run4 590
Run5 532
Run6 494
Run7 462
Run8 422
Run9 381
Run10 340

We are now at run 10 in ./Osmia.cornuta_strainer/TS_Osmia.cornuta-families.fa_5884/. Between runs, the number of repeats to align decreases, but the time each run takes to complete increases, possibly exponentially.
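
As a rough aid for reading the figures above, here is a minimal Python sketch that computes how much the number of sequences to align shrinks from one run to the next. The counts are copied from the "To align" figures in this comment; the helper name `percent_changes` is only illustrative and is not part of Earl Grey or TEstrainer. Per-run wall-clock times are not listed in the text, so they are not included.

```python
# Counts copied from the "To align" figures reported above (runs 1 to 10).
to_align = [1071, 768, 655, 590, 532, 494, 462, 422, 381, 340]


def percent_changes(counts):
    """Return the percentage change between each pair of consecutive runs."""
    return [100.0 * (b - a) / a for a, b in zip(counts, counts[1:])]


changes = percent_changes(to_align)
for run, (count, change) in enumerate(zip(to_align[1:], changes), start=2):
    print(f"run {run}: {count} sequences to align ({change:+.1f}% vs previous run)")
```

If per-run durations are recorded alongside these counts, dividing duration by count gives a time-per-sequence figure that makes the slowdown easier to compare across genomes.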

For a few genomes we also see RepeatModeler failing (after round 5); with Earl Grey 4.2.4 it seems to have worked fine.

RepeatModeler failed (all 3 genomes are high quality/highly contiguous and should have 15-30% repeats):
Bombus affinis GCF_024516045.1
Bombus dahlbomii GCA_037178635.1
Bombus hortorum GCA_905332935.1

RepeatMasker failed:
Bombus bicoloratus, MaSuRCA assembly based on Illumina reads from PRJNA508540 (here, assembly quality and lack of repeats may explain the issue)

Any pointers as to what the issue may be? Thanks a lot!

estolle commented 1 week ago

One of the species which took extremely long but eventually finished after 10 days @ 25 threads: Bombus hypnorum: GCA_911387925.2

With EG v4.2.4 and RM 4.16 it took 33 hours @ 40 CPUs.

From the log I gathered that these were the steps that were run, and this is how it looks when it gets stuck at 99%:

       <<< Straining TEs and Refining de novo Consensus Sequences >>>
Splitting run 1
Initial trf check for 1
Initial blast and preparation for MSA 1

0% 0:1229=0s rnd-1_family-5.fasta                                               sh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
0% 0:1229=0s rnd-1_family-5.fasta                                               Hold your horses, rnd-1_family-0#RC/Helitron is likely a tandem repeat
/home/estolle/progz/conda_envs/earlgrey/lib/python3.9/site-packages/pyranges/methods/init.py:45: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  return {k: v for k, v in df.groupby(grpby_key)}
0% 1:1228=0s rnd-1_family-2.fasta                                               sh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
0% 1:1228=20m28s rnd-1_family-2.fasta                                           Hold your horses, rnd-1_family-16#Unknown is likely a tandem repeat
/home/estolle/progz/conda_envs/earlgrey/lib/python3.9/site-packages/pyranges/methods/init.py:45: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  return {k: v for k, v in df.groupby(grpby_key)}

99% 1228:1=0s rnd-5_family-975.fasta                                            sh: /dev/tty: No such device or address
sh: /dev/tty: No such device or address
99% 1228:1=0s rnd-5_family-975.fasta
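
To find which family a run is stuck on without scrolling through the warnings, a small parser over the log can help. The sketch below assumes that each progress line has the layout `<percent>% <done>:<remaining>=<eta> <file>`, which is inferred only from the lines shown above (the two numbers always sum to 1229), not from TEstrainer documentation. The function name `last_progress` is hypothetical.

```python
import re

# Assumed layout of a progress record, inferred from the log above:
# "<percent>% <done>:<remaining>=<eta> <file>.fasta"
PROGRESS = re.compile(r"(\d+)%\s+(\d+):(\d+)=(\S+)\s+(\S+\.fasta)")


def last_progress(lines):
    """Return the last progress record found in an iterable of log lines, or None."""
    last = None
    for line in lines:
        # finditer copes with several records on one line (e.g. joined by "\r").
        for match in PROGRESS.finditer(line):
            percent, done, remaining, eta, name = match.groups()
            last = {
                "percent": int(percent),
                "done": int(done),
                "remaining": int(remaining),
                "eta": eta,
                "file": name,
            }
    return last


# Sample lines copied from the log above.
sample = [
    "0% 1:1228=20m28s rnd-1_family-2.fasta                                           "
    "Hold your horses, rnd-1_family-16#Unknown is likely a tandem repeat",
    "sh: /dev/tty: No such device or address",
    "99% 1228:1=0s rnd-5_family-975.fasta",
]
record = last_progress(sample)
print(record)
```

Note that the file named on a progress line is not necessarily the job still running; with one job remaining, it is only a starting point for checking which family has no finished output.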

# at another run it was this repeat (incl sequence):
stalling at 99% rnd-5_family-978
>rnd-5_family-978#DNA/TcMar-Tc4 ( Recon Family Size = 21, Final Multiple Alignment Size = 21 )
>rnd-5_family-978
CCTGGTCCCAGAAAGTCCGTAATAATATTAATGATATTAATTTACTTCTT
CAATACGTTAGACACATGAATGGCGCACCTGAAGCTTCATATTCTTCACG
ATCCCAATGAAAATGTACATGTAACATACTTATAGTGGTCAAAATATGTA
ATATAGTGGAATTCGGTTATTATAATCACTTTGGAACTTGTAGGAGAGGG
GAAAAAAAGGCACTTCGTCTCGGGATCGTTTGTCAGATTCGTCAGTGGGG
CTGACAACAGACGATCCCTTTCTTTTTTTGTTAAATCGTTTTACTGTTTT
GTGTTATGCAATGGTACTTTATCGGTACTTAGCCTATGTGAATACAAAAA
CTAACTTAAAACAAATTCAACTCTCTCTCTTTGGTAATTCACAGTCAGCT
TACACGTTACTGGACGCGACTCGAAGGCTAAATCTATTTAGGGTTCTACC
GCATTGAACAACTTATAGACTAACACATTGTGCAGTGGGCTCCGCTCTGG
GGCAAGGGGAAAGAAATAAAAATTTAAGGGACAAGTAAAAACCCCGCGGC
TCACAAGTAAAGGTCTCAAATGGACTAATGCAGTTGCAGACTGGGACATA
TACCATTCGAAAGCAGTAAGTGTAACTAAAATTTGAGCAAAATTTTAGGT
GCAATGGGCGAAGGGTTTAGAAATGATGGGCCTTTGAAGTTAACATGGGC
TTCTATGGCAAATGTCACAAAACGTTTATTTTATCTGTTTTTTCATGCTC
AACGATTTAAAAACATGGCAGACGAGTAAAAAATCTTCAAGTTGCAATGC
AAATTGTTATTTTGAAACTGATGCACTACTTATTACAATCAATGTGTAGT
TCAGAAAACAGGCAGTGTTTCATACAGATACTCATGTCACAATGAGCGCA
CCTGACGAAAGCAATAGAATTGCAATCAGCTTCGAAGCACTTGTGGTTCA
CAATGCCATCGAAACAGAAGTCGTACGGCGCGATCCATGTTCCGGGTTCC
TCATCCAGGTATCCTGATTTGTGGAATGCAAATCGAATCATGTTCCTGTC
AGTCTGCAACTGCATTAGTCCATTTGAGACCTTTACTTGTCAGTCTTATA
AATTTGTTTTTCTTAATCTAACTTATACTAACTTATTGCTAGGTGGGTGG
GGTTTGTACAGGGGGAAGCTTATGAACTAACAGTTCTATGTACAATGGGG
ATCAGCAATTGGGCTATCTCTATTTATGTACTATGTGCGATTCTTACGCA
TAATTAGAGCGGGTAGGTGGACGCTATTCATGGATACCCACGAGTATTTT
TTAAAATTGGTCCGTGAGAAATCAGTTTTGTCCCTAGGAGGGATAGCGGG
TGAGAATTTAAAATTAATGTGGCCTGGTCGAGTATTCGCGGCGCTAAGTG
CGAGTCGCTTGTTGGCTTTGTTACGTCGCCAATGATAGATGATAGGTATG
TTTAGTGCCGCAATAGGTGAAGGCTTGTGGGGTAACGTAACCTGTAGCAA
TTTGTCGCAGTGCGGAAGGAGGGTCAGTGTCAGCGAGTTGCTGAATGATT
GGGTTCGGAATCTTGGGAAGTTAACAAAAATAGTTCCTGATGAGCATTAG
CATGAAATTATCTATTCTTGGGATATTGACTTTATTGAAGATGGTTGAGT
TGTTGACATATCGTTGGTGTATGCGANTTGCGTGGTCTGCGAGGTCNGAT
AAAACGATACGCCGTTACGTAATACAACAGTTTATTTACAACACAGAAAA
CTATTTACAAG

TobyBaril commented 4 days ago

hmm, this is a weird one, but there have been some interesting edge cases with TEstrainer... @jamesdgalbraith might have some better insight...

TobyBaril commented 3 days ago

Regarding the failures with RepeatModeler and RepeatMasker - can you successfully run these programs on these genomes in isolation? If so, it could be an issue with the random seed that was used in the Earl Grey run picking up some weird features in the input genome. Regarding RepeatMasker failing, does this also run in isolation successfully?
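
For reference, a minimal sketch of what running the two programs in isolation could look like, wrapped in Python so the commands can be inspected before anything is executed. The file names, database name and output directory are placeholders, and this is not necessarily how Earl Grey itself calls these tools. `BuildDatabase -name`, `RepeatModeler -database` and `RepeatMasker -pa/-lib/-dir` are standard options; `-threads` is assumed to be available in the installed RepeatModeler (older releases use `-pa` instead).

```python
import shlex
import subprocess


def isolation_commands(genome, db_name, library, out_dir, threads=8):
    """Build argv lists for running RepeatModeler and RepeatMasker outside Earl Grey."""
    return [
        # Build the sequence database that RepeatModeler works on.
        ["BuildDatabase", "-name", db_name, genome],
        # De novo repeat discovery (assumption: this release accepts -threads).
        ["RepeatModeler", "-database", db_name, "-threads", str(threads)],
        # Mask the genome with a custom library (placeholder path).
        ["RepeatMasker", "-pa", str(threads), "-lib", library, "-dir", out_dir, genome],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder inputs for illustration only.
commands = isolation_commands("genome.fa", "genome_db", "custom_library.fa", "rm_out", threads=8)
run_all(commands)  # dry run: prints the commands without executing them
```

Running each step separately makes it easier to see whether a failure comes from the tool itself or only appears inside the pipeline.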

TEstrainer is likely hanging due to some strange genomic features or weird repeats. Do the failing genomes group together phylogenetically? If so there is potentially something strange restricted to these ones. This could be a biological issue causing a computational one...
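
One quick way to look for "weird repeats" in a stalling consensus is to count k-mers that occur more than once, or whose reverse complement also occurs, which can hint at internal direct or inverted repeats. The sketch below is a crude stand-alone screen and not what TEstrainer does internally; the function names and the choice of `k` are illustrative assumptions.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]


def screen_consensus(seq, k=12):
    """Summarise length, N content and repeated k-mers in a consensus sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = set(counts)
    # Distinct k-mers that occur more than once on the same strand.
    direct = sum(1 for c in counts.values() if c > 1)
    # Distinct k-mers whose reverse complement is also present (self-palindromes excluded).
    inverted = sum(1 for km in kmers if revcomp(km) != km and revcomp(km) in kmers)
    return {
        "length": len(seq),
        "n_bases": seq.count("N"),
        "direct_repeat_kmers": direct,
        "inverted_repeat_kmers": inverted,
    }


# Toy input for illustration; paste a real consensus (without the FASTA header) instead.
toy = "ACGTACGTAC"
print(screen_consensus(toy, k=4))
```

High counts relative to the sequence length would suggest internal repeat structure worth inspecting by eye, for example with a self-dotplot, before concluding anything about the cause of the hang.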