TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline
Other
131 stars 19 forks source link

Only generating pie plot and no other summary plots #126

Closed johannabeam closed 1 month ago

johannabeam commented 2 months ago

Hi all,

I've been playing around with this lovely pipeline with some bird genomes. These have been taking quite a long time to run (over 300 hours), and only generated one plot before finishing up. I installed via mamba - version 4.2.4. My command was:

earlGrey -g cardellinaCanadensis.fna -s cardellinaCanadensis -t 5 -o /output_cardellinaCanadensis -r eukarya

My summary files:

7.9K cardellinaCanadensis.summaryPie.pdf
564 cardellinaCanadensis.highLevelCount.txt
622K cardellinaCanadensis.familyLevelCount.txt
27M cardellinaCanadensis.filteredRepeats.bed
62M cardellinaCanadensis.filteredRepeats.gff
35M cardellinaCanadensis_combined_library.fasta
3.3M cardellinaCanadensis-families.fa.strained

End of my log file:

    <<< Generating Summary Plots >>>
 [1] "/storage/home/jkb6436/work/conda_envs/earlgrey/lib/R/bin/exec/R"                                                                                                                         
 [2] "--no-echo"                                                                                                                                                                               
 [3] "--no-restore"                                                                                                                                                                            
 [4] "--file=/storage/home/jkb6436/work/conda_envs/earlgrey/share/earlgrey-4.2.4-1/scripts//autoPie.R"                                                                                         
 [5] "--args"                                                                                                                                                                                  
 [6] "/storage/group/dut374/default/johanna/trans_elemt/eg/output_cardcan/cardellinaCanadensis_EarlGrey/cardellinaCanadensis_mergedRepeats/looseMerge/cardellinaCanadensis.filteredRepeats.bed"
 [7] "/storage/group/dut374/default/johanna/trans_elemt/eg/output_cardcan/cardellinaCanadensis_EarlGrey/cardellinaCanadensis_mergedRepeats/looseMerge/cardellinaCanadensis.filteredRepeats.gff"
 [8] "1138888235"                                                                                                                                                                              
 [9] "/storage/group/dut374/default/johanna/trans_elemt/eg/output_cardcan/cardellinaCanadensis_EarlGrey/cardellinaCanadensis_summaryFiles/cardellinaCanadensis.summaryPie.pdf"                 
[10] "/storage/group/dut374/default/johanna/trans_elemt/eg/output_cardcan/cardellinaCanadensis_EarlGrey/cardellinaCanadensis_summaryFiles/cardellinaCanadensis.highLevelCount.txt"             
Splitting repeat library
Reading in gff
Starting calculations
WARNING. chromosome (scaffold4) was not found in the FASTA file. Skipping.
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/storage/home/jkb6436/work/conda_envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/storage/home/jkb6436/work/conda_envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/storage/home/jkb6436/work/conda_envs/earlgrey/share/earlgrey-4.2.4-1/scripts//divergenceCalc/divergence_calc.py", line 140, in outer_func
    Kdist = Kimura80(str(aln[0].seq).upper(), str(aln[1].seq).upper())
  File "/storage/home/jkb6436/work/conda_envs/earlgrey/share/earlgrey-4.2.4-1/scripts//divergenceCalc/divergence_calc.py", line 89, in Kimura80
    p = ts/aln_len
ZeroDivisionError: division by zero
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/storage/home/jkb6436/work/conda_envs/earlgrey/share/earlgrey-4.2.4-1/scripts//divergenceCalc/divergence_calc.py", line 203, in <module>
    results = pool.map(func, chunks)
  File "/storage/home/jkb6436/work/conda_envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/storage/home/jkb6436/work/conda_envs/earlgrey/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
ZeroDivisionError: division by zero
Error in open.connection(con, open) : cannot open the connection
Calls: %>% ... connection -> connectionForResource -> open -> open.connection
In addition: Warning message:
In open.connection(con, open) :
  cannot open file '/storage/group/dut374/default/johanna/trans_elemt/eg/output_cardcan/cardellinaCanadensis_EarlGrey/cardellinaCanadensis_RepeatLandscape/cardellinaCanadensis.filteredRepeats.withDivergence.gff': No such file or directory
Execution halted
cp: cannot stat '/storage/group/dut374/default/johanna/trans_elemt/eg/output_cardcan/cardellinaCanadensis_EarlGrey/cardellinaCanadensis_RepeatLandscape/*.pdf': No such file or directory

This issue appears similar to #121 but I don't think mine is related to a FASTA index. Is there a way to re-run just the plots without rerunning the whole pipeline? For a larger bird genome (1.3GB) it is taking a very long time (<<< Done in 304:03:20.00 >>>) to complete the program. I currently have one other genome running that should be finishing up shortly so we will see if it has a similar issue or not! Thanks in advance!

TobyBaril commented 2 months ago

Hi,

There are some strange parsing issues sometimes happening depending on python versions at the moment (there have been lots of changes in pandas and numpy recently that have changed things). In this case, you should be able to simply rerun the last step of the pipeline by resubmitting exactly the same command as you did the first time round, and Earl Grey will skip all the steps where the outputs were successfully created, so it your case it should just regenerate plots. @jamesdgalbraith, I guess there might be an issue with dividing by 0 but this could be related to the input not being parsed properly??

Regarding runtime, on large genomes more threads should help if they are available. The runtime is limited by the repeat content of a given genome (which increases as genome size does), and is a result of the RepeatModeler run generally. On genomes >1GB, these can take upwards of a week on less than 16 threads.

I hope this helps!

johannabeam commented 2 months ago

Hi, I was able to re-run the last step but it gave me the same errors as before. Interestingly, with the other genome I was running, the plots were made just fine. Would it be worth trying to update earl grey and rerunning it again?

I had another question regarding the figure output of the repeat landscape plot. This is with the genome that worked (Icteria virens, a songbird). Mine doesn't appear to have the same axis values as the example plot on the homepage (percentage of genome on the y and CpG adjusted on the x). It also seems odd that there is relatively little recent TE activity on the right hand side of the plot, if I am interpreting this correctly. Is there a good way to interpret relative time with the Kimura 2-parameter distance? But, I am very new to TE analysis so I may be misunderstanding something here as well! icteriaVirens_classification_landscape

TobyBaril commented 2 months ago

Hi Johanna,

Good spot! I haven't updated the README plot in a while. We have moved away from the calculations done by RepeatMasker as these apply a CpG adjustment informed by the human genome, which wasn't particularly transferable or representative for other species. The latest version uses classic Kimura 2-parameter distance which should be more informative in a wider range of species.

You are correct in your interpretation, a Kimura distance of 0 shows that the TE is identical to the consensus sequence, and thus has likely inserted more recently, which ancient activity more to the left. There are of course some errors around divergence, because the consensus sequences are our best guess of the ancestral element sequence, but it has been made from copies in the genome, so could already look diverged. However, we can still get a good idea of relative TE activity within the species, and among species, but it is harder to date TE activity without some error.

It looks like there has been some recent LTR activity (green peak ~0.09), with a second more ancient burst of LINEs and LTRs ~0.20ish. This looks fairly typical, with bursts followed by suppression of TE acitivty. It would be cool to see what you find in your comparison species!

Regarding the other genome with the same errors, this is likely an issue with the calculation step for one or two entries - and an edge case that we hadn't yet tested for. @jamesdgalbraith is on holiday at the moment, but is going to look into this when he is back, we will keep you updated

TobyBaril commented 1 month ago

This should be fixed in v4.4.0. Let us know if it is still not working for you with the update!