TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Compared with RepeatModeler + RepeatMasker, Earl Grey masks a lower proportion of bases as transposons #117

Closed wanjiujiu closed 4 months ago

wanjiujiu commented 4 months ago

Hi, this is a very nice and smart tool and I like it, but I have some doubts after running it. When RepeatModeler + RepeatMasker and Earl Grey were each run on the same genome, Earl Grey annotated fewer transposons.

E.g. from RepeatMasker_Against_Custom_Library: 1719928275749
From RepeatModeler + RepeatMasker: 1719928426307

I'd appreciate it very much if you could help me answer this.

wanjiujiu commented 4 months ago

I've run this on several species. Here is the script I ran:

earlGrey -g cs.fa -s cs -o ./ -t 96 -l ./earlgrey/share/RepeatMasker/Libraries/RepeatMasker.lib

TobyBaril commented 4 months ago

Hi,

There are a few reasons you might observe this behaviour.

Firstly, you are looking at the RepeatMasker output for Earl Grey before the LTR structural and defragmentation steps have been performed. The pipeline outputs are all found in the summaryFiles folder, as described in the documentation.

Secondly, you are currently using the whole RepeatMasker library as an initial masking library, which is not recommended. In this case, you either need to use the ‘-r’ flag and specify a species of interest for an initial mask (if you want to use all known repeats, you can specify eukaryota), or, if a good library doesn’t already exist for your species of interest, perform only a de novo annotation. The latter is the recommended approach for high-quality TE annotation, because it provides the maximal information for de novo curation. Be aware that pre-masking can detrimentally impact the quality of de novo curation and lead to miscalculation of TE divergence, regardless of the approach or pipeline used.

Thirdly, Earl Grey will filter the de novo TEs annotated by RepeatModeler, as some could be satellites, duplications, or even host genes, which is why we still recommend some level of manual curation following any automatic repeat curation. Generally, we use highly stringent filters to reduce the risk of false positive annotations that could be present in raw RepeatModeler outputs. In all cases, care should be taken to avoid false annotations, as all automated pipelines are prone to them. Remember, more isn’t necessarily better if the annotated repeats are not really transposable elements.

I recommend checking your repeat libraries to determine which loci are not consistent, and looking into the annotations to determine what might lead to them being removed in each case. Earl Grey is generally expected to remove poor-quality annotations, as we aim to have higher confidence in what we do annotate. It may also be necessary to change some parameters for your particular species. I would also recommend rerunning the annotation without pre-masking with the whole of Dfam, as this can affect the quality of the de novo library afterwards.
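One way to find the inconsistent loci is to compare the two annotations as interval sets. The sketch below is only an illustration, not part of Earl Grey: it assumes both annotations have already been converted to tab-separated BED-style lines (sequence, 0-based start, end), and the function names `parse_bed`, `merge`, and `unmatched_loci` are hypothetical. It lists loci present in one annotation that have no overlap at all in the other.

```python
from bisect import bisect_left
from collections import defaultdict


def parse_bed(lines):
    """Group (start, end) intervals by sequence name from BED-style lines."""
    by_seq = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        seq, start, end = line.split("\t")[:3]
        by_seq[seq].append((int(start), int(end)))
    return by_seq


def merge(intervals):
    """Merge overlapping or adjacent half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def unmatched_loci(query, reference):
    """Return query loci that do not overlap any reference interval."""
    missing = []
    for seq, intervals in query.items():
        ref = merge(reference.get(seq, []))
        starts = [s for s, _ in ref]
        for start, end in sorted(intervals):
            # Last merged reference interval that starts before this locus ends.
            i = bisect_left(starts, end) - 1
            if i < 0 or ref[i][1] <= start:
                missing.append((seq, start, end))
    return missing


# Toy data standing in for two annotations of the same genome.
raw_repeatmasker = parse_bed(["chr1\t0\t100", "chr1\t200\t300", "chr2\t5\t10"])
earl_grey = parse_bed(["chr1\t50\t60", "chr1\t300\t400"])
for locus in unmatched_loci(raw_repeatmasker, earl_grey):
    print(*locus, sep="\t")
```

The loci it prints are the ones worth inspecting by hand to see why they were filtered. Note that RepeatMasker .out coordinates are 1-based and inclusive, so they need converting before being compared with BED intervals.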

wanjiujiu commented 4 months ago

It was very kind of you to get back to me so quickly! Regarding your first point, are you saying that the files I viewed in RepeatMasker_Against_Custom_Library are not the final result? I can't find the corresponding .masked, .out, and .tbl files in the summaryFiles folder you mentioned. Can I generate them from the files that are there? Thank you very much for your second suggestion! On that point, do you mean that the second of the 14 steps may affect the quality of the library later on, so it is recommended to skip this step? Here are two scripts I modified based on your suggestion. Which one do you think is more appropriate?

  1. earlGrey -g cs.fa -s cs -o ./ -r 6447 -t 72 -d yes
  2. earlGrey -g cs.fa -s cs -o ./ -t 72 -d yes

Once again, I'd like to express my gratitude to you for optimizing the TE annotation process and for your patience!

TobyBaril commented 4 months ago

The final results for Earl Grey are found in the summaryFiles directory. This is described in the README of this repository, along with a description of each file you will find there after a successful run (it is one of the first sections on the main page: https://github.com/TobyBaril/EarlGrey/). The final result is NOT a RepeatMasker run, so the file names will not be RepeatMasker outputs. There are several post-processing steps to refine annotations, which are described in the paper (https://academic.oup.com/mbe/article/41/4/msae068/7635926) under the implementation section. In this directory, you will find summary plots showing TE activity and quantification, along with annotations in GFF3 and BED format, and quantifications by TE family and superfamily. I encourage you to explore the output files to understand how Earl Grey reports its results.
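Because the final annotation is a BED/GFF3 file and not a RepeatMasker .tbl, a figure comparable to "bases masked" has to be computed from the intervals. The following is a minimal sketch, assuming a standard tab-separated BED file (0-based, half-open coordinates) and a genome size supplied by the reader. The helper names are hypothetical and this is not how Earl Grey itself produces its summaries.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def covered_fraction(bed_lines, genome_size):
    """Fraction of the genome covered by the annotated intervals."""
    by_seq = {}
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        by_seq.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    covered = sum(merged_length(iv) for iv in by_seq.values())
    return covered / genome_size


# Toy example: two overlapping repeats on chr1 and one on chr2, 1,000 bp genome.
example = ["chr1\t0\t100\tTE_a", "chr1\t50\t150\tTE_b", "chr2\t10\t20\tTE_c"]
print(f"{covered_fraction(example, 1000):.2%}")
```

For a real run, the same function can be fed an open file handle for the BED file in summaryFiles (the exact file name depends on the species name given with -s; check the README for the current naming).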

It depends on your species of interest. If you are annotating something for which a good TE library already exists, use an initial mask (e.g. for Drosophila melanogaster, human, mouse, etc.). If you do not have a high-quality library at the species or genus level, it is recommended to run a totally naive de novo search with no RepeatMasker search term, as this will give the de novo search more information to correctly generate consensus sequences and estimate divergence. This will prevent instances where, for example, a nearby species library contains a DNA element that is not identical but closely related to one in your species, so it is masked as that element family rather than a family from your species of interest. In some cases this can lead to partial masking, so the de novo tool will detect only the remaining part and make a partial consensus, because some of the information is masked. Then, when the final masking is performed, this can make the DNA element look more diverged than it actually is, as it is being compared to a different family from a different species rather than to its own family. The family from the other species will likely have experienced different selective pressures and be on a different evolutionary trajectory.
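The choice described above maps onto whether the -r flag is passed at all. As a rough sketch, using only flags that appear in this thread (-g, -s, -o, -t, -r), a small hypothetical helper could build either command. `build_earlgrey_command` and its `initial_mask` parameter are made up for illustration and are not part of Earl Grey.

```python
import shlex


def build_earlgrey_command(genome, species, out_dir, threads, initial_mask=None):
    """Build an earlGrey argument list; add -r only when an initial mask is wanted."""
    cmd = ["earlGrey", "-g", genome, "-s", species, "-o", out_dir, "-t", str(threads)]
    if initial_mask is not None:
        # Use an initial mask only when a good library exists for the species or genus.
        cmd += ["-r", str(initial_mask)]
    return cmd


# Naive de novo run (no RepeatMasker search term).
print(shlex.join(build_earlgrey_command("cs.fa", "cs", "./", 72)))

# Run with an initial mask, here using the "eukaryota" term mentioned earlier in the thread.
print(shlex.join(build_earlgrey_command("cs.fa", "cs", "./", 72, initial_mask="eukaryota")))
```

The returned list can be passed to `subprocess.run` unchanged if the command is to be launched from Python.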

wanjiujiu commented 4 months ago

Thank you very much for your patient answer. According to your advice, all my problems have been solved! Thank you again for your contributions in the field of TE!