TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

EG4.0.8 vs EG 4.2.4 #116

Open ekaralashvili opened 2 weeks ago

ekaralashvili commented 2 weeks ago

I ran the earlGrey pipeline version 4.0.8 (using RM4.1.5) on a genome, and now I have run earlGrey 4.2.4 (with RM4.1.6) on the same genome. The resulting plots look very different and I was wondering why. I'm attaching both versions of the repeat landscape and the pie chart.

This is the command I used both times: time earlGrey -g $REF -s $SPECIES -o $OUTPUT -t $CPUs -c yes -m yes -d yes

[Attached: summaryPie charts and repeat landscape plots for Camptopoeum friesei from both versions]

TobyBaril commented 1 week ago

Hi,

The change in unclassified sequences is most likely due to the way that RepeatMasker 4.1.6 has been configured. Which database partitions did you configure when installing RepeatMasker 4.1.6? Are these Dfam and RepBase, or only Dfam? Only curated elements will be used by RepeatClassifier to classify de novo TEs, so the underlying database can have a big impact on the final curation.

We have not extensively tested with RepeatMasker 4.1.6 (although I have done some testing), but it should work okay as long as the correct databases are installed and configured, and RepeatModeler has also been reconfigured following the installation of RepeatMasker. There might also have been an error during the RepeatClassifier step if this configuration has not worked - I can take a look if you are happy to share the log file!

Generally, unless there are significantly more curated sequences in Dfam 3.8, we recommend Dfam 3.7 and RepeatMasker 4.1.5 for stability. New changes to RepeatMasker are coming soon that should improve the current issues with the Dfam partitions.

Regarding the change in the landscape, @jamesdgalbraith will likely be able to give more insight. The new scripts have been updated to be more species-agnostic, as the original RepeatMasker scripts apply a CpG adjustment calibrated for human. Now, Kimura 2-parameter distance is used instead, which could lead to some changes. Although, looking at the scales, this change is most likely due to the shift from percentage of genome coverage to total number of base pairs, which changes the scaling of the bars.
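If it helps when comparing the two plot versions side by side, the new bp-scaled bars can be converted back to the old percentage scale with a trivial calculation (a sketch, not an EarlGrey function; use your own assembly size):

```python
def bp_to_percent(bp_in_bin: int, genome_size_bp: int) -> float:
    """Convert base pairs masked in a divergence bin into percent of
    genome coverage, so the new bp-scaled landscape bars can be
    compared against the old percentage-scaled plots."""
    return 100.0 * bp_in_bin / genome_size_bp

# e.g. 3.5 Mb in one bin of a 250 Mb genome -> 1.4 % of the genome
print(bp_to_percent(3_500_000, 250_000_000))
```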

I hope this helps, but feel free to ask anything else if needed!

jamesdgalbraith commented 1 week ago

Hi Elisabeth,

The change in the landscape could be due to a couple of factors.

The first is that our new method of calculating divergence simply calculates the Kimura 2-parameter distance using the number of matches, transitions and transversions from EMBOSS alignments of the genomic sequences to the curated consensus that RepeatMasker labels the genomic sequence as in its .out file (for the formula see Kimura 1980, https://doi.org/10.1007/bf01731581). The old plots simply used the output from RepeatMasker.

The problems with the RepeatMasker values are that a) the distances are adjusted based on CpG content using values calibrated specifically for the human genome, and b) the "Kimura Distance" provided by RepeatMasker is not necessarily the divergence from the sequence it's labelled as in the .out file. To explain this as simply as I can, the coordinates on one line/sequence in a RepeatMasker .out file are often an amalgamation of multiple overlapping, different RepeatModeler families of the same superfamily.

To give an example of where I found this occurring, I had a position in a genome (say chr1:16000-21000) which RepeatMasker annotated as "rnd-1_family_123#LINE/BovB" and which was significantly longer than the rnd-1_family_123#LINE/BovB consensus sequence. Looking at the .align file, it became clear that RepeatMasker's BLAST search had found several different LINE/BovB sequences (i.e. rnd-1_family_123#LINE/BovB, rnd-2_family_4#LINE/BovB and rnd-4_family_3#LINE/BovB) aligned to different portions of the reported coordinates. RepeatMasker's post-processing collapsed those overlapping coordinates into the coordinates in the .out file and either used one of the three sequences' divergences, or tried to average them (I tried reading through the RepeatMasker Perl code but couldn't figure out which it is).
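To make the calculation concrete, here is a minimal sketch (not EarlGrey's actual code) of the Kimura 2-parameter distance computed directly from the match, transition and transversion counts of an alignment:

```python
import math

def kimura2p(matches: int, transitions: int, transversions: int) -> float:
    """Kimura (1980) two-parameter distance from alignment counts.

    With P = proportion of transitions and Q = proportion of
    transversions among aligned sites:
        K = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    """
    n = matches + transitions + transversions
    p = transitions / n
    q = transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# e.g. 90 matches, 6 transitions, 4 transversions over 100 aligned sites
# gives K slightly above the raw 10 % mismatch (multiple-hit correction)
k = kimura2p(90, 6, 4)
```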

The second potential (but less likely) reason is that every time you run EarlGrey, RepeatModeler will use a different seed to determine which portions of the genome it uses in RepeatScout and RECON's de novo annotation. This leads to the repeat library output by RepeatModeler being different on every run. However, unless you're annotating a massive genome with incredible enrichment of different TEs in different regions, I wouldn't expect that to make much difference.
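As a toy illustration of that second point (a sketch, not RepeatModeler's actual sampling code): a different seed means a different subset of genome windows feeds the de novo step, so the resulting library can differ between runs, while rerunning with the same seed reproduces the same sample.

```python
import random

def sample_windows(genome_len: int, window: int, n: int, seed: int) -> list:
    """Toy stand-in for de novo genome sampling: draw n random window
    start positions; the seed decides which regions get seen."""
    rng = random.Random(seed)
    return sorted(rng.randrange(0, genome_len - window) for _ in range(n))

run1 = sample_windows(1_000_000, 40_000, 5, seed=1)
run2 = sample_windows(1_000_000, 40_000, 5, seed=2)
# Different seeds sample different regions, hence different libraries.
```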

I hope that helps.