6th column of GFF output

lukesarre commented 1 year ago

Hi all, This is a naive question, as I'm not personally a user of EarlGrey, but am interpreting output produced by a collaborator. What is the meaning of the 'score' column of the GFF output? Is it Kimura distance? I apologise if this information was included in the documentation but I missed it. Thank you in advance, Luke

TobyBaril commented 1 year ago

Hi Luke,

Apologies, this isn't currently in the documentation! The "score" column shows the score of the annotation from the final RepeatMasker step, so it is essentially the score of the match between the consensus sequence and the locus, as computed by the altered BLAST algorithm employed by RepeatMasker (rmblastn).
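As an illustration, here is a minimal Python sketch for pulling that 6th column out of a GFF line. It assumes the standard nine tab-separated GFF columns; the example line, its source label, and its values are hypothetical and not taken from real Earl Grey output.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # column 6; "." means no score was recorded
    strand: str
    attributes: str


def parse_gff_line(line: str) -> GffRecord:
    """Split one GFF line into fields and convert the score column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    score = None if fields[5] == "." else float(fields[5])
    return GffRecord(
        seqid=fields[0],
        source=fields[1],
        feature=fields[2],
        start=int(fields[3]),
        end=int(fields[4]),
        score=score,
        strand=fields[6],
        attributes=fields[8],
    )


# Hypothetical example line, for illustration only.
example = "scaffold_1\tEarl_Grey\tLINE/L2\t1001\t1850\t2310\t+\t.\tID=hypothetical_family_1"
record = parse_gff_line(example)
print(record.seqid, record.start, record.end, record.score)
```

Note that this value is an alignment score, so it grows with the length of the match as well as with its quality; it is not a divergence measure in itself.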

Hopefully this is what you were looking for, but feel free to reach out if you have any other questions!

Best wishes,

Toby

lukesarre commented 1 year ago

Hi Toby,

That did indeed completely answer my question!

What I'm trying to achieve is a measure of divergence for each TE from the consensus. It sounds like the "score" column is pretty close to this, but it's very difficult to find good documentation on where the nitty gritty of this score comes from, so I'm wary of using it.

As an optional output of RepeatMasker, you can get alignment files which include a Kimura distance for each TE from its family's consensus. Is it possible to get this with EarlGrey? Or to access the Kimura distances it holds?

Thank you for your help and best wishes, Luke

lukesarre commented 1 year ago

Hi Toby,

I now understand that you can access the .divsum file which is produced by the final RepeatMasker run, and this is then used for producing the repeat landscape plots.

Because there is an additional RepeatCraft step before the final annotation, I would expect that the values in the .divsum file cannot be attributed to specific repeats in the final annotation. Is that right?

Would it be possible to, for example, run the calcDivergenceFromAlign.pl on the repeatcraft output?

For context, we are relating divergence to other genome characteristics, so it is useful to have a divergence score for each repeat in the final annotation.

Luke

TobyBaril commented 1 year ago

Hi Luke,

Thanks for your email. Apologies for the delay in responding, I am shortly starting a new role and am in the process of emigrating!

The alignment values are calculated following the final RepeatMasker run. To generate the repeat landscapes, calcDivergenceFromAlign.pl calculates average divergences for each individual RepeatMasker family from the individual annotations in the align file. This makes it particularly useful for getting a broad overview of the TE activity landscape in a genome, with the caveat that it is built entirely from averages and ignores the processes that might lead to the presence of certain fragments of a TE (e.g. LINEs without a 5’ end are likely to arise from different genomic processes to LINEs without a 3’ end, due to their transcription via RT). In the broad overview case, this is okay as a high-level proxy for TE activity timing.

As this is all based on averages, and the inner workings of RepeatMasker will only consider positions with an identity, the age estimates pre- and post-RepeatCraft show very little, if any, variation. The underlying method ignores the “gaps” between two TE fragments that have been merged, and calculates and then averages the age of the fragments anyway (this is one of the many weird things that RepeatMasker does but doesn’t explicitly state).
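To make the caveat about averages concrete, here is a toy Python sketch. It is not the logic of calcDivergenceFromAlign.pl; the length-weighted mean, the family names, and all the numbers are assumptions chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical per-copy records: (family, divergence in percent, aligned bp).
copies = [
    ("L2_fam", 2.0, 1000),   # a young copy
    ("L2_fam", 30.0, 1000),  # an old copy of the same family
    ("DNA_fam", 10.0, 500),
]


def family_means(records):
    """Length-weighted mean divergence per family (an assumed weighting)."""
    totals = defaultdict(lambda: [0.0, 0])  # family -> [sum(div * bp), sum(bp)]
    for family, divergence, aligned_bp in records:
        totals[family][0] += divergence * aligned_bp
        totals[family][1] += aligned_bp
    return {family: weighted / bp for family, (weighted, bp) in totals.items()}


means = family_means(copies)
for family, mean in means.items():
    print(family, round(mean, 2))
```

In this toy data the two L2_fam copies have very different divergences, but the family-level summary collapses them into a single intermediate value, which is why family averages cannot stand in for per-copy divergence.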

However, in your case it is more important to get the divergence of individual repeats, and I’m assuming that these are going to be considered individually? To do this, the important values will either be in the RepeatMasker .align file (which we have no control over, as it is generated as part of the RepeatMasker run rather than in post-processing), or will need to be generated on an individual TE-by-TE basis. Dr. James Galbraith (based at Exeter Campus in Penryn, Cornwall) has done some of this before: the TE sequences are extracted using BED coordinates, aligned to their respective consensus sequences, and Jukes-Cantor scores are calculated for each individual TE.
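As a rough sketch of the last step only, the Python below computes a Jukes-Cantor (JC69) distance, d = -(3/4) ln(1 - 4p/3), from a pairwise alignment of one TE copy against its consensus. It is not Dr. Galbraith's pipeline: the function name, the choice to skip gap and ambiguous columns, and the example sequences are all assumptions, and the extraction and alignment steps are taken as already done.

```python
import math


def jukes_cantor_distance(copy_aln: str, consensus_aln: str) -> float:
    """JC69 distance between two aligned sequences of equal length.

    Columns containing a gap or a non-ACGT character in either sequence
    are skipped (an assumption; other treatments of gaps are possible).
    """
    if len(copy_aln) != len(consensus_aln):
        raise ValueError("aligned sequences must have the same length")
    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for a, b in zip(copy_aln.upper(), consensus_aln.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    p = mismatches / compared  # observed proportion of differing sites
    if p >= 0.75:
        return math.inf  # saturated: the JC69 correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned pair: 21 columns, one gap, two mismatches.
consensus = "ACGTACGTACGTACGTACGTA"
te_copy = "ACGTACCTACG-ACGTACGAA"
distance = jukes_cantor_distance(te_copy, consensus)
print(round(distance, 4))
```

Running this once per extracted copy would give one divergence value per repeat in the final annotation, which can then be joined back to the GFF by coordinates.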

We haven’t implemented this into Earl Grey as of yet as our main aim is to facilitate high-level overviews in a simple manner for non-specialists, to hopefully improve the quality of TE sequences that find themselves in TE reference databases (which contain many non-TE sequences that are deposited with no checking). For the more detailed analyses, processing following running Earl Grey is likely to be the best course of action. Regarding accurate calculation of divergence for each individual TE, rather than average for each family, this is something that is on our development list at the moment, and should be added at some point in the future.

Hopefully this has helped, and please don’t hesitate to get in contact again if I can be of any more assistance.

Best wishes,

Toby

TobyBaril / EarlGrey

6th column of GFF output #33