TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

GFF attributes and wrong TE classification?? #99

Closed DhakadPankaj closed 6 months ago

DhakadPankaj commented 6 months ago

Hi, I've run Earl Grey on >100 Drosophila genomes with default parameters. I'm masking annotated protein-coding genes that lie in known repeat-rich regions, using the filtered repeat GFF/BED files produced by Earl Grey. But I'm not sure what some of the GFF attributes mean, such as TSTART/TEND (are these the actual boundaries of the repeat?). Also, I found that many of the LINE repeats identified overlap with exons of AGO2 genes. AGO2 is an essential gene in RNAi pathways, and it doesn't look like a LINE (please find attached files containing the Ago2 exons and repeat overlaps).
Ago2exon_repeat.zip

Thanks, Pankaj

TobyBaril commented 6 months ago

Many factors can affect the classification of TEs by RepeatClassifier (which classifies consensus sequences and is part of RepeatModeler2). RepeatClassifier uses the repeat database that RepeatMasker has been configured with to help determine the likely class of each consensus. In some cases, a short region of homology within a consensus sequence scores high enough for the whole consensus to be given a classification.

As always, automated tools are meant to provide a starting point for genomics studies. Repeat annotation is inherently challenging, and no single method will give "perfect" results. Because of this, some level of manual curation is always needed for robust TE classification and annotation. In the case of the AGO2 genes, there is likely a short tract of homology to something in your TE database that leads RepeatClassifier to give a consensus a classification. Also, if these sequences are multicopy, they will be picked up by RepeatModeler, but manual curation will help to remove them.
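As a rough aid for deciding which consensus sequences to inspect first during manual curation, the sketch below lists the repeat families whose annotations overlap a set of exons. It is only an illustration: the function name, the tuple layout, and the example coordinates and family names are all hypothetical, and it assumes 1-based inclusive coordinates as in GFF.

```python
from collections import defaultdict


def overlapping_families(repeats, exons):
    """Return {family: [exon_name, ...]} for repeats overlapping any exon.

    repeats: iterable of (chrom, start, end, family), 1-based inclusive
    exons:   iterable of (chrom, start, end, exon_name), 1-based inclusive
    """
    hits = defaultdict(list)
    for r_chrom, r_start, r_end, family in repeats:
        for e_chrom, e_start, e_end, exon in exons:
            # Two closed intervals overlap when each starts before the other ends.
            if r_chrom == e_chrom and r_start <= e_end and e_start <= r_end:
                hits[family].append(exon)
    return dict(hits)


# Hypothetical coordinates and names, for illustration only.
repeats = [
    ("chr3L", 1000, 1500, "rnd-1_family-12"),
    ("chr3L", 5000, 5200, "rnd-1_family-40"),
]
exons = [("chr3L", 1400, 1600, "AGO2_exon1")]

print(overlapping_families(repeats, exons))
```

The nested loop is fine for a handful of genes; for whole-genome comparisons an interval tool such as bedtools would be a more practical choice.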

There are also pre-existing libraries of good quality depending on which species of Drosophila you are working on, so I would recommend looking into some of these as well.

Regarding the attributes, these are used by RepeatCraft to help resolve and defragment annotations; they are simply a way of storing that information for parsing. The boundaries of each TE are, as usual, given in the chr, start, and end columns of the GFF and BED files.
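To make the distinction concrete, here is a minimal sketch of reading one GFF line: the genomic boundaries come from the standard columns, while the ninth column is parsed into a dictionary and kept separately. It assumes the attributes are written as `KEY=value` pairs separated by semicolons; the example line, including its source field, coordinates, and attribute values, is hypothetical and not taken from real Earl Grey output.

```python
def parse_gff_line(line):
    """Split one tab-delimited GFF line into coordinates and attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")

    # Assumption: attributes look like KEY=value;KEY=value
    attributes = {}
    for field in cols[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()

    return {
        "chrom": cols[0],
        "type": cols[2],
        "start": int(cols[3]),  # genomic start of the annotation (1-based)
        "end": int(cols[4]),    # genomic end of the annotation (inclusive)
        "strand": cols[6],
        "attributes": attributes,
    }


# Hypothetical line, for illustration only.
example = (
    "chr2R\tEarl_Grey\tLINE/I\t10500\t11250\t3200\t+\t.\t"
    "TSTART=1;TEND=750;ID=rnd-1_family-7"
)

record = parse_gff_line(example)
print(record["chrom"], record["start"], record["end"])
print(record["attributes"])
```

For masking, only `chrom`, `start`, and `end` are needed; the attribute dictionary can be ignored or carried along for reference.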