TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Filtering overlapping repeats for chimers #95

Closed: pellescholten closed this issue 5 months ago

pellescholten commented 5 months ago

Hi!

I was looking for a way to filter out overlapping sequences of my RepeatCraft output and tried your filteringOverlappingRepeats.R script.

However, it seems to have an issue with chimeric or nested repeats. In these cases, either the overlap is not resolved, or the nested repeat ends up with a start coordinate that is greater than its end coordinate.

For example, an LTR nested in a TIR appears in the rmerge file as:

contig_1000 RepeatMasker    CLASSII/TIR 9374    9777    12.2    +   .   Tstart=48;Tend=405;ID=EDTA_TE_00001334_inc;shortTE=T
contig_1000 RepeatMasker    CLASSI/LTR  9514    9612    25.2    +   .   Tstart=5136;Tend=5358;ID=RM2_rnd-5_family-4_unconfirmed;shortTE=T

After running the script, the same two repeats come out as:

contig_1000 RepeatMasker    CLASSII/TIR 9444    9645    12.2    +   NA  Tstart=48;Tend=405;ID=EDTA_TE_00001334_inc;shortTE=T
contig_1000 RepeatMasker    CLASSI/LTR  9646    9612    25.2    +   NA  Tstart=5136;Tend=5358;ID=RM2_rnd-5_family-4_unconfirmed;shortTE=T
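
The problem in the rows above can be checked mechanically. The following is a minimal Python sketch, not part of Earl Grey or RepeatCraft, that flags records whose start is greater than their end and lists pairs that still overlap. It assumes whitespace-separated GFF-like columns (sequence, source, type, start, end, ...) and reads the first two rows as the input and the last two as the script's output, which is an interpretation of the example. The names `Repeat`, `parse_row`, `inverted` and `overlapping_pairs` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Repeat(NamedTuple):
    seqid: str
    family: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_row(line: str) -> Repeat:
    """Parse one whitespace-separated GFF-like row (assumed column order)."""
    fields = line.split()
    return Repeat(fields[0], fields[2], int(fields[3]), int(fields[4]))


def inverted(repeats: List[Repeat]) -> List[Repeat]:
    """Return records whose start coordinate lies after their end coordinate."""
    return [r for r in repeats if r.start > r.end]


def overlapping_pairs(repeats: List[Repeat]) -> List[Tuple[Repeat, Repeat]]:
    """Return pairs on the same sequence whose coordinate ranges overlap."""
    ordered = sorted(repeats, key=lambda r: (r.seqid, r.start, r.end))
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Records are sorted by start, so no later record can overlap a.
            if b.seqid != a.seqid or b.start > a.end:
                break
            pairs.append((a, b))
    return pairs


# Rows copied from the example above (attributes column kept as-is).
INPUT_ROWS = [
    "contig_1000 RepeatMasker CLASSII/TIR 9374 9777 12.2 + . Tstart=48;Tend=405;ID=EDTA_TE_00001334_inc;shortTE=T",
    "contig_1000 RepeatMasker CLASSI/LTR 9514 9612 25.2 + . Tstart=5136;Tend=5358;ID=RM2_rnd-5_family-4_unconfirmed;shortTE=T",
]
OUTPUT_ROWS = [
    "contig_1000 RepeatMasker CLASSII/TIR 9444 9645 12.2 + NA Tstart=48;Tend=405;ID=EDTA_TE_00001334_inc;shortTE=T",
    "contig_1000 RepeatMasker CLASSI/LTR 9646 9612 25.2 + NA Tstart=5136;Tend=5358;ID=RM2_rnd-5_family-4_unconfirmed;shortTE=T",
]

before = [parse_row(row) for row in INPUT_ROWS]
after = [parse_row(row) for row in OUTPUT_ROWS]

print("overlaps before:", overlapping_pairs(before))
print("inverted after:", inverted(after))
```

On these rows the check reports one overlapping pair in the input (the LTR inside the TIR) and one inverted record in the output (the LTR with start 9646 and end 9612).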

I am not sure what the easiest way to solve this in the current code would be, as you would need to update the two repeats at the same time...
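
To illustrate what updating both repeats at the same time could look like, here is a minimal Python sketch of one possible approach. It is not the fix that went into Earl Grey, and the function name `resolve_pair` and the tuple layout are hypothetical. When one repeat is fully nested in another, the outer repeat is split into the pieces on either side of the inner one. When two repeats only partially overlap, the later one is trimmed to begin after the earlier one ends. Neither case can produce a start greater than the end, because empty pieces are dropped.

```python
from typing import List, Tuple

# (start, end, label), with 1-based inclusive coordinates (assumed convention).
Interval = Tuple[int, int, str]


def resolve_pair(first: Interval, second: Interval) -> List[Interval]:
    """Resolve the overlap between two repeats in a single step.

    Returns non-overlapping pieces sorted by start coordinate.
    """
    # Make sure `first` is the repeat that starts earliest.
    if second[0] < first[0]:
        first, second = second, first
    (s1, e1, n1), (s2, e2, n2) = first, second

    if s2 > e1:
        # No overlap: nothing to change.
        return [first, second]

    if e2 <= e1:
        # Nested: keep the inner repeat whole, split the outer one around it.
        pieces = [(s1, s2 - 1, n1), (s2, e2, n2), (e2 + 1, e1, n1)]
    else:
        # Partial overlap: trim the later repeat to start after the earlier one.
        pieces = [(s1, e1, n1), (e1 + 1, e2, n2)]

    # Drop any piece that became empty (start after end).
    return [p for p in pieces if p[0] <= p[1]]


# Coordinates taken from the rmerge rows in the example.
tir = (9374, 9777, "CLASSII/TIR")
ltr = (9514, 9612, "CLASSI/LTR")

for piece in resolve_pair(tir, ltr):
    print(piece)
```

Splitting the outer element is a design choice. A pipeline might instead prefer to keep the longer or higher-scoring repeat intact and discard the nested one, depending on how the downstream counts are used.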

Cheers

TobyBaril commented 5 months ago

Thanks for highlighting this! I've added this to my development plan for the next patch and will get to work on it!

TobyBaril commented 5 months ago

Update: I have fixed this in a branch that will be tested shortly. I'll update again once the correct behaviour has been observed and the changes have been merged.

TobyBaril commented 5 months ago

Final testing is ongoing, but the expected behaviour is observed. You can get version 4.2.0 from https://anaconda.org/toby_baril_bio/earlgrey - bioconda is having space issues in testing at the moment!