Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

get Motif from ID and Name #197

Closed MarineBergot closed 1 year ago

MarineBergot commented 1 year ago

Hi,

I have a question: is there a way to get the Target motif from the ID/Name? I just downloaded a GFF3 file that was created with RepeatMasker and published by a team working on the new version of the Chlamydomonas genome (v6: https://www.biorxiv.org/content/10.1101/2022.06.16.496473v1.full). In the GFF I have:

chromosome_01 RepeatMasker similarity 2540266 2540351 30.1 + . ID=2545373.687;Name=(GGC)n;Target=(GGC)n 1 81;class=Simple_repeat
chromosome_01 RepeatMasker similarity 2541035 2541117 27.2 + . ID=2545373.688;Name=(AGC)n;Target=(AGC)n 1 83;class=Simple_repeat

but I need something like this to run TEFinder:

U_39 RepeatMasker similarity 13293 13473 23.7 - . Target "Motif:hAT210-short-Active" 248 428
U_39 RepeatMasker similarity 13616 13701 26.7 - . Target "Motif:hAT210-short-Active" 8 93
U_39 RepeatMasker similarity 13623 13732 25.7 - . Target "Motif:Fot1Active" 28 138

Is there a way to get from one to the other?

Thanks a lot!

rmhubley commented 1 year ago

GFF is an absurdly general format and there are many ways to interpret the meaning of individual terms. Unfortunately there isn't a way to alter the GFF format that RepeatMasker generates; however, a simple script could be written to translate it into whatever format you need. I am not familiar with what TEFinder needs in terms of the Target format, but I suspect that if you prefix all the Target names with "Motif:" and add quotes around them, TEFinder may be happy.
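As a rough illustration of the kind of script meant here, the following Python sketch rewrites the attribute column of each GFF3 line into the Target "Motif:NAME" START END form shown in the question. It assumes tab-separated columns (the pasted lines above show spaces only because of how they were copied) and that the Target attribute is "NAME START END". The function name to_motif_style is made up for this example, and whether TEFinder accepts the result has not been checked.

```python
def to_motif_style(line: str) -> str:
    """Rewrite a GFF3 attribute column as: Target "Motif:NAME" START END.

    Lines that are comments, malformed, or lack a Target attribute are
    returned unchanged (minus the trailing newline).
    """
    stripped = line.rstrip("\n")
    if not stripped or stripped.startswith("#"):
        return stripped
    fields = stripped.split("\t")
    if len(fields) != 9:
        return stripped
    # Parse "key=value;key=value" pairs from the ninth column.
    attrs = dict(
        item.split("=", 1) for item in fields[8].split(";") if "=" in item
    )
    parts = attrs.get("Target", "").split()
    if len(parts) < 3:
        return stripped
    name, start, end = parts[0], parts[1], parts[2]
    fields[8] = f'Target "Motif:{name}" {start} {end}'
    return "\t".join(fields)


# Demo on the first line from the question, re-joined with tabs.
example = "\t".join([
    "chromosome_01", "RepeatMasker", "similarity", "2540266", "2540351",
    "30.1", "+", ".",
    "ID=2545373.687;Name=(GGC)n;Target=(GGC)n 1 81;class=Simple_repeat",
])
print(to_motif_style(example))
```

Applying the function to every line of the file and writing the results out would give a converted copy while leaving the original GFF untouched.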

MarineBergot commented 1 year ago

Well yes, of course, changing the format is not my problem. But this file and its annotation were generated with RepeatMasker. I need the name of the transposon, otherwise the software can't run, so I was hoping you would have an idea of which database was used to generate this annotation, so that I could combine the two sources of information to rebuild my GFF.

rmhubley commented 1 year ago

Oh! You specifically want to know where to find the sequence for a given target? Sure, the two examples you gave above are not transposons but simple repeats generated by TRF. In such cases, rather than a consensus identifier, the TRF repeated unit is reported in parentheses and suffixed with an 'n'. For actual transposable elements this Target field would contain the consensus ID from the TE library used in the search. The identifier could be anything from a Repbase ID, Dfam ID, or any custom library ID provided to RepeatMasker. Let me know if that answers your question.
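To make the naming convention concrete, here is a small Python sketch that recognises Target names of the "(UNIT)n" form and returns the repeated unit. The regular expression and the function name simple_repeat_unit are assumptions made for this example, not part of RepeatMasker. In the GFF from the question, the class=Simple_repeat attribute gives the same information more directly.

```python
import re

# Matches names such as "(GGC)n": a run of letters in parentheses, then "n".
SIMPLE_REPEAT = re.compile(r"^\(([A-Za-z]+)\)n$")


def simple_repeat_unit(target_name):
    """Return the repeated unit for a "(UNIT)n" name, or None otherwise."""
    match = SIMPLE_REPEAT.match(target_name)
    return match.group(1) if match else None


for name in ["(GGC)n", "(AGC)n", "hAT210-short-Active"]:
    print(name, simple_repeat_unit(name))
```

Any name for which the function returns None would then be a candidate consensus ID to look up in whichever TE library was used.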

MarineBergot commented 1 year ago

Yeah, sorry, my question was probably not very clear ^^' Basically, I have a list of transposons (such as Bill (DQ446204.1), Gulliver (AF019750.1 and AF019751.1), MRC1 (DQ446210.1), Pioneer1 (U19367.1), etc.) and I would like to find them in the RepeatMasker GFF provided for the new version of the Chlamydomonas genome (which has no names inside, only this ID). So, with the name, NCBI ID, or sequence, could I query Repbase or Dfam to get the corresponding ID in the database, and then find them in my GFF? Thanks!

rmhubley commented 1 year ago

Can you point me to the GFF file in question? I don't know what they used as a TE library when they generated that file, so I really can't say what type of ID to expect (other than the Simple Repeat lines). If it's Dfam accession numbers, you can either look them up at the Dfam website here: https://www.dfam.org/browse, or translate them into names (if they have names) using the Dfam API:

# Example using curl and jq
% curl -s https://www.dfam.org/api/families/DF0000001 | jq '.name'
"MIR"
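
The same lookup can be sketched in Python with only the standard library. This assumes the endpoint and the "name" field behave as in the curl example above. The helper names fetch_json and dfam_family_name are made up for this example, and the fetch parameter exists only so the function can be exercised without network access.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
DFAM_FAMILIES_URL = "https://www.dfam.org/api/families/"


def fetch_json(url):
    """Download a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def dfam_family_name(accession, fetch=fetch_json):
    """Return the family name for a Dfam accession, or None if it has none."""
    record = fetch(DFAM_FAMILIES_URL + accession)
    return record.get("name")
```

Calling dfam_family_name("DF0000001") performs the same request as the curl command.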

MarineBergot commented 1 year ago

Well, apparently, according to the paper (https://www.biorxiv.org/content/10.1101/2022.06.16.496473v1.full), they are using Repbase: "TE sequence was identified in each assembly by providing the latest Chlamydomonas repeat library to RepeatMasker v4.0.9 (Smit et al. 2013-2015). This library features updated consensus models for all Chlamydomonas repeats available in Repbase (https://www.girinst.org/repbase/)". It will be hard for me to point you to the GFF because you need to register on the website to access it, but I can send it to you if needed. If I understand correctly, you need to pay to get access to Repbase?

rmhubley commented 1 year ago

That is helpful. So, if they are using Repbase all the Target fields in the GFF (for TE annotations) will refer to Repbase database entries. Repbase is not an open database so you will need to contact GIRI to obtain access to it -- wish it were not so but there is not much I can do about that.
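
Before getting access to Repbase, it may help to see which Target names actually occur in the file. The Python sketch below tallies them while skipping the Simple_repeat lines. It assumes tab-separated GFF3 with the Target and class attributes seen in the question, and the function name target_counts is made up for this example. Whether the library uses names such as "Gulliver" or "MRC1" as its IDs is not known from this thread.

```python
from collections import Counter


def target_counts(gff_lines):
    """Count annotations per Target name, ignoring simple repeats."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        # Parse "key=value;key=value" pairs from the ninth column.
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if attrs.get("class") == "Simple_repeat":
            continue
        target = attrs.get("Target", "").split()
        if target:
            counts[target[0]] += 1
    return counts
```

Passing an open file handle, for example target_counts(open(path)) with path pointing at the GFF, returns a Counter whose keys can be compared against a list of transposon names of interest.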