ID problems in taxonomy and locations files

acontrerasg commented 4 years ago

Hello,

I'm a little confused about how to make the taxonomy file and the locations GFF file. I tried to build them from scratch from the RepeatMasker .out file, but the tool complains that it can't find a TE ID:

TE ID: ID=35_44 not found in IDs from GFF: ~/locations.gff3 please make sure each ID in: ~/taxonomy.tsv is found in: ~/locations.gff3

Despite being in both files:

locations.gff3:

Chr1 RepeatModeler RC/Helitron 66707 66797 . . . ID=35_44

and taxonomy.tsv:

ID=35_44 rnd-6_family-6781#RC/Helitron

Any suggestions on how I need to make those two files?

Thanks in advance!

pbasting commented 4 years ago

Hi @acontrerasg, There are examples of each file in the test (https://github.com/bergmanlab/mcclintock/tree/master/test) directory:

Essentially, the locations GFF needs to have a unique ID= attribute for each TE. The taxonomy TSV needs to have two tab-separated columns: the first contains the unique ID from the GFF ID= attribute (without the ID= prefix), and the second contains the family of that TE, which must match the name of one of the consensus sequences in the consensus fasta.

for example if your locations gff looks like:

chrI    reannotate  transposable_element    160239  166163  .   -   .   ID=chrI_u1-_TY1
chrI    reannotate  transposable_element    182614  182953  .   +   .   ID=chrI_s4+_TY3

your taxonomy tsv should look like:

chrI_u1-_TY1    TY1
chrI_s4+_TY3    TY3

and you should have a consensus sequence for each of the TY1 and TY3 families in your consensus sequence fasta (-c):

>TY1
tgttggaataaaaatccactatcgt...
>TY3
tgttgtatctcaaaatgagatatgt...

For your example, it looks like you need to remove the ID= from the first column of your TSV:

35_44 rnd-6_family-6781#RC/Helitron

It's worth noting that these files (locations gff and taxonomy TSV) are optional and mcclintock will attempt to generate them using your consensus sequences and repeatmasker if they are not provided.
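The consistency rules described above can be checked before running the tool. The following is only a minimal sketch, not part of McClintock: the function names (gff_ids, check_inputs) are hypothetical, and it assumes a tab-separated GFF with attributes in column 9 and a two-column tab-separated taxonomy file.

```python
def gff_ids(gff_text):
    """Collect the values of the ID= attribute from column 9 of each GFF row."""
    ids = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF feature row
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID":
                ids.add(value)
    return ids


def check_inputs(gff_text, taxonomy_text, consensus_names):
    """Return a list of human-readable problems (hypothetical helper)."""
    ids = gff_ids(gff_text)
    problems = []
    for line in taxonomy_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) != 2:
            problems.append(f"expected 2 tab-separated columns: {line!r}")
            continue
        te_id, family = cols
        if te_id not in ids:
            hint = ""
            # The mistake from this thread: the ID= prefix copied into the TSV.
            if te_id.startswith("ID=") and te_id[3:] in ids:
                hint = " (remove the 'ID=' prefix)"
            problems.append(f"TE ID {te_id!r} not found in GFF{hint}")
        if family not in consensus_names:
            problems.append(f"family {family!r} has no consensus sequence")
    return problems


if __name__ == "__main__":
    gff = "Chr1\tRepeatModeler\tRC/Helitron\t66707\t66797\t.\t.\t.\tID=35_44\n"
    taxonomy = "ID=35_44\trnd-6_family-6781#RC/Helitron\n"
    names = {"rnd-6_family-6781#RC/Helitron"}
    for problem in check_inputs(gff, taxonomy, names):
        print(problem)
```

Run against the files from the original post, a check like this would flag the stray ID= prefix in the first taxonomy column.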

acontrerasg commented 4 years ago

Thanks for the quick and detailed answer @pbasting!

I made those changes and now it runs smoothly. Somehow I missed the file format examples when I read the readme. My bad!

Best regards!

cbergman commented 4 years ago

Thanks both for resolving this issue so quickly. @pbasting I think this reveals that we should do a better job of explaining the structure of the taxonomy file & the fact that it is generated automatically if not provided. Maybe we need a McClintock input section in the readme? We could also update the readme to link directly to the example taxonomy file (rather than just refer to it in text). Similarly, we may want to add some info about the format/autogeneration of the taxonomy file in the -t TAXONOMY option of the help menu. @acontrerasg: any other thoughts on what might make this more clear are much appreciated!

acontrerasg commented 4 years ago

Dear @cbergman, I think overall it is quite clear, but a distinction between required and optional flags would indeed be great. Also, a small snippet (2 or 3 lines showing the format of both the taxonomy and locations files, like what pbasting showed me) in the Running McClintock section below the -methods explanation would be really useful to make it absolutely clear what one has to do.

Thanks for the tool!

Song-10-YF commented 1 year ago

Hello @cbergman Can I provide TE annotation files and TE classification files in complex formats? My previous genomic TE gff files annotated by RepeatModeler+Repbase+DeepTE are very complex. For example, there are several of them with the same name LTR/Copia and the sequences in the Lib are different, so it is not possible to use one LTR/Copia as a classification criterion, and no other information is provided in the naming. How do I provide canonical gff and tsv in this case?

cbergman commented 1 year ago

Hi @Song-10-YF

Thanks for your query. No, you can only supply the TE annotation and classification files in the specified formats documented above: https://github.com/bergmanlab/mcclintock/issues/64#issuecomment-668906625.

From what you say, it might be helpful for you to understand that the taxonomy file needed for McClintock is not a full taxonomic classification of all TEs in your library. Rather, it is a labeling of the annotated TE instances in the reference genome that matches each instance to its TE consensus sequence in the library. Please see my response to your other recent query here for information about how to run McClintock using the outputs of RepeatModeler: https://github.com/bergmanlab/mcclintock/issues/117#issuecomment-1576764054.
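To illustrate the instance-to-consensus mapping, here is a rough sketch that turns a list of annotated instances into GFF and taxonomy rows. It is an illustration under assumptions, not McClintock's own procedure: build_inputs, the "custom" source label, the TE_n ID scheme and the example consensus names are all hypothetical. The point is that the second taxonomy column holds the name of the specific consensus sequence, not a shared class such as LTR/Copia.

```python
def build_inputs(instances):
    """Build GFF and taxonomy rows from annotated TE instances.

    instances: iterable of (chrom, start, end, strand, consensus_name),
    where consensus_name is the name of a sequence in the consensus fasta.
    """
    gff_lines = []
    taxonomy_lines = []
    for number, (chrom, start, end, strand, consensus) in enumerate(instances, 1):
        te_id = f"TE_{number}"  # hypothetical scheme; any unique ID works
        gff_lines.append("\t".join([
            chrom, "custom", "transposable_element",
            str(start), str(end), ".", strand, ".", f"ID={te_id}",
        ]))
        # Column 1: the ID without the ID= prefix. Column 2: the consensus name.
        taxonomy_lines.append(f"{te_id}\t{consensus}")
    return gff_lines, taxonomy_lines


if __name__ == "__main__":
    # Two instances whose consensus sequences share a class but are distinct
    # library entries (made-up names for illustration).
    example = [
        ("Chr1", 1000, 5200, "+", "copia_family_A"),
        ("Chr2", 300, 4100, "-", "copia_family_B"),
    ]
    gff, taxonomy = build_inputs(example)
    print("\n".join(gff))
    print("\n".join(taxonomy))
```

With this layout, two library sequences that share a class label remain distinguishable as long as each has its own name in the consensus fasta.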

I hope this helps, Casey
