Closed by davmlaw 2 weeks ago
Apparently I did this before in #14 - There is a comment - "Switched to using GTFs as they contain protein version"
I wrote the GTF code to look for "protein_id" and GFF3 to look for GenBank
However, we've since switched entirely back to GFF files, including for Ensembl, which doesn't have GenBank
It looks like if we switched to using protein_id in all GFFs it would work, as they are the same across RefSeq and Ensembl.
I looked in RefSeq files, and they are good. So this is an Ensembl only problem, will change title
ok, so we switched to GTF because Ensembl GFFs don't have protein versions
But I then switched to GFF3 in a commit involving tags (eg MANE_Select) that are in Ensembl 108+
I looked at the spec for GFF3 and CDS records are supposed to include the protein version, but none of them do. I have emailed Ensembl. They said they will fix the README and try to get full versions into the next release
I will look at whether the files have what I need
| Annotation Consortium | File Type | tag | protein_id | protein_version |
|---|---|---|---|---|
| RefSeq | GFF3 | ? | NP_001659.1 | - |
| RefSeq | GTF | ? | NP_002294.2 | - |
| Ensembl | GFF3 | basic,Ensembl_canonical,MANE_Select | ENSP00000400379 | - |
| Ensembl | GTF | tag=basic tag=Ensembl tag=MANE_Select | ENSP00000477624 | 1 |
So for protein IDs - looks like we can use GTF or GFF for RefSeq, but need to use GTF for Ensembl
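To make the table concrete, here is a minimal sketch of how a versioned protein accession could be assembled from already-parsed attributes. The function name `versioned_protein_id` and the plain-dict input are assumptions for illustration, not code from this repo. It assumes RefSeq puts the version inside `protein_id` while Ensembl GTF carries it separately in `protein_version`, as in the rows above.

```python
def versioned_protein_id(attributes):
    """Return 'accession.version' where possible, else whatever protein_id holds.

    `attributes` is a hypothetical dict of already-parsed GTF/GFF3 attributes
    with one string value per key.
    """
    protein_id = attributes.get("protein_id")
    if not protein_id:
        return None  # record has no protein (e.g. non-coding transcript)

    version = attributes.get("protein_version")
    # RefSeq IDs already embed the version (NP_001659.1); only append the
    # separate version when the ID itself does not contain one.
    if version and "." not in protein_id:
        return f"{protein_id}.{version}"
    return protein_id


# Values taken from the table above
refseq = versioned_protein_id({"protein_id": "NP_001659.1"})
ensembl_gtf = versioned_protein_id(
    {"protein_id": "ENSP00000477624", "protein_version": "1"}
)
ensembl_gff3 = versioned_protein_id({"protein_id": "ENSP00000400379"})
```

With the Ensembl GFF3 row there is nothing to append, so the result stays unversioned, which is the gap described above.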
The trouble is, HTSeq.GFF_Reader doesn't handle the multiple tags, as it reads into a dict and only returns the last one
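For illustration only (this is not necessarily the fix that was used here, and it does not use HTSeq), one way to avoid losing repeated keys is to parse the GTF attribute column into lists of values. The function name `parse_gtf_attributes` and the sample `gene_id` are made up for the example. The regex is a simplification that assumes keys are word characters and values are either double-quoted or a bare token.

```python
import re
from collections import defaultdict

# key, then either a "quoted value" or a bare token, then an optional ';'
_ATTR_RE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^\s;]+))\s*;?')


def parse_gtf_attributes(attr_string):
    """Parse GTF column 9 into {key: [values]} so repeated keys are kept."""
    attrs = defaultdict(list)
    for match in _ATTR_RE.finditer(attr_string):
        key = match.group(1)
        # group(2) is the quoted form, group(3) the unquoted form
        value = match.group(2) if match.group(2) is not None else match.group(3)
        attrs[key].append(value)
    return dict(attrs)


# Hypothetical attribute string in Ensembl GTF style
example = (
    'gene_id "GENE1"; protein_id "ENSP00000477624"; protein_version "1"; '
    'tag "basic"; tag "MANE_Select";'
)
attrs = parse_gtf_attributes(example)
tags = attrs["tag"]
```

Every value comes back as a list, so single-valued keys such as `protein_id` need `[0]` (or a small helper) when read.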
ok, so the trick was:
As discovered in #82 - c_to_p doesn't currently work
Once complete, go back to the linked issue and notify user / close