SACGF / cdot

Transcript versions for HGVS libraries
MIT License

Ensembl files missing protein - breaking c_to_p #83

Closed: davmlaw closed this issue 2 weeks ago

davmlaw commented 3 weeks ago

As discovered in #82 - c_to_p doesn't currently work

Once complete, go back to the linked issue and notify user / close

davmlaw commented 3 weeks ago

Apparently I did this before in #14. There is a comment: "Switched to using GTFs as they contain protein version"

I wrote the GTF code to look for "protein_id" and GFF3 to look for GenBank

However, we've switched entirely back to gff files, including for Ensembl which doesn't have GenBank

It looks like it would work if we switched to using protein_id in all GFFs, as the attribute is the same across RefSeq and Ensembl.

I looked in RefSeq files, and they are good. So this is an Ensembl only problem, will change title

davmlaw commented 3 weeks ago

ok, so we switched to GTF because Ensembl GFFs don't have protein versions

But I then switched to GFF3 in a commit involving tags (eg MANE_Select) that are in Ensembl 108+

I looked at the spec for GFF3 and CDS is supposed to have the version in it, but none of them do. I have emailed Ensembl, and they said they will fix the README and try to get full versions in the next release

I will look at whether the files have what I need

| Annotation Consortium | File Type | tag | protein_id | protein_version |
|---|---|---|---|---|
| RefSeq | GFF3 | ? | NP_001659.1 | - |
| RefSeq | GTF | ? | NP_002294.2 | - |
| Ensembl | GFF3 | basic,Ensembl_canonical,MANE_Select | ENSP00000400379 | - |
| Ensembl | GTF | tag=basic tag=Ensembl tag=MANE_Select | ENSP00000477624 | 1 |

So for protein IDs - looks like we can use GTF or GFF for RefSeq, but need to use GTF for Ensembl
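To make the table concrete, here is a minimal sketch of how a versioned protein accession could be assembled from already-parsed attributes. The function name `versioned_protein_id` and the plain-dict input are assumptions for illustration only, not cdot's actual code.

```python
def versioned_protein_id(attributes):
    """Return protein accession with version where the file provides one.

    `attributes` is a hypothetical dict of parsed GTF/GFF3 attributes.
    """
    protein_id = attributes.get("protein_id")
    if protein_id is None:
        return None
    version = attributes.get("protein_version")
    # RefSeq IDs (e.g. NP_002294.2) already carry the version, so only
    # append protein_version when the ID has no ".N" suffix (Ensembl GTF).
    if version and "." not in protein_id:
        return f"{protein_id}.{version}"
    return protein_id


# Rows from the table above
ensembl_gtf = {"protein_id": "ENSP00000477624", "protein_version": "1"}
refseq_gtf = {"protein_id": "NP_002294.2"}
ensembl_gff3 = {"protein_id": "ENSP00000400379"}

print(versioned_protein_id(ensembl_gtf))
print(versioned_protein_id(refseq_gtf))
print(versioned_protein_id(ensembl_gff3))  # no version available in GFF3
```

The Ensembl GFF3 row comes back unversioned, which is the gap described in this issue.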

The trouble is, HTSeq.GFF_Reader doesn't handle the multiple tag attributes on an Ensembl GTF line, as it reads the attributes into a dict and only returns the last one
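For illustration, a rough sketch of parsing a GTF attribute column so that repeated keys such as tag are all kept. This is a standalone helper written for this example (the name `parse_gtf_attributes` and the sample attribute string are assumptions); it is not HTSeq's API and not necessarily how the problem ended up being solved here.

```python
import re
from collections import defaultdict

# One GTF attribute: key "quoted value"; or key bare_value;
_GTF_ATTR = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_gtf_attributes(attr_column):
    """Parse GTF column 9 into {key: [values]}, keeping repeated keys."""
    attrs = defaultdict(list)
    for match in _GTF_ATTR.finditer(attr_column):
        key = match.group(1)
        # group(2) is the quoted value, group(3) the unquoted one
        value = match.group(2) if match.group(2) is not None else match.group(3)
        attrs[key].append(value)
    return dict(attrs)


# Illustrative attribute string built from the values in the table above
example = (
    'protein_id "ENSP00000477624"; protein_version "1"; '
    'tag "basic"; tag "Ensembl_canonical"; tag "MANE_Select";'
)
parsed = parse_gtf_attributes(example)
print(parsed["tag"])         # all three tags, in file order
print(parsed["protein_id"])  # single-valued keys are one-element lists
```

Storing every value as a list costs an extra `[0]` for single-valued keys, but it avoids silently dropping all but the last tag.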

davmlaw commented 2 weeks ago

ok, so the trick was: