C. elegans PAINT annotations have both WBGene IDs and UniProtKB accessions in column 2 of the GAF

geneontology / paint

This curation tool allows curators to make precise assertions as to when functions were gained and lost during evolution and record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions.

C. elegans PAINT annotations have both WBGene IDs and UniProtKB accessions in column 2 of the GAF #56

Open vanaukenk opened 4 years ago

vanaukenk commented 4 years ago

Hi,

In trying to sort out and correct the 'type' entry in Column 12 of the WB GAF, I was also looking into why we have a small percentage of entries in the file that have lines with 'type' protein (most of our entries are 'type' gene which we'll need to fix).

It looks like the protein entries are cases where there's an IBA annotation associated with a UniProtKB accession rather than a WB gene.

Here are two examples of the different types of PAINT entries:

WB WBGene00001856 hil-5 GO:0045910 PMID:21873635 IBA PANTHER:PTN000157378|SGD:S000006048 P B0414.3|H1.5 gene taxon:6239 20180503 GO_Central

I can see xrefs between the lgc-29 WBGene ID and the UniProtKB accession in both WB and UniProt, but am not sure how the PAINT pipeline maps UniProtKB accessions to WBGene IDs, so am not sure what to fix and where.
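One way to see how many rows are affected is to tally the GAF by DB (column 1) and DB Object Type (column 12). The following is a minimal sketch, not part of any existing pipeline: `tally_gaf` is a hypothetical helper, the first sample row is the hil-5 line above rebuilt with tabs, and the second row is a made-up UniProtKB row with placeholder GO and PANTHER IDs, used only to show the counting.

```python
from collections import Counter

def tally_gaf(lines):
    """Count (DB, DB Object Type) pairs, i.e. GAF columns 1 and 12."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and GAF header/comment lines, which start with '!'.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a full annotation row
        counts[(cols[0], cols[11])] += 1
    return counts

# The hil-5 row from the example above, rebuilt as tab-separated fields.
wb_row = "\t".join([
    "WB", "WBGene00001856", "hil-5", "", "GO:0045910", "PMID:21873635",
    "IBA", "PANTHER:PTN000157378|SGD:S000006048", "P", "",
    "B0414.3|H1.5", "gene", "taxon:6239", "20180503", "GO_Central",
])

# Made-up row for illustration only: the GO ID and PANTHER node are placeholders.
uniprot_row = "\t".join([
    "UniProtKB", "A0A168H3E7", "lgc-29", "", "GO:0000000", "PMID:21873635",
    "IBA", "PANTHER:PTN000000000", "P", "",
    "", "protein", "taxon:6239", "20180503", "GO_Central",
])

counts = tally_gaf(["!gaf-version: 2.1", wb_row, uniprot_row])
for (db, obj_type), n in sorted(counts.items()):
    print(db, obj_type, n)
```

Any pair other than `WB`/`gene` in the output points at rows worth a closer look.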

@dustine32 - when you have time can you take a look and let me know if this is something we need to address at WB?

Thx.

dustine32 commented 4 years ago

@vanaukenk This is determined by our yearly Panther version build process, which parses UniProt's xref files to map UniProtKB IDs to MOD IDs. In looking at the current Panther version's (14.1) source files as well as the new RefProt files for this year's Panther (15.0), I see the xref connecting A0A168H3E7 to WB:WBGene00044528, so I think we're getting the right data.

But this WB xref might not have been present several versions ago when the protein was first added to Panther and, since Panther IDs (including the mappings) tend to get reused from version to version, we're probably retaining this stale mapping (currently to CELE_Y45G5AL.2) without checking if there's a better (i.e. MOD-specific) mapping.

Fixing this in the Panther build process might take some time as the ID mapping script is a Perl-y beast. Thank you for the test cases, they'll be very handy!

vanaukenk commented 3 years ago

@dustine32 @huaiyumi

It's been a bit since I first entered this ticket, but @chris-grove has noted that the presence of non-WBGene identifiers in our GOC-produced GAF is now having trickle-down effects for Alliance Mine, as some C. elegans entries do not resolve to a WBGene id.

What are the prospects for getting the non-WB gene ids in Panther upgraded to their respective WBGene ids?

If the ids cannot be upgraded in Panther directly, would it be possible to upgrade them in the src annotations files that Panther produces?

Thx, and please let us know if you need more information.

dustine32 commented 3 years ago

Sorry that this ticket got left behind @vanaukenk! Those example IDs will need some manual intervention from us to get them to pick up the correct WB ID in the next PANTHER library build.

In the meantime, as you suggested, we could just add a GAF bandaid for the IBAs similar to what we currently do with some TAIR IDs. We'd make a small UniProt->WB lookup file for the few (~200) PTHR long IDs (e.g. CAEEL|Gene_ORFName=CELE_Y45G5AL.2|UniProtKB=A0A168H3E7) where the middle ID prefix (e.g. Gene_ORFName) is something other than "WormBase". This could possibly get into the IBA release after the next GO release. @huaiyumi What do you think?
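As a rough illustration of how the entries for such a lookup file could be picked out, here is a small sketch that splits a long ID into its three parts and flags C. elegans entries whose middle prefix is not "WormBase". The function names are hypothetical, and the second long ID is only what the corrected form would presumably look like, based on the xref mentioned earlier in this thread.

```python
def parse_long_id(long_id):
    """Split 'ORG|GeneDb=GeneId|ProtDb=ProtId' into its five parts."""
    organism, gene_part, protein_part = long_id.split("|")
    gene_db, gene_id = gene_part.split("=", 1)
    prot_db, prot_id = protein_part.split("=", 1)
    return organism, gene_db, gene_id, prot_db, prot_id

def needs_wb_lookup(long_id):
    """True for CAEEL long IDs whose gene prefix is not 'WormBase'."""
    organism, gene_db, _, _, _ = parse_long_id(long_id)
    return organism == "CAEEL" and gene_db != "WormBase"

stale = "CAEEL|Gene_ORFName=CELE_Y45G5AL.2|UniProtKB=A0A168H3E7"
# Assumed shape of the corrected long ID (not copied from a PANTHER file).
fixed = "CAEEL|WormBase=WBGene00044528|UniProtKB=A0A168H3E7"

for long_id in (stale, fixed):
    print(long_id, needs_wb_lookup(long_id))
```

The UniProtKB accessions of the flagged entries would then be the keys of the UniProt->WB lookup file.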

huaiyumi commented 3 years ago

This is caused by the reference proteome data. When they don't provide the WB ID in their file, there is nothing we can do. It is important to ask them to fix the data.

vanaukenk commented 3 years ago

Thanks @dustine32 and @huaiyumi !

@dustine32 - for the look-up file, would it make sense to use the WormBase gpi file? The URL for that is in our wb.yaml file. I don't know if that will fix everything, but perhaps this would be a more general solution if other groups encounter the same id issue in the PAINT annotation file.

@huaiyumi - I don't know that much about the details of the reference proteome files and dataflow to Panther. Which file would need to be updated and who should we contact about that? At what frequency in the Panther data lifecycle do identifiers get updated?

It's possible that, at least in part, we're dealing with asynchronous release cycles amongst WB, UniProt, and Panther, but let's see how close we can get to a complete set of WB gene ids in the PAINT annotation file.

Thanks!

dustine32 commented 3 years ago

As it turns out, the WB mappings were in the upstream Reference Proteome file for 16.0, but our ID mapping script reused the same long ID since it was in the previous library. Something happened several library builds ago that prevented the WB mapping from being picked up, and the error has been passed along ever since.

I have a mechanism for "flushing" these stale IDs out so that they all pick up the right MOD mappings (so long as they are in the ref prot data). But this is for the 17.0 library build, which is likely to come out late this year. In the meantime, for the IBA GAFs, we could use the MOD GPIs to correct any UniProtKB IBAs. There's another ticket asking to use the MGI GPI to filter out obsoleted IDs, which I was a bit hesitant about, just to avoid adding further complexity.
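A sketch of what the GPI-based correction could look like, under some assumptions: the GPI is tab-separated with the object ID in column 2 and pipe-separated xrefs such as `UniProtKB:...` in column 9 (the GPI 1.2 layout as I understand it), and only GAF columns 1 and 2 are rewritten. The function names and the sample GPI row are made up for illustration; this is not the actual IBA GAF code.

```python
def uniprot_to_wb(gpi_lines):
    """Build {UniProtKB accession: WBGene ID} from GPI rows (assumed layout)."""
    mapping = {}
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or not cols[1].startswith("WBGene"):
            continue
        for xref in cols[8].split("|"):
            if xref.startswith("UniProtKB:"):
                # Keep the first gene seen if an accession appears twice.
                mapping.setdefault(xref.split(":", 1)[1], cols[1])
    return mapping

def fix_gaf_line(line, mapping):
    """Swap a UniProtKB row over to its WB gene ID when a mapping exists."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 2 and cols[0] == "UniProtKB" and cols[1] in mapping:
        cols[0] = "WB"
        cols[1] = mapping[cols[1]]
    # Column 12 (type) is deliberately left alone in this sketch.
    return "\t".join(cols)

# Made-up GPI row using the IDs mentioned in this thread.
gpi_row = "\t".join([
    "WB", "WBGene00044528", "lgc-29", "", "", "gene",
    "taxon:6239", "", "UniProtKB:A0A168H3E7", "",
])
mapping = uniprot_to_wb(["!gpi-version: 1.2", gpi_row])

# Made-up, shortened GAF row; only the first two columns matter here.
gaf_row = "\t".join(["UniProtKB", "A0A168H3E7", "lgc-29"])
print(fix_gaf_line(gaf_row, mapping))
```

Rows whose accession is not in the GPI pass through unchanged, so they could still be counted and reported separately.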

chris-grove commented 3 years ago

Hello all, I just want to add that there are also IDs that are not UniProt IDs that need to be addressed (I have no idea where they are coming from). These identifiers are identical to CDS IDs used at WB for genes. There are 30 of these used across 124 annotations:

C16D6.2b C33H5.11d C34F11.9e C34G6.4b C44E4.1d C46H11.11e F09E5.15c F16B3.1b F25E2.5d F25H5.1i F52H2.2b F53A2.8a K10G6.3g M176.9b R02C2.4b R13A5.11b T01D1.2l T05A10.5b T23D8.1b T27A1.2b W07G9.2b Y106G6D.5a Y108G3AL.7b Y111B2A.14c Y116A8C.36d Y32G9A.6d Y37A1B.1a Y50D4B.2b Y53G8AL.4b Y75B7AL.4c

These also need to get updated/mapped/resolved to a WBGene ID. I just wanted to make sure those don't get overlooked.
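One possible way to resolve these, sketched under assumptions: the trailing lowercase letter is an isoform suffix, and the remaining sequence name appears as a synonym of the gene (as B0414.3 does for WBGene00001856 in the GAF example earlier in this thread). The names `sequence_name`, `resolve_cds` and the `synonym_to_gene` dictionary are hypothetical, and `B0414.3a` is a made-up isoform ID used only to exercise the lookup.

```python
import re

# A sequence name ending in ".<digits>" followed by one lowercase isoform letter.
ISOFORM_SUFFIX = re.compile(r"^(?P<seq>.+\.\d+)[a-z]$")

def sequence_name(cds_id):
    """Strip a single trailing isoform letter, e.g. 'C16D6.2b' -> 'C16D6.2'."""
    match = ISOFORM_SUFFIX.match(cds_id)
    return match.group("seq") if match else cds_id

def resolve_cds(cds_id, synonym_to_gene):
    """Look up the CDS ID itself first, then its sequence name; None if unknown."""
    return synonym_to_gene.get(cds_id) or synonym_to_gene.get(sequence_name(cds_id))

# Hypothetical synonym table; in practice it might be built from a GPI or GAF.
synonym_to_gene = {"B0414.3": "WBGene00001856"}

print(sequence_name("C16D6.2b"))
print(resolve_cds("B0414.3a", synonym_to_gene))
print(resolve_cds("Y75B7AL.4c", synonym_to_gene))
```

IDs that come back as `None` would still need a mapping from WB.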