geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Only add GAF col 17 output (isoform) when necessary #2350

Closed kltm closed 1 month ago

kltm commented 3 months ago

In the GAF 2.2 output, now that we've switched over to the new code, column 17 is being filled when not necessary (which is a little confusing and against the spec).

@pgaudet Would you have any good examples of this for reference?

pgaudet commented 3 months ago

June 2024 GO release file has no col 17 information:

!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2024-06-19T02:24
!
UniProtKB A0A024RBG1 NUDT4B enables GO:0003723 GO_REF:0000043 IEA UniProtKB-KW:KW-0694 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20240408 UniProt
UniProtKB A0A024RBG1 NUDT4B enables GO:0046872 GO_REF:0000043 IEA UniProtKB-KW:KW-0479 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20240408 UniProt
UniProtKB A0A024RBG1 NUDT4B located_in GO:0005829 GO_REF:0000052 IDA C Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20230619 HPA
UniProtKB A0A075B6H7 IGKV3-7 involved_in GO:0002250 GO_REF:0000043 IEA UniProtKB-KW:KW-1064 P Probable non-functional immunoglobulin kappa variable 3-7 IGKV3-7 protein taxon:9606 20240408 UniProt

While the files on snapshots do:

!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2024-07-15T21:54
!
!Header from source association file:
!=================================
!
!generated-by: GOC
!
!date-generated: 2024-07-14T14:58
!=================================
!
!Documentation about this header can be found here: https://github.com/geneontology/go-site/blob/master/docs/gaf_validation.md
!
UniProtKB A0A024RBG1 NUDT4B enables GO:0003723 GO_REF:0000043 IEA UniProtKB-KW:KW-0694 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20240610 UniProt UniProtKB:A0A024RBG1
UniProtKB A0A024RBG1 NUDT4B enables GO:0005515 PMID:33961781 IPI UniProtKB:Q8NFP7 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20240608 IntAct UniProtKB:A0A024RBG1

pgaudet commented 3 months ago

Oh! I didn't see that yesterday. It looks like the GOA source file has data in col 17 everywhere:

!gaf-version: 2.2
!
!date-generated: 2024-06-20 12:51
!generated-by: UniProt
!go-version: http://purl.obolibrary.org/obo/go/releases/2024-06-19/extensions/go-plus.owl
!
UniProtKB A0A024RBG1 NUDT4B enables GO:0000298 GO_REF:0000033 IBA PANTHER:PTN000290327|RGD:1310183|SGD:S000005689|UniProtKB:O95989 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20231108 GO_Central UniProtKB:A0A024RBG1
UniProtKB A0A024RBG1 NUDT4B enables GO:0003723 GO_REF:0000043 IEA UniProtKB-KW:KW-0694 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20240610 UniProt UniProtKB:A0A024RBG1
UniProtKB A0A024RBG1 NUDT4B enables GO:0005515 PMID:33961781 IPI UniProtKB:Q8NFP7 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20240608 IntAct UniProtKB:A0A024RBG1
UniProtKB A0A024RBG1 NUDT4B enables GO:0008486 GO_REF:0000033 IBA FB:FBgn0036111|MGI:MGI:1930957|MGI:MGI:2147931|PANTHER:PTN000290327|SGD:S000005689|UniProtKB:O95989|UniProtKB:Q9NZJ9 F Diphosphoinositol polyphosphate phosphohydrolase NUDT4B NUDT4B protein taxon:9606 20230110 GO_Central UniProtKB:A0A024RBG1

Maybe we need to leave it for now - I need to check with Alex next week (he's back from vacation on Monday)

Thanks, Pascale

sierra-moxon commented 2 months ago

@pgaudet - is this something (column 17 annotation from GOA) that I need to remove during processing?

kltm commented 2 months ago

@sierra-moxon I believe the idea here is to not fill the field when it is duplicative--there is no real use (and some confusion) with "UniProtKB A0A024RBG1" in cols 1 and 2 and then "UniProtKB:A0A024RBG1" in 17.
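
For reference, the "don't fill when duplicative" rule could be sketched roughly as below. This is only an illustration, not the actual pipeline code: the function name `strip_redundant_col17` is hypothetical, and it assumes tab-delimited GAF rows with 17 columns where a redundant col 17 is exactly col 1 and col 2 joined by a colon.

```python
def strip_redundant_col17(gaf_line: str) -> str:
    """Blank GAF column 17 when it only repeats columns 1 and 2.

    Hypothetical helper for illustration; not part of any GO tooling.
    """
    # Header/comment lines start with "!" and pass through untouched.
    if gaf_line.startswith("!"):
        return gaf_line

    cols = gaf_line.rstrip("\n").split("\t")

    # Assumption: a full GAF row has 17 columns; leave shorter rows alone.
    if len(cols) < 17:
        return gaf_line

    # Col 17 is redundant when it is just "<col 1>:<col 2>",
    # e.g. "UniProtKB:A0A024RBG1" for UniProtKB / A0A024RBG1.
    if cols[16] == f"{cols[0]}:{cols[1]}":
        cols[16] = ""

    # Preserve the original line ending, if there was one.
    ending = "\n" if gaf_line.endswith("\n") else ""
    return "\t".join(cols) + ending


# Example row built from the snapshot data above (col 16 is empty).
row = "\t".join([
    "UniProtKB", "A0A024RBG1", "NUDT4B", "enables", "GO:0003723",
    "GO_REF:0000043", "IEA", "UniProtKB-KW:KW-0694", "F",
    "Diphosphoinositol polyphosphate phosphohydrolase NUDT4B",
    "NUDT4B", "protein", "taxon:9606", "20240610", "UniProt",
    "", "UniProtKB:A0A024RBG1",
])
cleaned = strip_redundant_col17(row)
```

A col 17 value that differs from cols 1 and 2 (a real isoform identifier) is left as it is, so only the duplicated case is affected.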

pgaudet commented 1 month ago

Discussed with Alex. This is not very critical, so we can leave the same data as in col 1/2.