geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Change version of GPAD output from 1.2 to something else that's more appropriate #2342

Open kltm opened 1 month ago

kltm commented 1 month ago

Currently, the GPAD output from minerva's gpad pipeline (which is technically publicly available in products/upstream_and_raw_data) looks like:

!gpa-version: 1.1
!collation date: 2024-07-03
!collated from production models in https://github.com/geneontology/noctua-models/ where col1 matches fb; special rules for MGI (https://github.com/geneontology/pipeline/issues/313)

However, the actual output does not conform to a GPAD 1.1 spec, as there is none, nor does it conform to any other public spec. For people writing parsers for these files or otherwise using them, like @alexsign and MGI, we'd like to indicate that this pseudo-internal format does not really conform to anything.

Noting that this is in collate-gpads.pl, so it's trivial to change.

kltm commented 1 month ago

Any thoughts on the right header here?

I think I'd be happy with no gpa-version, as there is no official spec. However, I suspect that this would probably break us on the ontobio side, as there is likely an implied parser for this in the code.

Tagging @pgaudet and @balhoff, @dustine32, @mugitty, and @sierra-moxon

dustine32 commented 1 month ago

Looks like a version line is desired but not required in the GPAD parser (there is a self.default_version, which is currently 1.2), and the regex is permissive, accepting both gpa-version and gpad-version:

parser_version_regex = re.compile(r"!([\w]+)-version:[\s]*([\d]+\.[\d]+(\.[\d]+)?)")
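
To illustrate the behavior described above, here is a minimal sketch of how that regex treats the header lines from this issue. The `detect_version` helper and the `DEFAULT_VERSION` constant are hypothetical names for illustration only; whether ontobio uses `match` or `search`, and how it applies the default, is an assumption here.

```python
import re

# Regex as quoted in the comment above.
parser_version_regex = re.compile(r"!([\w]+)-version:[\s]*([\d]+\.[\d]+(\.[\d]+)?)")

# Assumption: stands in for the self.default_version mentioned above.
DEFAULT_VERSION = "1.2"


def detect_version(header_lines):
    """Return (prefix, version) from the first version header line.

    Falls back to (None, DEFAULT_VERSION) when no version line is present.
    """
    for line in header_lines:
        match = parser_version_regex.match(line)
        if match:
            return match.group(1), match.group(2)
    return None, DEFAULT_VERSION


src_header = ["!gpa-version: 1.1", "!collation date: 2024-07-03"]
print(detect_version(src_header))
print(detect_version(["!gpad-version: 1.2"]))
print(detect_version(["!collation date: 2024-07-03"]))
```

Under these assumptions, both the `gpa-` and `gpad-` prefixes are accepted, and a file with no version line falls through to the default.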

kltm commented 1 month ago

@dustine32 Is there a different code path for "1.1"?

dustine32 commented 1 month ago

@kltm Is 1.1 being written out by either minerva or that collate-gpads.pl perl script?

kltm commented 1 month ago

@dustine32 minerva is just writing lines; there is a secondary script that is collating them and adding a header.

krchristie commented 1 month ago

@kltm - The MGI GO group discussed this at our meeting this morning (7/11/24) and we no longer use this file. We are waiting for the full file of all mouse GO annotations that Sierra is working on.

dustine32 commented 1 month ago

@kltm Ah right! Looks like it's here: https://github.com/geneontology/go-site/blob/feeac4959065e2b868147108fdc6cd0c55e4b3f9/scripts/collate-gpads.pl#L80

kltm commented 1 month ago

Noting that the -src file is 1.1 (from collate-gpads.pl), while the "final" file is 1.2. The ontobio processing step is (now?) adjusting the header to match the parser that was used.

kltm commented 1 month ago

Okay, I feel pretty confident that, due to the under-specification of GPAD 1.1 (https://geneontology.org/docs/gene-product-association-data-gpad-format-1.1/), the moving targets, and the lack of documentation for things like qualifier, annotation extension, and annotation properties, it's a little hard to call the minerva -src output GPAD 1.1. Moreover, the public GPAD 1.1 information often refers to itself as GPAD 1.2, which just adds to the confusion. There is no technical doc for GPAD 1.1 in the go-annotation repo.

While there is no public doc for GPAD 1.2, there is a (draft) tech doc in the annotation repo (https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md). Amusingly, the GPAD 1.2 tech doc also conflates 1.1 and 1.2. The spec has a mess of bad links, TODOs, and references to upcoming or otherwise non-existent documentation. While hard to use as a full-on spec, it seems like the GPAD output of ontobio is reasonably close, with nothing that caught my eye as a "nope" right away.

Diffing the -src and "final" file from minerva output, there are only minor differences. With that, I'd suggest changing the 1.1 in the -src file to 1.2 and calling this done.
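
As a rough illustration of the proposed change, the sketch below rewrites only the version header of a GPAD file and leaves every other line untouched. It is not the actual fix, which would be a one-line edit in the Perl script collate-gpads.pl; `set_gpad_version` and `VERSION_LINE` are hypothetical names, and lines are assumed to have had their trailing newlines stripped.

```python
import re

# Hypothetical pattern: matches a "!gpa-version:" or "!gpad-version:" header line.
VERSION_LINE = re.compile(r"^!(gpad?)-version:\s*\d+\.\d+(\.\d+)?\s*$")


def set_gpad_version(lines, new_version="1.2"):
    """Return a copy of lines with any version header set to new_version."""
    out = []
    for line in lines:
        match = VERSION_LINE.match(line)
        if match:
            # Keep the original prefix (gpa or gpad); change only the number.
            out.append(f"!{match.group(1)}-version: {new_version}")
        else:
            out.append(line)
    return out


src = [
    "!gpa-version: 1.1",
    "!collation date: 2024-07-03",
]
for line in set_gpad_version(src):
    print(line)
```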

Checking in with @pgaudet on this proposal.