geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Assembly Step for new pipeline kernel: Questions and Strategies #1676

Open dougli1sqrd opened 3 years ago

dougli1sqrd commented 3 years ago

In https://github.com/geneontology/pipeline/issues/206 we're taking steps to reform the pipeline kernel. Currently, @dustine32 and I are working on the Assembly step ("shovel2pile"), which should take "pristine", validated annotations in gpad+gpi format and merge any mixin gpads into the final product.

For example, we have mgi and paint_mgi. At the end of the run, a validated paint_mgi will be merged into a validated mgi, and their corresponding headers will also be joined, to produce the final mgi dataset product.
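A minimal sketch of that merge, assuming header lines are the ones starting with `!` and that "joining" headers means concatenating them while keeping exact duplicates once. The function names and the sample rows are hypothetical placeholders, not real pipeline code or complete GPAD records.

```python
def split_header(text):
    """Separate '!'-prefixed header lines from annotation lines."""
    header, body = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        (header if line.startswith("!") else body).append(line)
    return header, body


def merge_gpad(base_text, mixin_text):
    """Merge a validated mixin (e.g. paint_mgi) into a validated base (e.g. mgi)."""
    base_header, base_body = split_header(base_text)
    mixin_header, mixin_body = split_header(mixin_text)

    # Assumption: join headers by appending the mixin's header lines,
    # dropping any line that is already present verbatim.
    merged_header = list(base_header)
    for line in mixin_header:
        if line not in merged_header:
            merged_header.append(line)

    # Annotation rows are simply concatenated, base first.
    return "\n".join(merged_header + base_body + mixin_body) + "\n"


# Simplified placeholder content, not full GPAD rows
mgi = "!gpa-version: 1.1\n!generated-by: MGI\nMGI:1\tenables\tGO:0001\n"
paint_mgi = "!gpa-version: 1.1\n!generated-by: PAINT\nMGI:2\tenables\tGO:0002\n"
merged = merge_gpad(mgi, paint_mgi)
```

Both inputs are assumed to have passed validation already, so the sketch does no checking of its own.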

Here we discuss various strategies for this:

kltm commented 3 years ago

Basically, I think the thing to argue against is as follows: Let's say we have a directory with all valid products in it. Without the metadata file, would there ever be a situation where I couldn't eyeball it and assemble things correctly? (I think that not bothering with the metadata might also put us in a better situation to pivot to species-orientation.)

dougli1sqrd commented 3 years ago

We definitely could eyeball it and figure it out. But that's only because of how the names historically happen to line up: paint_mgi goes into mgi. If we did it this way, then the name would convey real semantic meaning. Which is fine if we want to do that, but I think I feel mild discomfort about it? Maybe it just feels brittle. But I'm definitely not opposed. We'd have to document this fact somewhere.

Although, now that I'm saying this, we are the ones who control all the "mixins", so the naming convention is mostly on us anyway. I'm less uncomfortable with that, since realistically we will control most of the mixin sources.
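To make the name-based option concrete, here is a rough sketch of assembling purely from dataset names, with no metadata file. It assumes a mixin is recognized only by a known prefix (`paint_` is the one example from this thread); `MIXIN_PREFIXES` and `group_by_name` are hypothetical names, not existing pipeline code.

```python
# Assumption: every mixin is named "<prefix><base dataset name>"
MIXIN_PREFIXES = ("paint_",)


def group_by_name(dataset_names):
    """Map each base dataset to the mixins whose names point at it."""
    names = set(dataset_names)
    groups = {}
    for name in sorted(names):
        # Strip the first matching mixin prefix, if any
        base = next(
            (name[len(p):] for p in MIXIN_PREFIXES if name.startswith(p)),
            None,
        )
        if base is None:
            groups.setdefault(name, [])  # a base dataset
        elif base in names:
            groups.setdefault(base, []).append(name)  # mixin with a home
        else:
            # The brittleness in question: the name is the only link
            raise ValueError(f"mixin {name!r} has no base dataset {base!r}")
    return groups


groups = group_by_name(["mgi", "paint_mgi", "wb"])
```

Raising on an orphaned mixin is one way to surface a broken naming assumption early instead of silently dropping the file.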