Assembly Step for new pipeline kernel: Questions and Strategies

dougli1sqrd commented 3 years ago

In https://github.com/geneontology/pipeline/issues/206 we're taking steps to reform the pipeline kernel. Currently, @dustine32 and I are working on the Assembly step ("shovel2pile"), which should take "pristine", validated annotations in gpad+gpi format and merge any mixin gpads into the final product.

For example, we have mgi and paint_mgi. At the end of the run, a validated paint_mgi will be merged into a validated mgi, and their corresponding headers will also be joined, to produce the final mgi dataset product.
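As a rough illustration of that merge, here is a minimal sketch in Python. It assumes header lines are the ones starting with `!`, and that exact-duplicate header lines (such as a repeated version line) can be dropped. The function names `split_header` and `assemble` are hypothetical and are not part of goat.

```python
from typing import Iterable, List, Tuple


def split_header(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate '!'-prefixed header lines from annotation lines."""
    header: List[str] = []
    body: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("!"):
            header.append(line)
        else:
            body.append(line)
    return header, body


def assemble(dataset: Iterable[str], mixins: Iterable[Iterable[str]]) -> List[str]:
    """Joined headers (dataset first, then mixins) followed by all annotations."""
    headers: List[str] = []
    bodies: List[str] = []
    for source in [dataset, *mixins]:
        header, body = split_header(source)
        for h in header:
            # Assumption: an identical header line is only worth keeping once.
            if h not in headers:
                headers.append(h)
        bodies.extend(body)
    return headers + bodies
```

With this shape, the final mgi product would be `assemble(mgi_valid_lines, [paint_mgi_valid_lines])`. How the joined header should really treat version lines or provenance is still an open question for this step.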

Here we discuss various strategies for this:

Final <dataset> = Sum[<dataset>.header, <mixin0>.header, <mixin1>.header, ...] + Sum[<dataset>, <mixin0>, <mixin1>, ...]
- Example: paint_mgi will have in its metadata: merges_into: mgi.
  1. download-annotation-sources.py annotations -g mgi -g paint -x [the rest of paint]
    - sources: mgi.gpad, paint_mgi.gpad,
  2. goat pristine sources/
    - pristine: mgi_valid.gpad, paint_mgi_valid.gpad
  3. goat assemble
    - assemble: mgi.gpad (contains mgi_valid and paint_mgi_valid), paint_mgi.gpad
- So how does assemble know that paint_mgi_valid should be mixed into mgi_valid?
  - mgi_valid -> mgi; paint_mgi_valid -> paint_mgi; <mixin>_<dataset>
    - paint_mgi is a mixin because, when we match it against the <mixin>_<dataset> pattern, the <dataset> part matches an existing source, namely "mgi".
    - We find potential mixins by the filename, and separate on the first underscore. If we get a mixin pattern, we can check if the <dataset> part of the name corresponds to an existing file in "pristine". If it does, then we have a <dataset>, and a <mixin>_<dataset> match.
    - We can then look at the datasets yaml. For a mixin <group>_<dataset>, look in <group>.yaml for a <dataset> entry and check whether it has merges_into: <dataset>. If so, we can confirm that this mixin should merge into the given dataset name.
    - A drawback with this is we're very tied to the filenames and dataset names
  - Alternatively: instead of the mixin metadata yamls saying what they merge into, we change the metadata so that primary datasets state which mixins they want. Example: mgi would have: "has_mixin": ["paint_mgi"]
    - For every file in "pristine", we look up the metadata entry for that file, and look for any mixins. If we also have a file with the mixin name, we perform the mixin logic above.
    - Drawback: This requires formally changing the metadata yamls.
    - This seems ultimately easier though, and less brittle to filename/dataset name changes.
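
To compare the two strategies side by side, here is a minimal Python sketch. It makes several assumptions: files in "pristine" are named `<name>_valid.gpad`, and the metadata has already been loaded into plain dicts (`merges_into` and `has_mixin` stand in for the yaml lookups). All function names are hypothetical, not existing goat code.

```python
from typing import Dict, Iterable, List

SUFFIX = "_valid.gpad"  # assumed naming of files in "pristine"


def dataset_name(filename: str) -> str:
    """'paint_mgi_valid.gpad' -> 'paint_mgi' (assumes the suffix above)."""
    return filename[: -len(SUFFIX)] if filename.endswith(SUFFIX) else filename


def mixins_by_filename(names: Iterable[str],
                       merges_into: Dict[str, str]) -> Dict[str, List[str]]:
    """Strategy 1: split on the first underscore, then confirm via merges_into."""
    names = list(names)
    present = set(names)
    plan: Dict[str, List[str]] = {}
    for name in names:
        _group, sep, dataset = name.partition("_")
        if not sep or dataset not in present:
            continue  # not a <mixin>_<dataset> name with a matching dataset
        if merges_into.get(name) == dataset:
            plan.setdefault(dataset, []).append(name)
    return plan


def mixins_by_metadata(names: Iterable[str],
                       has_mixin: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Strategy 2: each primary dataset declares the mixins it wants."""
    names = list(names)
    present = set(names)
    plan: Dict[str, List[str]] = {}
    for name in names:
        wanted = [m for m in has_mixin.get(name, []) if m in present]
        if wanted:
            plan[name] = wanted
    return plan
```

Both functions return the same kind of plan, for example `{"mgi": ["paint_mgi"]}`, so the merge itself would not need to know which strategy produced it. Only the first one depends on the `<mixin>_<dataset>` naming convention.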

kltm commented 3 years ago

Basically, I think the thing to argue against is as follows: Let's say we have a directory with all valid products in it. Without the metadata file, would there ever be a situation where I couldn't eyeball it and assemble things correctly? (I think that not bothering with the metadata might also put us in a better situation to pivot to species-orientation.)

dougli1sqrd commented 3 years ago

We definitely could eyeball it and figure it out. But that's only because of how the names historically happen to line up: paint_mgi goes into mgi. If we did it this way, then the name would convey real semantic meaning. Which is fine if we want to do that, but I think I feel mild discomfort about it? Maybe it just feels brittle. But I'm definitely not opposed. We'd have to document this fact somewhere.

Although, now that I'm saying this, we are the ones that control all the "mixins", so the naming convention is mostly on us anyway. I'm less discomforted by that since realistically we will mostly control the mix-in sources.

geneontology / go-site

Assembly Step for new pipeline kernel: Questions and Strategies #1676