Open dougli1sqrd opened 3 years ago
Basically, I think the thing to argue against is as follows: Let's say we have a directory with all valid products in it. Without the metadata file, would there ever be a situation where I couldn't eyeball it and assemble things correctly? (I think that not bothering with the metadata might also put us in a better situation to pivot to species-orientation.)
We definitely could eyeball it and figure it out. But that's only because of how the names historically happen to line up: paint_mgi goes into mgi. If we did it this way, then the name would convey real semantic meaning. Which is fine if we want to do that, but I think I feel mild discomfort about it? Maybe it just feels brittle. But I'm definitely not opposed. We'd have to document this fact somewhere.
Although, now that I'm saying this, we are the ones that control all the "mixins", so the naming convention is mostly on us anyway. I'm less discomforted by that since realistically we will mostly control the mix-in sources.
In https://github.com/geneontology/pipeline/issues/206 we're taking steps to reform the pipeline kernel. Currently, @dustine32 and I are working on the Assembly step ("shovel2pile"), which should take "pristine", validated annotations in gpad+gpi format and merge any mixin gpads into the final product.
For example, we have mgi and paint_mgi. At the end of the run, a validated paint_mgi will be merged into a validated mgi, and their corresponding headers will also be joined, to produce the final mgi dataset product.
Here we discuss various strategies for this:
Final <dataset> = Sum[<dataset>.header, <mixin0>.header, <mixin1>.header, ...] + Sum[<dataset>, <mixin0>, <mixin1>, ...]
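To make the formula above concrete, here is a rough Python sketch of the assembly. It is only an illustration, not the actual shovel2pile code: `split_header` and `assemble` are made-up names, files are modelled as lists of lines, and header lines are assumed to be the ones starting with `!`. It concatenates headers as-is and does not deduplicate repeated header lines such as a format version declaration.

```python
from typing import Iterable, List, Tuple


def split_header(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate header lines (assumed to start with '!') from annotation lines."""
    header: List[str] = []
    body: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("!"):
            header.append(line)
        else:
            body.append(line)
    return header, body


def assemble(dataset: Iterable[str], mixins: Iterable[Iterable[str]]) -> List[str]:
    """Final = Sum[headers] + Sum[annotations], dataset first, then mixins in order."""
    headers: List[str] = []
    bodies: List[str] = []
    for source in [dataset, *mixins]:
        header, body = split_header(source)
        headers.extend(header)
        bodies.extend(body)
    return headers + bodies
```

With toy inputs, `assemble(mgi_lines, [paint_mgi_lines])` would return all the `mgi` header lines, then the `paint_mgi` header lines, then the annotation lines in the same order.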
`merges_into: mgi`

`mgi_valid -> mgi; paint_mgi_valid -> paint_mgi;`

`<mixin>_<dataset>`: `<dataset>` matches an existing source, namely "mgi".

The `<dataset>` part of the name corresponds to an existing file in "pristine". If it does, then we have a `<dataset>`, and a `<mixin>_<dataset>` match.

For `<group>_<dataset>`, look in `<group>.yaml` for a `<dataset>` entry, and see if it has `merges_into: <dataset>`. If so, we can confirm that this mixin should merge into the given dataset name.

`"has_mixin": ["paint_mgi"]`
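For comparison, the name-based and metadata-based lookups could look roughly like the Python sketch below. Everything here is hypothetical: the function names are invented, `pristine` is just a set of product names, and `group_metadata` is an in-memory stand-in for whatever the parsed `<group>.yaml` files actually contain (assumed shape: group name, then dataset name, then an entry that may carry `merges_into`).

```python
from typing import Dict, List, Optional, Set


def target_by_name(product: str, pristine: Set[str]) -> Optional[str]:
    """Naming convention: '<mixin>_<dataset>' merges into '<dataset>'
    only if '<dataset>' is itself a product in pristine."""
    if "_" not in product:
        return None
    _mixin, dataset = product.split("_", 1)
    return dataset if dataset in pristine else None


def target_by_metadata(
    product: str, group_metadata: Dict[str, Dict[str, dict]]
) -> Optional[str]:
    """Metadata check: for '<group>_<dataset>', look up the '<dataset>' entry
    under '<group>' and return its 'merges_into' value, if any."""
    group, _, dataset = product.partition("_")
    entry = group_metadata.get(group, {}).get(dataset)
    if entry is None:
        return None
    return entry.get("merges_into")


def mixins_by_dataset(pristine: Set[str]) -> Dict[str, List[str]]:
    """Group mixins under their target dataset, similar in spirit to a
    'has_mixin' listing, using only the naming convention."""
    result: Dict[str, List[str]] = {}
    for product in sorted(pristine):
        target = target_by_name(product, pristine)
        if target is not None:
            result.setdefault(target, []).append(product)
    return result
```

Note that a product like `goa_human` is not mistaken for a mixin here, because `human` is not a product in pristine. That is the brittleness trade-off discussed in this thread: the name alone carries the meaning, and the metadata lookup is only a confirmation.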