geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Ensure that GPAD (2.0) can be used as a primary input into the pipeline #1443

Open kltm opened 4 years ago

kltm commented 4 years ago

As we move forward, with the active Noctua imports setting the deadline, the pipeline needs to be able to process GPAD 2.0 as a primary input component (GAF being the only other primary data component at this time).

Tied to this, to actually make use of GPAD elsewhere in the pipeline (e.g. to produce GAF for AmiGO to consume), we'll also need to make use of GPI 2.0 files.

Tagging @dougli1sqrd

dougli1sqrd commented 4 years ago

Datasets YAML proposal: GPAD datasets should have a new field, gpi, that is a list of dataset IDs within the group referring to the GPI files required for download and processing along with the GPAD:

id: foo.gpad
dataset: foo
gpi:
  - foo.gpi
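
As a rough sketch of how the pipeline could consume the proposed field (the gpi key and the helper below are hypothetical, not existing ontobio code), resolving a GPAD dataset would also pull in its declared GPI siblings:

def collect_required_datasets(group, dataset_id):
    """Return the requested dataset plus any GPI datasets it declares via the proposed 'gpi' field."""
    by_id = {d["id"]: d for d in group["datasets"]}
    target = by_id[dataset_id]
    required = [target]
    # The proposed 'gpi' field lists sibling dataset IDs within the same group.
    for gpi_id in target.get("gpi", []):
        required.append(by_id[gpi_id])
    return required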

One question I have, though, is which datasets we consider to be the "correct" canonical datasets for a group. Now that we are expecting multiple types of files (gaf or gpad), we need a way of deciding which one is more canonical.

For example, in the goa.yaml group, we have 3 entries that have dataset: goa_chicken: goa_chicken.gaf, goa_chicken.gpad, and goa_chicken.gpi. This essentially represents 2 actual sources, as gpad+gpi is one and gaf is the other. So even if we linked the gpad to its gpi, we would still have to know which to use as the canonical one.

We could just go totally one-to-one: the upstream goa_chicken.gaf gets processed into our canonical goa_chicken.gaf, and the upstream goa_chicken.gpad/gpi gets processed into our goa_chicken.gpad/gpi. By contrast, our current process takes GAF and then produces all types from that one file, so we have a one-to-many mapping.

We could also indicate in the yaml file which datasets are the blessed ones to use. Like:

# In goa.yaml
id: goa
sources:
  - goa_chicken.gpad
  - goa_chicken_isoform.gpad
  - goa_human.gpad
  - goa_cow.gaf
  ...
datasets:
  - id: goa_chicken.gpad
    gpi: [goa_chicken.gpi]

This allows the providers to say that this gaf source should be used, while that gpad source should be used.
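
As a rough illustration of how the pipeline could interpret such a group-level sources list (the field and helper here are hypothetical, not part of the current schema):

def canonical_datasets(group):
    """Keep only the datasets that the group has blessed via the proposed 'sources' list."""
    blessed = set(group.get("sources", []))
    return [d for d in group["datasets"] if d["id"] in blessed]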

Or perhaps there should be a flag in the datasets stanzas?

Ultimately, we need a way to map to our outputs, and with 3 sources that have dataset: foo in a file, it's unclear which to use. What do we think?

kltm commented 4 years ago

@dougli1sqrd I think deciding what is canonical can be left to policy: I can think of no reason to ever have redundant datasets coming from a resource, and we should work to prevent that.

dougli1sqrd commented 4 years ago

Ah okay, so we should have goa decide either goa_human gaf or goa_human gpad+gpi, and then remove the redundant one? I wonder if we can change the schema to allow only unique dataset key values.

kltm commented 4 years ago

GPI can be there either way; the question is whether the GAF or the GPAD(+GPI) should be the "primary" data source, assuming they aren't two different things. The assumption should be that, unless otherwise marked, all data is processed through the pipeline. I would either comment out entries or add a tag for "inactive" (don't we have this?) to track things that should not be processed normally.

dougli1sqrd commented 4 years ago

Yeah, we have a key for active, I think. Currently we ignore it, but we could start using it. So in our goa_human case, since there's one entry for gaf and one for gpad+gpi, one of them must be inactive (or gone completely).

Yeah, if we could devise a schema that detected this, that would be lovely. Otherwise we'll need validation logic: if ontobio detects more than one source type for a dataset, then we'll be unable to proceed and we'll exit with an error, I suppose.
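
A minimal sketch of that validation logic, assuming the dataset stanzas carry dataset, type, and status keys as in the goa.yaml example further down (the function itself is hypothetical, not existing ontobio or sanity-check code):

from collections import defaultdict

def check_unique_annotation_sources(datasets):
    """Error out if any dataset name has more than one active annotation source type (e.g. both gaf and gpad)."""
    types_by_name = defaultdict(set)
    for d in datasets:
        # Skip stanzas explicitly marked as not active.
        if d.get("status", "active") != "active":
            continue
        if d.get("type") in ("gaf", "gpad"):
            types_by_name[d["dataset"]].add(d["type"])
    conflicts = {name: sorted(types) for name, types in types_by_name.items() if len(types) > 1}
    if conflicts:
        raise SystemExit("Multiple active annotation sources for: {}".format(conflicts))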

kltm commented 4 years ago

Is the issue that there is a single identifier for a dataset, so they collide? Either way, even if it cannot be encoded in a schema checker, it can be enforced; see https://github.com/geneontology/go-site/blob/master/scripts/sanity-check-users-and-groups.py

dougli1sqrd commented 4 years ago

Essentially, yes. The dataset key is the name of the upstream source, independent of type, so goa_human.gaf has a dataset name of goa_human. It's what we use in ontobio to determine paths, file names, etc., and it's what we use when grabbing the download URL for a given type of a dataset: we ask for datasets named X, of type Y, which filters the list down to one. The id field contains the .gaf, etc. extension, which is messier when determining file names.
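
Roughly, that lookup amounts to filtering on the two keys; a sketch only, since the real ontobio code differs in its details:

def find_dataset(datasets, name, dtype):
    """Return the single stanza matching a dataset name and type, e.g. ('goa_human', 'gpad')."""
    matches = [d for d in datasets if d["dataset"] == name and d["type"] == dtype]
    if len(matches) != 1:
        raise ValueError("Expected exactly one {} dataset named {}, found {}".format(dtype, name, len(matches)))
    return matches[0]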

As for enforcing scripts, perfect!

kltm commented 4 years ago

I would propose then that this is an issue of policy:

pgaudet commented 4 years ago

Do we have cases where there is more than one file?

kltm commented 4 years ago

Yes, which is what started this thread; see @dougli1sqrd's example at the top.

dougli1sqrd commented 4 years ago

 -
   id: goa_chicken.gpad
   label: "goa_chicken gpad file"
   description: "gpad file for goa_chicken from EBI Gene Ontology Annotation Database"
   url: http://current.geneontology.org/annotations/goa_chicken.gpad.gz
   type: gpad
   gpi: [ goa_chicken.gpi ]
   dataset: goa_chicken
   submitter: goa
   compression: gzip
   source: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/goa_chicken.gpa.gz
   entity_type: protein
   status: active
   species_code: Ggal
   taxa:
    - NCBITaxon:208524

dougli1sqrd commented 4 years ago

For specifically enabling gpad 2.0, that tracking will be done here: https://github.com/geneontology/go-site/issues/1453

dougli1sqrd commented 4 years ago

Things that will have to happen in ontobio:

dougli1sqrd commented 4 years ago

Bringing in the context of issue #1384 (https://github.com/geneontology/go-site/issues/1384#issuecomment-614943429): we would like to move the Paint, etc. mixin process to after we have downloaded and validated all files.

However, as we see in my comment in #1384:

In ontobio, the order of operations will make this difficult. Currently ontobio operates in this order:

1. Produce pristine GAF
2. Make GPI
3. Mix in datasets (example: paint_fb.gaf merges into fb.gaf)
4. Make the rest of our products (gpad, ttl)

Step 3 is what this issue addresses. But if step 4 is dependent on step 3, we will need to resolve this difficulty in order to complete this issue.
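
To make the step 3/step 4 dependency concrete, here is an illustrative caricature of the current ordering (hypothetical stand-in functions only, not actual validate.py code):

# Hypothetical stand-ins for the real ontobio steps, just to show the data flow.
def produce_pristine_gaf(dataset):
    return dataset + ".gaf"                      # step 1: validated, filtered GAF

def make_gpi(gaf_path):
    return gaf_path.replace(".gaf", ".gpi")      # step 2: GPI derived from the GAF

def mixin_datasets(gaf_path, extra_paths):
    return [gaf_path] + list(extra_paths)        # step 3: e.g. paint_fb.gaf merged into fb.gaf

def make_products(merged, gpi_path):
    return {"gpad": merged, "ttl": merged, "gpi": gpi_path}  # step 4: downstream products

gaf = produce_pristine_gaf("fb")
gpi = make_gpi(gaf)
merged = mixin_datasets(gaf, ["paint_fb.gaf"])
products = make_products(merged, gpi)            # step 4 consumes step 3's output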

1) If we want the implementation of this ticket to be easier, we could theoretically tackle #1384 at the same time. This would be easier since the mixin algorithm would not have to go through the same paradigm shift as the rest of the current validate.py code. Separating this out simplifies each case.

2) However, the validation and production model puts us in a sticky situation, as production of the other products depends on the mixin. Pulling out this step as we would like would mean really rethinking how we go about validating and producing annotations.

3) One solution is to continue in the current model, essentially giving up on #1384 in the mid-term and re-orienting the existing mixin functionality to support GPAD. There's some risk here, though, in that #1384 would help fix some mid-run download issues that occasionally kill the pipeline partway through.

Additional notes on this: GPAD implementation as a primary product is yielding ground, but slowly. We've built up a lot of moving parts in the processing system over time, building in certain assumptions, even though we tried to be relatively defensive from the beginning.

I foresee a large testing period after the first "working" PR gets done to ensure that this will pan out the way we would like.

suzialeksander commented 2 months ago

@kltm is this still an outstanding problem?

kltm commented 2 months ago

@suzialeksander This is still "open". Our recent work has been on outputs, not inputs. (I'd note that our roadmap may shift enough that this falls right off.)