geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

GAF/GPAD downloader script generates incorrect paths when the dataset metadata `compression` field is the empty string #1621

Open dougli1sqrd opened 3 years ago

dougli1sqrd commented 3 years ago

Semi-recently, pombase updated their pombase.yaml dataset metadata entry, which we use to drive the pipeline, to include uncompressed gpad/gpi files. The gaf entry is unaffected.

To illustrate:

   id: pombase.gpi
   label: "pombase gpi file"
   description: "gpi file for pombase from PomBase"
   url: http://current.geneontology.org/annotations/pombase.gpi.gz
   type: gpi
   dataset: pombase
   submitter: pombase
   compression: ''
   source: https://www.pombase.org/data/annotations/Gene_ontology/pombase.gpi
   entity_type:
   status: active
   species_code: Spom
   taxa:
    - NCBITaxon:284812
    - NCBITaxon:4896

We see that the compression is the empty string (a valid value), meaning this file is not compressed upon download. This field tells tools not to attempt to decompress the file. The compression field also affects the download path: if compression is gzip, for example, the downloaded file name will be <dataset>-src.<type>.gz, or concretely pombase-src.gaf.gz.

Here's where the bug enters. The above path construction is applied to whatever exists in the compression field, as long as it is not None. So for the above gpi case, the paths generated on download look like pombase-src.gpad. -- with a trailing period.
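To make the failure mode concrete, here is a minimal Python sketch of the path construction as described above. It is not the actual downloader code: the function names `source_path_buggy` and `source_path_fixed` and the `COMPRESSION_EXT` mapping are hypothetical, introduced only for illustration.

```python
from typing import Optional

# Assumed mapping from the metadata `compression` value to a file extension.
COMPRESSION_EXT = {"gzip": "gz"}


def source_path_buggy(dataset: str, file_type: str, compression: Optional[str]) -> str:
    """Mimic the reported behaviour: only None counts as 'no compression'."""
    path = f"{dataset}-src.{file_type}"
    if compression is not None:
        # An empty string passes this check, so "." plus "" gets appended.
        ext = COMPRESSION_EXT.get(compression, compression)
        path = f"{path}.{ext}"
    return path


def source_path_fixed(dataset: str, file_type: str, compression: Optional[str]) -> str:
    """One possible fix: treat both None and '' as 'no compression'."""
    path = f"{dataset}-src.{file_type}"
    if compression:  # falsy for None and for the empty string
        ext = COMPRESSION_EXT.get(compression, compression)
        path = f"{path}.{ext}"
    return path


print(source_path_buggy("pombase", "gpad", ""))     # pombase-src.gpad.
print(source_path_fixed("pombase", "gpad", ""))     # pombase-src.gpad
print(source_path_fixed("pombase", "gaf", "gzip"))  # pombase-src.gaf.gz
```

The only difference between the two functions is the truthiness check, which leaves the gzip case unchanged while dropping the trailing period for uncompressed sources.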

This then affects all other path calculations. For example, the downloader can be instructed to ensure that every source is available both zipped and unzipped for convenience. In this case, for the above entry, zipping pombase-src.gpad. yields pombase-src.gpad..gz -- with two periods between gpad and gz.
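The knock-on effect can be sketched in a few lines of Python. The helper `unzipped_and_zipped` is hypothetical and only assumes that the zipped name is formed by appending .gz to the unzipped one; the real downloader may derive these names differently.

```python
from typing import Tuple


def unzipped_and_zipped(path: str) -> Tuple[str, str]:
    """Return (unzipped_name, zipped_name) for a downloaded source path."""
    if path.endswith(".gz"):
        # Already zipped: strip the suffix to get the unzipped name.
        return path[: -len(".gz")], path
    # Not zipped: the zipped name is the same path with ".gz" appended.
    return path, f"{path}.gz"


# A path that already carries the stray trailing period propagates it.
print(unzipped_and_zipped("pombase-src.gpad."))   # ('pombase-src.gpad.', 'pombase-src.gpad..gz')
# A correctly built path yields the expected pair.
print(unzipped_and_zipped("pombase-src.gpad"))    # ('pombase-src.gpad', 'pombase-src.gpad.gz')
print(unzipped_and_zipped("pombase-src.gaf.gz"))  # ('pombase-src.gaf', 'pombase-src.gaf.gz')
```

Under that assumption, nothing is wrong with the derived-name step itself; fixing the initial path construction is enough to remove the doubled period.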

So far this hasn't affected anyone, because the pombase change above is fairly recent, the pipeline currently does not deal in gpad/gpi sources, and all other sources are provided zipped. As we switch to gpi/gpad, though, this will become an issue with pombase, and it would also show up if anyone decided to provide their gaf uncompressed.

pgaudet commented 3 years ago

Does the schema specify which fields are mandatory? @vanaukenk, @suzialeksander and I are in the process of reviewing the data flow of each group, and we'd like to know what we need to look out for.

Thanks, Pascale

kltm commented 3 years ago

@pgaudet The schema is at https://github.com/geneontology/go-site/blob/master/metadata/datasets.schema.yaml. That said, this is a specific (currently non-dangerous) bug in the downloader code.

pgaudet commented 3 years ago

Thanks @kltm

So the schema should change to have

   "compression":
     type: str
     required: false -> true

Is that right?

Thanks, Pascale

dougli1sqrd commented 3 years ago

No, the schema is correct, as is pombase's dataset yaml file. The bug is around how one of our tools is handling that file. The bug was merely exposed on the pombase.yaml file. But the bug is in the tool. pombase.yaml here is correct.

pgaudet commented 3 years ago

OK thanks!