Open dougli1sqrd opened 3 years ago
Does the schema specify which fields are mandatory? @vanaukenk, @suzialeksander and I are in the process of reviewing the data flow of each group, and we'd like to know what we need to look out for.
Thanks, Pascale
@pgaudet The schema is at https://github.com/geneontology/go-site/blob/master/metadata/datasets.schema.yaml
That said, this is a specific--currently non-dangerous--bug in the downloader code.
Thanks @kltm
So the schema should change to have
"compression": type: str required: false -> true
Is that right?
Thanks, Pascale
No, the schema is correct, as is pombase's dataset yaml file. The bug is around how one of our tools is handling that file. The bug was merely exposed on the pombase.yaml file. But the bug is in the tool. pombase.yaml here is correct.
OK thanks!
Semi-recently pombase updated their `pombase.yaml` dataset metadata entry that we use to drive the pipeline to include uncompressed gpad/gpi files. The gaf entry is unaffected. To illustrate:
We see that the `compression` is the empty string (a valid value), meaning this file is not compressed upon download. This field tells tools not to attempt to decompress the file. Additionally, the `compression` field affects the download path. If `compression` is `gzip`, for example, the downloaded file name will be `<dataset>-src.<type>.gz`, or concretely `pombase-src.gaf.gz`.
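As a rough sketch of the naming convention described above (this is not the actual downloader code; the function name `expected_source_name` and the `SUFFIXES` table mapping `gzip` to `gz` are assumptions made for illustration):

```python
# Hypothetical mapping from a `compression` value to a file suffix.
# Only "gzip" and the empty string are mentioned in this issue.
SUFFIXES = {"gzip": "gz", "": ""}


def expected_source_name(dataset, file_type, compression):
    """Return the file name a source is expected to have after download."""
    suffix = SUFFIXES[compression]
    base = "{}-src.{}".format(dataset, file_type)
    # An empty suffix means "not compressed", so no extra extension is added.
    return base + "." + suffix if suffix else base


print(expected_source_name("pombase", "gaf", "gzip"))  # pombase-src.gaf.gz
print(expected_source_name("pombase", "gpad", ""))     # pombase-src.gpad
```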
Here's where the bug enters. The above path construction is general over whatever exists in the `compression` field if not None. So for the above gpi case, paths that are generated upon download look like `pombase-src.gpad.` -- with a trailing period.
This then affects all other path calculations. For example, the downloader can be instructed to ensure that every source is both zipped and unzipped for convenience. In this case, for the above entry, `pombase-src.gpad.` zipped becomes `pombase-src.gpad..gz` -- with two periods between gpad and gz.
So far this hasn't affected anyone because the pombase change above is fairly recent, the pipeline currently does not deal in gpad/gpi sources, and all other sources are provided zipped. As we switch to gpi/gpad, though, this will become an issue with pombase, and it would also show up if anyone decided to provide their gaf uncompressed.
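A minimal sketch of the failure mode, assuming the path is built with a format string guarded by an `is not None` check. The function names here are hypothetical and are not taken from the downloader; the point is only the difference between a None check and a truthiness check:

```python
def buggy_name(dataset, file_type, compression_suffix):
    # Treats every non-None value, including "", as a real suffix.
    if compression_suffix is not None:
        return "{}-src.{}.{}".format(dataset, file_type, compression_suffix)
    return "{}-src.{}".format(dataset, file_type)


def fixed_name(dataset, file_type, compression_suffix):
    # Truthiness check: None and "" both mean "no compression suffix".
    if compression_suffix:
        return "{}-src.{}.{}".format(dataset, file_type, compression_suffix)
    return "{}-src.{}".format(dataset, file_type)


def zipped_name(path):
    # Name of the gzipped copy kept alongside the unzipped file.
    return path + ".gz"


bad = buggy_name("pombase", "gpad", "")
print(bad)               # pombase-src.gpad.
print(zipped_name(bad))  # pombase-src.gpad..gz

good = fixed_name("pombase", "gpad", "")
print(good)               # pombase-src.gpad
print(zipped_name(good))  # pombase-src.gpad.gz
```

Checking for truthiness rather than `is not None` would keep the empty string valid in the schema while avoiding the stray period in derived paths.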