ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0
96 stars, 93 forks

Example data is out of date #998

Closed: gabrielsaldana closed 8 years ago

gabrielsaldana commented 8 years ago

The example-data.tar file linked from the documentation, http://www.well.ox.ac.uk/~jk/ga4gh-example-data-v3.0.tar, is out of date. It is missing the ontologies folder and the datasets/brca1/phenotypes folder.
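A quick way to confirm which folders a given tarball lacks is to list its members and compare them against the expected paths. The following is only a sketch: the function name `missing_paths` is made up for illustration, and `strip_components=1` assumes the archive unpacks into a single top-level directory, which may not hold for every tarball.

```python
import tarfile
from pathlib import PurePosixPath


def missing_paths(tar_path, expected, strip_components=1):
    """Return the entries of `expected` that do not appear in the tarball.

    `strip_components` drops that many leading path components from each
    member name (assumption: the archive has one top-level directory).
    """
    present = set()
    with tarfile.open(tar_path) as tar:
        for name in tar.getnames():
            parts = PurePosixPath(name).parts[strip_components:]
            # Record every ancestor too, because a tarball does not have
            # to list directories as separate members.
            for i in range(1, len(parts) + 1):
                present.add("/".join(parts[:i]))
    return [path for path in expected if path not in present]
```

Calling `missing_paths("ga4gh-example-data-v3.0.tar", ["ontologies", "datasets/brca1/phenotypes"])` on a downloaded copy would return whichever of the two folders is absent.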

david4096 commented 8 years ago

The phenotypes folder will need to be folded in with G2P so shouldn't be part of example data, yet. Ontologies, however, does need to be present since it's used by variant annotation.

jeromekelleher commented 8 years ago

Do we have some example data to ship for VA @david4096? We should integrate this into the download_data script so it's easy to generate the example data.

david4096 commented 8 years ago

Yeah there are two annotated VCFs in the compliance test data: https://github.com/ga4gh/compliance/tree/master/test-data

jeromekelleher commented 8 years ago

OK, so what we need to do is update the download_data script to pull these from some public location and create the correct directory hierarchy. Would anyone like to do this?
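Roughly, the step described here could look like the sketch below. It is not the actual download_data script: the function name `populate`, the `layout` mapping, and the injectable `fetch` parameter are all assumptions made for illustration, and the real directory hierarchy and file URLs would need to be filled in.

```python
import os
import urllib.request


def populate(base_dir, layout, fetch=urllib.request.urlretrieve):
    """Download each URL in `layout` to its relative path under `base_dir`.

    `layout` maps a relative destination path to a source URL (hypothetical
    structure). `fetch` takes (url, destination) and defaults to
    urllib.request.urlretrieve; it can be swapped out for testing.
    """
    written = []
    for rel_path, url in sorted(layout.items()):
        dest = os.path.join(base_dir, rel_path)
        # Create the directory hierarchy before writing the file.
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        if not os.path.exists(dest):  # skip files that were already fetched
            fetch(url, dest)
        written.append(dest)
    return written


# Hypothetical usage; the path and URL below are placeholders, not real files:
# populate("ga4gh-example-data", {
#     "datasets/example/variants/annotated/sample.vcf.gz":
#         "https://example.org/test-data/sample.vcf.gz",
# })
```

Keeping the layout in one mapping means the same description can drive both the download and a later check that the generated tarball has the expected contents.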

kozbo commented 8 years ago

Assigning to Gabriel as this is right on his path anyway, though he is not in the assignee list. @jeromekelleher, should he be added?

jeromekelleher commented 8 years ago

@kozbo I've invited @gabrielsaldana to join the Global Alliance github group, and added him to the "Global Alliance Contributors" team. He should be on the list of potential assignees once he accepts the invitation.

macieksmuga commented 8 years ago

The example data for sequence annotations can already be found in the tests/data/datasets/dataset1/sequenceAnnotations corner of the server repo, so please queue that for chucking into the example data package once the sequence annotations code is accepted.

david4096 commented 8 years ago

How about adding the test data to a release @jeromekelleher ? (thanks @diekhans)

https://help.github.com/articles/creating-releases/

jeromekelleher commented 8 years ago

What have you in mind @david4096? I don't think we should distribute example data with the pypi package.

gabrielsaldana commented 8 years ago

I'm guessing @david4096 is referring to generating downloadable zip files of "releases" from an example-data repository @jeromekelleher instead of the current fixed location mentioned at the top of this thread.

david4096 commented 8 years ago

You can include large files in a release. You would tag a data release, or make a data branch: https://help.github.com/articles/distributing-large-binaries/

jeromekelleher commented 8 years ago

I'm perfectly happy with making the data available somewhere other than the current server. I'm not in favour of rolling the example data in with the distribution tarball on pypi (which is not the same thing as the release tarball created on github). Entirely open to ideas otherwise though.

david4096 commented 8 years ago

What do you think about tagging a data release on this repo @jeromekelleher ?

jeromekelleher commented 8 years ago

This will make this repo start inflating pretty quickly though, won't it? If we store the tarball within the repo then we're going to add > 1M to the size of the repo each time we update it. I think this would start adding up pretty quickly, and clones would become slow (this would be bad for CI checks as well as being annoying for devs and for GitHub).

We could try using github pages or something, and using that as our web server? Or make a different repo?

david4096 commented 8 years ago

Sorry if I wasn't being clear, you can add data to a release that isn't a part of your repository's code base.

For example: https://github.com/david4096/server-1/releases/tag/data

jeromekelleher commented 8 years ago

Ah, OK, gotcha. So, how do you provide the data, just upload the file on a web form?

david4096 commented 8 years ago

@jeromekelleher Yeah, it's a drag and drop at the bottom of the description box.

This is for the current release:

https://github.com/david4096/server-1/releases/download/0.2.1-data/ga4gh-example-data.zip

This is for a release after the ontology maps PR https://github.com/ga4gh/server/pull/980.

https://github.com/david4096/server-1/releases/download/moredata/ga4gh-example-data.zip

david4096 commented 8 years ago

You can include the extra data with each release, or use a single "data release" that is refreshed (by deleting the zip and readding it) so the documentation doesn't have to be updated for each release.
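To illustrate why a single refreshed data release keeps the documentation stable: judging by the links above, the download URL depends only on the owner, repository, tag, and asset name. The helper below is a sketch with a made-up name (`release_asset_url`), and the URL pattern is inferred from those example links rather than taken from GitHub's documentation.

```python
def release_asset_url(owner, repo, tag, asset):
    """Build a release download URL following the pattern of the links above.

    Assumption: the pattern is inferred from the example URLs in this thread.
    """
    return "https://github.com/{}/{}/releases/download/{}/{}".format(
        owner, repo, tag, asset
    )


# Per-release tags change the URL every time, so the docs need updating:
per_release = release_asset_url(
    "david4096", "server-1", "0.2.1-data", "ga4gh-example-data.zip"
)

# One fixed tag (here "data", as in the example release linked earlier) gives
# a URL that stays the same when the attached file is deleted and re-added:
fixed = release_asset_url("david4096", "server-1", "data", "ga4gh-example-data.zip")
```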

jeromekelleher commented 8 years ago

OK, looks good @david4096, let's go with this approach. Can you create the data release please?

-1 on changing to zip files though. Nearly all the data in there is already compressed, so there's no point in compressing it again (just takes longer to extract). Also, tarballs are much more ... Unix anyway.
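The point about recompression can be checked with a small experiment. This is a sketch on synthetic data, not on the real example files: it gzips a random ACGT sequence once, then gzips the result again, and reports both size ratios. The function name and parameters are invented for illustration.

```python
import gzip
import random


def recompression_ratios(n_bases=200_000, seed=0):
    """Return (first_pass_ratio, second_pass_ratio) for synthetic data.

    first_pass_ratio is compressed size / raw size;
    second_pass_ratio is doubly compressed size / singly compressed size.
    """
    rng = random.Random(seed)
    raw = "".join(rng.choice("ACGT") for _ in range(n_bases)).encode("ascii")
    once = gzip.compress(raw)
    twice = gzip.compress(once)
    return len(once) / len(raw), len(twice) / len(once)
```

The first ratio should come out well below 1 and the second close to 1, which is why a plain, uncompressed tar (for instance `tarfile.open(path, "w")` in Python) is enough for files that are already compressed.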

jeromekelleher commented 8 years ago

@david4096, I'm looking at the download_data script and it doesn't seem to have any code for downloading the data in question. We should keep this script in sync with (and use it to generate) the actual data tarball. If I'm not missing something, can we create an issue to update the script as a prerequisite to making the data release?