Closed gabrielsaldana closed 8 years ago
The phenotypes folder will need to be folded in with G2P, so it shouldn't be part of the example data yet. The ontologies folder, however, does need to be present since it's used by variant annotation.
Do we have some example data to ship for VA @david4096? We should integrate this into the download_data script so it's easy to generate the example data.
Yeah there are two annotated VCFs in the compliance test data: https://github.com/ga4gh/compliance/tree/master/test-data
OK, so what we need to do is update the download_data script to pull these from some public location and create the correct directory hierarchy. Would anyone like to do this?
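A minimal sketch of what that update could look like, written for Python 3. The function names (`fetch_url`, `build_example_data`) and the shape of the `sources` mapping are hypothetical and are not taken from the existing download_data script; the real destination paths and source URLs would still need to be decided.

```python
import os
import shutil
import urllib.request


def fetch_url(url, dest_path):
    """Download url to dest_path (plain urllib, no retries or checksums)."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


def build_example_data(root, sources, fetch=fetch_url):
    """
    Create the directory hierarchy under root and fetch each file into it.

    sources maps a path relative to root (hypothetical layout) to the URL
    the file is downloaded from. Returns the list of files written.
    """
    written = []
    for rel_path, url in sorted(sources.items()):
        dest_path = os.path.join(root, rel_path)
        parent = os.path.dirname(dest_path)
        if parent:
            # Create intermediate directories; existing ones are left alone.
            os.makedirs(parent, exist_ok=True)
        fetch(url, dest_path)
        written.append(dest_path)
    return written
```

Passing `fetch` as a parameter keeps the hierarchy logic testable without network access.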
Assigning to Gabriel as this is right on his path anyway (though he is not in the assignee list; @jeromekelleher, should he be added?)
@kozbo I've invited @gabrielsaldana to join the Global Alliance github group, and added him to the "Global Alliance Contributors" team. He should be on the list of potential assignees once he accepts the invitation.
The example data for sequence annotations can already be found in the tests/data/datasets/dataset1/sequenceAnnotations corner of the server repo, so please queue that for chucking in to the example data package once the sequence annotations code is accepted.
How about adding the test data to a release @jeromekelleher ? (thanks @diekhans)
What have you in mind @david4096? I don't think we should distribute example data with the pypi package.
I'm guessing @david4096 is referring to generating downloadable zip files of "releases" from an example-data repository @jeromekelleher instead of the current fixed location mentioned at the top of this thread.
You can include large files in a release. You would tag a data release, or make a data branch https://help.github.com/articles/distributing-large-binaries/
I'm perfectly happy with making the data available somewhere other than the current server. I'm not in favour of rolling the example data in with the distribution tarball on pypi (which is not the same thing as the release tarball created on github). Entirely open to ideas otherwise though.
What do you think about tagging a data release on this repo @jeromekelleher ?
This will make this repo start inflating pretty quickly though, won't it? If we store the tarball within the repo then we're going to add > 1M to the size of the repo each time we update it. I think this would start adding up pretty quickly, and clones would become slow (this would be bad for CI checks as well as being annoying for devs and for GitHub).
We could try using github pages or something, and using that as our web server? Or make a different repo?
Sorry if I wasn't being clear, you can add data to a release that isn't a part of your repository's code base.
For example: https://github.com/david4096/server-1/releases/tag/data
Ah, OK, gotcha. So, how do you provide the data, just upload the file on a web form?
@jeromekelleher Yeah, it's a drag and drop at the bottom of the description box.
This is for the current release:
https://github.com/david4096/server-1/releases/download/0.2.1-data/ga4gh-example-data.zip
This is for a release after the ontology maps PR https://github.com/ga4gh/server/pull/980.
https://github.com/david4096/server-1/releases/download/moredata/ga4gh-example-data.zip
docs/installation.rst
You can include the extra data with each release, or use a single "data release" that is refreshed (by deleting the zip and readding it) so the documentation doesn't have to be updated for each release.
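To illustrate the single "data release" idea: the download URLs above follow the pattern `releases/download/<tag>/<asset>`, so a fixed tag gives a URL that stays the same when the asset is replaced. A small sketch, where `release_asset_url` is a hypothetical helper and not part of any existing script:

```python
def release_asset_url(owner, repo, tag, asset):
    """Build the download URL for a file attached to a GitHub release."""
    return "https://github.com/{}/{}/releases/download/{}/{}".format(
        owner, repo, tag, asset
    )


# Reproduces the URL given above for the current release.
url = release_asset_url(
    "david4096", "server-1", "0.2.1-data", "ga4gh-example-data.zip"
)
```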
OK, looks good @david4096, let's go with this approach. Can you create the data release please?
-1 on changing to zip files though. Nearly all the data in there is already compressed, so there's no point in compressing it again (just takes longer to extract). Also, tarballs are much more ... Unix anyway.
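For reference, a plain tarball with no extra compression layer can be produced with Python's standard `tarfile` module. This is only a sketch; `make_tarball` is a hypothetical name, and the data directory is assumed to be laid out on disk already.

```python
import os
import tarfile


def make_tarball(data_dir, tar_path):
    """Pack data_dir into an uncompressed tar archive at tar_path."""
    # Mode "w" writes a plain tar with no gzip or bzip2 layer, on the
    # assumption that most of the contents are compressed already.
    top_level = os.path.basename(os.path.normpath(data_dir))
    with tarfile.open(tar_path, "w") as tar:
        # arcname keeps only the top-level directory name in the archive,
        # not the full path on the machine that built it.
        tar.add(data_dir, arcname=top_level)
    return tar_path
```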
@david4096, I'm looking at the download_data script and it doesn't seem to have any code for downloading the data in question. We should keep this script in sync with (and use it to generate) the actual data tarball. If I'm not missing something, can we create an issue to update the script as a prerequisite to making the data release?
The example-data.tar file in the documentation at http://www.well.ox.ac.uk/~jk/ga4gh-example-data-v3.0.tar is out of date. It is missing the ontologies folder and the datasets/brca1/phenotypes folder.
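A quick way to check a tarball for missing folders like these is to list its members and look for each required path. The sketch below assumes the name of the tarball's top-level directory is not known, so it matches a required path at any depth; `missing_paths` is a hypothetical helper.

```python
import tarfile


def missing_paths(tar_path, required):
    """Return the required directory paths that have no entry in the tarball."""
    with tarfile.open(tar_path) as tar:
        names = [name.strip("/") for name in tar.getnames()]
    missing = []
    for path in required:
        needle = "/" + path.strip("/") + "/"
        # Match the path as a run of whole path segments anywhere in a
        # member name, so an unknown top-level directory is tolerated.
        if not any(needle in "/" + name + "/" for name in names):
            missing.append(path)
    return missing
```

For the report above, the call would be `missing_paths(tar_path, ["ontologies", "datasets/brca1/phenotypes"])` with `tar_path` pointing at a downloaded copy of the tarball.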