Add sample datasets that work out-of-the-box with the pipeline

NCBI-Hackathons / Pharmacogenomics_Prediction_Pipeline_P3

Creative Commons Zero v1.0 Universal

Add sample datasets that work out-of-the-box with the pipeline #5

Closed khughitt closed 9 years ago

khughitt commented 9 years ago

If possible, it would be nice to ship, or automatically download, some small sample datasets so that users can test the pipeline even if they don't have any data of their own.

This will also be useful for development (in particular testing).

There are numerous datasets already on Bioconductor (e.g. http://www.bioconductor.org/packages/release/data/experiment/), so perhaps we can find a few there that would fit our purposes.

Incorporating external datasets would also be a good way to ensure that the pipeline is actually generalizable to other datasets and doesn't depend too much on specific features of any one dataset.

ammichalowski commented 9 years ago

I'll prepare the sample datasets

khughitt commented 9 years ago

Great!

If you are going to include your own data, we could put it in either samples/ or doc/samples?

seandavi commented 9 years ago

If the files are large, it might be better to NOT include them in the git repository. Perhaps a subset would be better?

Sean

khughitt commented 9 years ago

Agreed. I would try to only include files smaller than ~1 MB. This is why it may be good (if possible) to use datasets already on Bioconductor -- then the library can just be loaded and the sample dataset will be available. That does require, however, that appropriate datasets we can use are already out there.
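
As a rough illustration of the size guideline, here is a minimal sketch that lists files above a cutoff before they get committed. The function name `oversized_files` and the exact 1,000,000-byte threshold are assumptions made for this example; nothing like it exists in the repository.

```python
from pathlib import Path

# Assumed cutoff, taken from the "~1 MB" suggestion in this thread.
MAX_BYTES = 1_000_000


def oversized_files(root, max_bytes=MAX_BYTES):
    """Return sorted (relative path, size in bytes) pairs for files under
    `root` that are larger than `max_bytes`."""
    root = Path(root)
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file() and p.stat().st_size > max_bytes
    )


if __name__ == "__main__":
    # Hypothetical usage: report anything too large in the current directory.
    for name, size in oversized_files("."):
        print(f"{name}\t{size} bytes")
```

A check like this could run in a test or pre-commit step, but where to hook it in is left open here.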

seandavi commented 9 years ago

If you have data, we can easily make a Bioconductor data package.

ammichalowski commented 9 years ago

Since our study has not been published yet, it seems for now we could use a small subset of permuted data with mock cell line/compound names.

I've prepared some example data and copied it to sample_in_progress (raw.zip, ~40 KB). I've also added a README file there.
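
For readers wondering what "permuted data with mock names" could look like, here is a minimal sketch. The thread does not describe the actual permutation scheme, so shuffling values within each row is an assumption, and the function name `anonymize` and the `COMPOUND_`/`CELL_LINE_` labels are hypothetical.

```python
import random


def anonymize(matrix, row_ids, col_ids, seed=0):
    """Shuffle the values within each row and swap real identifiers for mock ones.

    `matrix` is a list of rows (lists of numbers). The input is not modified.
    Row-wise shuffling is an assumed scheme, not the one used for raw.zip.
    """
    rng = random.Random(seed)  # fixed seed so the sample data is reproducible
    permuted = []
    for row in matrix:
        shuffled = list(row)   # copy so the caller's data stays intact
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    # Mock names depend only on position, so no real identifier leaks through.
    mock_rows = [f"COMPOUND_{i + 1:03d}" for i in range(len(row_ids))]
    mock_cols = [f"CELL_LINE_{j + 1:03d}" for j in range(len(col_ids))]
    return permuted, mock_rows, mock_cols
```

Shuffling within rows keeps each row's value distribution while breaking the link between a value and its original column; a different scheme may be preferable depending on what the tests need to exercise.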

daler commented 9 years ago

Thanks, this is great.

The unzipped size is still pretty small (212K). I also subsetted the GO, MSIG, and CPDB annotation files to only have data for the 100 genes in the example data. Everything together weighs in at 760K uncompressed.

To make things easier for version control and testing, are there any objections to unzipping these 760K into an "example_data" dir in the top-level dir of the repo?
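
In case it helps anyone regenerate the subsets, the filtering can be sketched roughly as below. The real GO, MSIG, and CPDB file layouts are not described in this thread, so this assumes a tab-delimited file with one header row and the gene identifier in a single column; `subset_annotations` and `gene_column` are hypothetical names.

```python
import csv


def subset_annotations(in_path, out_path, keep_genes, gene_column=0):
    """Copy the header plus every row whose gene is in `keep_genes`.

    Assumes a tab-delimited file with a header row (an assumption, not the
    confirmed format of the annotation files). Returns the number of data
    rows written.
    """
    keep = set(keep_genes)
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            # Skip blank lines and rows for genes outside the example set.
            if row and row[gene_column] in keep:
                writer.writerow(row)
                kept += 1
    return kept
```

Calling it once per annotation file with the 100 example gene IDs would produce the reduced files; formats that list several genes per line would need a different filter.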

seandavi commented 9 years ago

You could also use S3 versioning:

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketPUTVersioningStatus.html

Sean

khughitt commented 9 years ago

Done. Thanks, @ammhub!