Closed: khughitt closed this issue 9 years ago.
I'll prepare the sample datasets
Great!
If you are going to include your own data, we could put it in either samples/ or doc/samples?
If the files are large, it might be better to NOT include them in the git repository. Perhaps a subset would be better?
Sean
On Fri, Aug 7, 2015 at 7:51 AM, Keith Hughitt notifications@github.com wrote:
Agreed. I would try to only include files less than ~1 MB. This is why it may be good (if possible) to use datasets already on Bioconductor -- then the library can just be loaded and the sample dataset will be available. It requires, however, that appropriate datasets already exist that we can use.
If you have data, we can easily make a Bioconductor data package.
Since our study has not been published yet, for now we could use a small subset of permuted data with mock cell line/compound names.
I've prepared some example data and copied it to sample_in_progress (raw.zip, ~40 KB). I've also added a README file there.
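The permutation step described above could be sketched roughly like this (the matrix layout, sample names, and values are all hypothetical; the actual files in raw.zip may be structured differently):

```python
import random

# Hypothetical drug-response matrix: rows are cell lines, columns are compounds.
response = {
    "sample_01": [0.12, 0.85, 0.40],
    "sample_02": [0.55, 0.10, 0.93],
    "sample_03": [0.71, 0.33, 0.27],
}

rng = random.Random(42)  # fixed seed so the example data is reproducible

# Replace real identifiers with mock cell line names.
mock_cell_lines = {name: f"cellline_{i:03d}"
                   for i, name in enumerate(sorted(response), 1)}

# Permute values within each row so no real measurement survives intact.
permuted = {}
for name, values in response.items():
    shuffled = values[:]
    rng.shuffle(shuffled)
    permuted[mock_cell_lines[name]] = shuffled

print(sorted(permuted))  # ['cellline_001', 'cellline_002', 'cellline_003']
```

The same idea (shuffle values, swap in mock identifiers) would apply to the compound names as well.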
Thanks, this is great.
The unzipped size is still pretty small (212K). I also subsetted the GO, MSIG, and CPDB annotation files to only have data for the 100 genes in the example data. Everything together weighs in at 760K uncompressed.
To make things easier for version control and testing, are there any objections to unzipping these 760K into an "example_data" dir in the top-level dir of the repo?
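The annotation subsetting described above amounts to keeping only rows whose gene appears in the example data; a minimal sketch (gene names and the tab-separated layout are assumptions, not the real GO/MSIG/CPDB formats):

```python
# The genes present in the example data (in practice, the 100 example genes).
example_genes = {"GENE_A", "GENE_C"}

# Hypothetical annotation rows: gene ID, term ID, term category.
annotation_rows = [
    "GENE_A\tGO:0008150\tbiological_process",
    "GENE_X\tGO:0003674\tmolecular_function",
    "GENE_C\tGO:0005575\tcellular_component",
]

# Keep only rows whose first column is one of the example genes.
subset = [row for row in annotation_rows
          if row.split("\t")[0] in example_genes]

print(len(subset))  # rows for genes outside the example set are dropped
```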
You could also use S3 versioning:
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketPUTVersioningStatus.html
Sean
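For reference, versioning can be enabled on an existing bucket with the AWS CLI (the bucket name here is a placeholder):

```shell
aws s3api put-bucket-versioning \
    --bucket my-example-bucket \
    --versioning-configuration Status=Enabled
```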
On Mon, Aug 10, 2015 at 11:56 AM, Ryan Dale notifications@github.com wrote:
Done. Thanks, @ammhub!
If possible, it might be nice to ship, or automatically download some small sample datasets so that users can test out the pipeline even if they don't have any data.
This will also be useful for development (in particular testing).
There are numerous datasets already on Bioconductor (e.g. http://www.bioconductor.org/packages/release/data/experiment/), so perhaps we can find a few there that would fit our purposes.
Incorporating external datasets would also be a good way to ensure that the pipeline is actually generalizable to other datasets and doesn't depend too much on specific features of any one dataset.