need a dataset - Githubissues

tracykteal commented 9 years ago

[ ] need raw data (not yet quality-filtered) for at least two genomes
[ ] need a SNP matrix
[ ] data and metadata should be available on Dryad or FigShare

tracykteal commented 9 years ago

One idea is to use the Lenski dataset from the paper on finding the citrate mutations

Paper: http://www.nature.com/nature/journal/v489/n7417/full/nature11514.html

Dryad: http://datadryad.org/resource/doi:10.5061/dryad.8q6n4

JasonJWilliamsNY commented 9 years ago

I think the Lenski data set is perfect. All the raw data are available in the SRA (http://www.ncbi.nlm.nih.gov/bioproject/PRJNA188723) and we can guide students through a subset for several comparative genomics applications and visualizations.

hlapp commented 9 years ago

I think the Lenski data set is perfect.

:+1:

tracykteal commented 9 years ago

It turns out @ctb has a lesson on variant calling using the Lenski data set

http://2013-cse801.readthedocs.org/en/latest/hw5.html

It would be great to have an ecosystem of lessons like this around this data that we can point people to even if we're not including the topic in the workshop.

juanu commented 9 years ago

I think this is a great dataset! My only question/concern is that comparative genomics and variant calling are more of an application, but the dataset nevertheless shows the complete process from raw data to results/visualization. Depending on the students' background with Python and/or R, visualization could be one of the great challenges with this type of dataset.

I also like the idea of lessons related to this research. There are a few nice pipelines that do this process for you, such as Snippy (https://github.com/tseemann/snippy), but it will be of great value to make the whole process transparent to the students and show how things work.

k8hertweck commented 9 years ago

I think this dataset will work well. As a followup to @juanu's comment, it's likely that the biological questions in this dataset will be very different from those of most learners (who may be working in non-model systems, etc.), but the nature of SNP calling means we'll be able to model an entire workflow during the workshop (which is great!). I think we should be sure to include some language throughout the lessons to reinforce transferability of methods/data types/etc. from these lessons to other projects (e.g., we should always perform quality assessment, and these file formats occur in most analyses).
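As a rough illustration of the kind of quality assessment that transfers across projects, here is a minimal sketch that reads FASTQ records and computes a mean quality score per read. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; the two reads are made up for the example and are not taken from the Lenski data.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data we need
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' from the read name


def mean_quality(qual, offset=33):
    """Mean Phred score of one quality string (offset 33 is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


# Hypothetical two-read example; real files would be opened with open() or gzip.open().
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
scores = {name: mean_quality(qual) for name, _, qual in read_fastq(example)}
```

In a lesson, a dedicated quality-control tool would normally do this job; the sketch only shows what the file format contains.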

sjmillerAZ commented 9 years ago

Were you planning on including some of the paired-end reads? I think we had paired reads in mind when we worked on the Data Wrangling topics.

williamsmicrobegenome commented 9 years ago

Looks great to me! I agree that this dataset provides a strong framework and narrative for the different lessons. From a data organization standpoint, I think this dataset also shows the importance of a clear organization system for projects. For example, BioSample SAMN01095549 is a mutant of a clone isolated from a certain timepoint, etc. etc. Learners can see how easily a "simple" experiment can snowball into lots of samples and how considering organization at the beginning can prevent pain in the future.

tracykteal commented 9 years ago

Thanks @williamsmicrobegenome I hadn't even thought of it from the data organization perspective. That will be great to work in.

@sjmillerAZ the dataset includes both paired-end and single-end reads, so we can do both if it doesn't seem too confusing, or just paired-end if that makes the most sense.
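One way to keep a mix of paired-end and single-end files straight is to group them by run before any analysis. The sketch below assumes the paired files end in `_1.fastq.gz` and `_2.fastq.gz` and single-end files have no mate suffix; that naming and the example file names are assumptions for illustration, not a description of how the workshop files are actually named.

```python
import re
from collections import defaultdict

# Assumed naming: <run>_1.fastq(.gz) / <run>_2.fastq(.gz) for pairs, <run>.fastq(.gz) otherwise.
FASTQ_NAME = re.compile(r"(?P<run>[^_]+?)(?:_(?P<mate>[12]))?\.fastq(?:\.gz)?")


def group_reads(filenames):
    """Split FASTQ file names into paired-end and single-end runs."""
    by_run = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.fullmatch(name)
        if match is None:
            continue  # not a FASTQ file under the assumed naming scheme
        by_run[match["run"]][match["mate"] or "single"] = name

    paired, single = {}, {}
    for run, files in by_run.items():
        if "1" in files and "2" in files:
            paired[run] = (files["1"], files["2"])
        elif "single" in files:
            single[run] = files["single"]
        # a run with only one mate file is left out of both groups
    return paired, single


# Hypothetical file names for illustration only.
paired, single = group_reads(
    ["runA_1.fastq.gz", "runA_2.fastq.gz", "runB.fastq.gz", "notes.txt"]
)
```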

tracykteal commented 9 years ago

@k8hertweck yes, the questions are likely different, and as you say, I think the point is to emphasize the universality of the workflow. In fact, we will probably skip doing the SNP calling in class and just go from the quality filtering to the SNP calling output matrix, saying 'insert your bioinformatics workflow here', so it doesn't seem too specific. We can see how this works though.

Ideally we'd also be able to point them at good documentation or tutorials that we're aware of on various bioinformatics workflows.
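For readers unfamiliar with the term, a SNP matrix can be pictured as a presence/absence table with one row per variant site and one column per sample. The sketch below builds such a table from per-sample variant calls; the sample names, positions, and bases are invented for illustration, and the actual matrix format used in the workshop may differ.

```python
def snp_matrix(calls):
    """Build a presence/absence matrix from {sample: set of (position, alt_base)}.

    Returns (samples, sites, rows), where rows[i][j] is 1 if sample j
    carries the variant at site i and 0 otherwise.
    """
    samples = sorted(calls)
    sites = sorted(set().union(*calls.values()))
    rows = [[int(site in calls[s]) for s in samples] for site in sites]
    return samples, sites, rows


# Hypothetical calls, not real results from the Lenski data.
calls = {
    "sampleA": {(100, "T"), (250, "G")},
    "sampleB": {(250, "G")},
}
samples, sites, rows = snp_matrix(calls)
```

Starting a lesson from a table like this lets learners practice filtering and plotting without depending on any particular variant caller.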

devbioinfoguy commented 9 years ago

I think all the above comments are great. And the paper looks very interesting as well -- 25 years! Pretty amazing! One naive question -- are we having them grab just one of the samples, or all 4 (at this SRA entry)? Cloud VMs/HPC sites will already have this downloaded for them, yes?

jrherr commented 9 years ago

I know I am really late to the party here, but I am in agreement with the datasets (especially the Lenski data) and think they are great examples. I also think the points brought up here by everyone are really good points.

My only concern is the timing of the exercises -- do we need to sample down the data so that this can be covered in a two-day workshop, and how will that affect the results the participants will get? I guess this will all be worked out in test runs of the tutorials.

ctb commented 9 years ago

On Sun, Jun 14, 2015 at 12:42:25PM -0700, Josh Herr wrote:

My only concern is the timing of the exercises -- do we need to sample down the data so that this can be covered in a two-day workshop, and how will that affect the results the participants will get? I guess this will all be worked out in test runs of the tutorials.

The data analysis doesn't take that long to run, and you can't downsample the data much without dropping the SNPs.
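A back-of-the-envelope way to see why downsampling is limited: mean coverage scales linearly with the number of reads kept, and variant callers generally need several reads over a site to call it. The sketch below uses the standard estimate coverage = reads × read length / genome size. All numbers in the usage lines are placeholders chosen for easy arithmetic, not measurements from this dataset.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Mean depth of coverage, assuming reads are spread evenly over the genome."""
    return n_reads * read_length / genome_size


def coverage_after_downsampling(n_reads, read_length, genome_size, fraction):
    """Mean coverage left after keeping a random fraction of the reads."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return expected_coverage(n_reads, read_length, genome_size) * fraction


# Placeholder values for illustration only.
full = expected_coverage(n_reads=1_000_000, read_length=100, genome_size=5_000_000)
quarter = coverage_after_downsampling(1_000_000, 100, 5_000_000, fraction=0.25)
```

The even-spread assumption is optimistic; real coverage varies along the genome, so some sites fall below the mean sooner than this estimate suggests.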

devbioinfoguy commented 9 years ago

Thanks, Josh, for bringing up the thought of downsampling — that was on my mind as well.

And thanks, Titus, for the clarification.

-b

tracykteal commented 9 years ago

It seems like there is general agreement that this dataset will work! I started a page about the dataset https://github.com/datacarpentry/genomics-workshop/blob/gh-pages/dataset.md and am getting the data files together.

ctb commented 9 years ago

Note to @tracykteal et al: you can download FASTQ files directly from the European Nucleotide Archive; see the files link here:

http://www.ebi.ac.uk/ena/data/view/SRR098042

apawlik commented 9 years ago

Just a question: are we storing all data on figshare, or creating a repository (like https://github.com/datacarpentry/data-ecology) for some small, manually created sample datasets? @acorpuz created some exercises and small sample files to go with them. For now I've created https://github.com/datacarpentry/data-genomics and will ask @acorpuz to put his files there - we can move them later to figshare.

Sorry if I missed the decision emails (again :-( )!

tracykteal commented 9 years ago

While lessons are being reviewed & developed, we're storing them in temporary public locations. Then we'll have them in a public repository somewhere.

See https://github.com/datacarpentry/planning/issues/13

datacarpentry / genomics-workshop

need a dataset #1