datacarpentry / organization-genomics

Project Organization and Management for Genomics
https://datacarpentry.org/organization-genomics

update example spreadsheet in 01-tidiness #56

Closed tracykteal closed 6 years ago

tracykteal commented 6 years ago

In the 01-tidiness lesson, we have an example of a spreadsheet and ask learners to find some things that are wrong with it. The example spreadsheet is field data. It would be better to have some metadata that is more like what people would be using in a genomics experiment. So, we could create a more relevant messy spreadsheet for this exercise.

davis68 commented 6 years ago

In addition, it would be nice to have a cleaned version of this spreadsheet ready to show.

hoytpr commented 6 years ago

Would you want a genomics data spreadsheet that fits closely with the lesson, or one that shows common spreadsheet issues that people working with genomics data often face? It's common to have 30,000 items in any column, and mistakes in reformatted data can leave "bad" or missing entries. It's not realistic to scroll through tens of thousands of rows, so ways to check for bad entries are important: for example, using functions like "COUNTBLANK", converting numbers stored as text, or quickly finding duplicates defined by more than one column. Is this what people would want? It's not exactly the same as Tidy Data. If you want to emulate the Tidy Data lesson, one could track DNA/exome sequencing experiments, RNA-seq, and annotation files/outputs.
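
As a rough illustration of the checks mentioned above (blank cells, numbers stored as text, duplicates across more than one column), here is a minimal Python sketch using only the standard library. The table, its column names (`sample_id`, `lane`, `read_count`, `organism`), and the helper functions are all made up for this example and are not taken from the lesson.

```python
import csv
import io
from collections import Counter

# Hypothetical messy metadata; the column names and values are invented for illustration.
MESSY_CSV = """sample_id,lane,read_count,organism
S1,1,30000,E. coli
S2,1,,E. coli
S3,2,"12,500",E. coli
S1,1,28000,
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_blank(rows, column):
    """Count empty or whitespace-only cells, similar in spirit to COUNTBLANK."""
    return sum(1 for row in rows if not (row[column] or "").strip())


def non_numeric(rows, column):
    """Return (line number, value) for non-blank cells that float() cannot parse."""
    bad = []
    for line_no, row in enumerate(rows, start=2):  # the header is line 1
        value = (row[column] or "").strip()
        if not value:
            continue
        try:
            float(value)
        except ValueError:
            bad.append((line_no, value))
    return bad


def duplicate_keys(rows, columns):
    """Return the column-value combinations that occur more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


rows = load_rows(MESSY_CSV)
print("blank read_count cells:", count_blank(rows, "read_count"))
print("non-numeric read_count cells:", non_numeric(rows, "read_count"))
print("duplicate (sample_id, lane):", duplicate_keys(rows, ["sample_id", "lane"]))
```

The same checks scale to tens of thousands of rows, since nothing here requires looking at the data by eye.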

johnsolk commented 6 years ago

This is the lesson being referenced: http://www.datacarpentry.org/organization-genomics/01-tidiness/

I was thinking it might be nice to have a sample sheet like the ones people submit to DNA sequencing facilities, which then has to be formatted specifically for the instrument, e.g. bcl2fastq.

I used to do this for people at the core where I used to work and have a few examples of messy and cleaned versions of sheets, but these are from private investigators and I would need to ask permission to use them. When I taught this lesson in Jan, I showed some of these sheets and they seemed to be an interesting example for participants.

johnsolk commented 6 years ago

@olgabot was kind enough to donate an example! Messy/clean versions of a sample spreadsheet for bcl2fastq (attached). I will work on adding them in a pull request for review.

https://twitter.com/olgabot/status/978748663873679361

OlgaBotvinnik_czbiohub_SampleSheet_Examples_DCGenomics.zip

hoytpr commented 6 years ago

That's a great contribution. I deal with sample sheets daily, and didn't think of this. There are other things to consider too, e.g. which columns define the 'project' vs. 'sample' and the folder structure that results.
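
To make the 'project' vs. 'sample' point concrete, here is a small Python sketch that groups samples by project and lists one folder per sample. It assumes a sheet with `Sample_ID`, `Sample_Name`, and `Sample_Project` columns; the rows are invented, and the `project/sample` layout is only an illustration, not the documented output of bcl2fastq, which depends on the tool version and settings.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample-sheet rows; column names are assumed and values are invented.
DATA_SECTION = """Sample_ID,Sample_Name,Sample_Project
A1,liver_rep1,ProjectX
A2,liver_rep2,ProjectX
B1,brain_rep1,ProjectY
"""


def samples_by_project(text):
    """Map each project name to the list of sample IDs assigned to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["Sample_Project"]].append(row["Sample_ID"])
    return dict(groups)


def planned_folders(groups):
    """List illustrative project/sample paths (not a real tool's output layout)."""
    return sorted(
        f"{project}/{sample}"
        for project, samples in groups.items()
        for sample in samples
    )


groups = samples_by_project(DATA_SECTION)
for path in planned_folders(groups):
    print(path)
```

A typo in the project column would show up here as an unexpected extra top-level folder, which is one reason a messy sample sheet makes a useful exercise.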

tracykteal commented 6 years ago

Closing this, as it was addressed by #60. Thanks @ljcohen!