tracykteal closed this 6 years ago
In addition, it would be nice to have a cleaned version of this spreadsheet ready to show.
Would you want a genomics data spreadsheet that fits closely with the lesson, or one that shows common spreadsheet issues that people working with genomics data often face? It's common to have 30,000 items in a column, and reformatted data can contain "bad" or missing entries. It's not realistic to scroll through tens of thousands of rows, so ways to check for bad entries are important: for example, using functions like "COUNTBLANK", converting numbers stored as text, or quickly finding duplicates defined by more than one column. Is this what people would want? It's not exactly the same as Tidy Data. If you want to emulate the Tidy Data lesson, one could track DNA/exome sequencing experiments, RNA-seq, and annotation files/outputs.
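For anyone who wants to run the same kinds of checks outside a spreadsheet program, here is a minimal Python sketch using only the standard library. The column names (`sample_id`, `lane`, `read_count`), the sample data, and the helper function names are all made up for illustration; they are not taken from the lesson or from any real sample sheet.

```python
import csv
import io
from collections import Counter

# Hypothetical metadata; column names and values are invented for illustration.
RAW = """sample_id,lane,read_count
S1,1,30000
S2,1,
S3,2,n/a
S1,1,31000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_blank(rows, column):
    """Count empty or whitespace-only cells in a column (similar in spirit to COUNTBLANK)."""
    return sum(1 for row in rows if not (row[column] or "").strip())


def non_numeric(rows, column):
    """Return (file line number, value) for non-blank cells that do not parse as numbers."""
    bad = []
    for line_no, row in enumerate(rows, start=2):  # the header is line 1
        value = (row[column] or "").strip()
        if not value:
            continue  # blanks are reported separately by count_blank
        try:
            float(value)
        except ValueError:
            bad.append((line_no, value))
    return bad


def duplicate_keys(rows, columns):
    """Return key combinations (across several columns) that occur more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


rows = load_rows(RAW)
print("blank read_count cells:", count_blank(rows, "read_count"))
print("non-numeric read_count cells:", non_numeric(rows, "read_count"))
print("duplicate (sample_id, lane) pairs:", duplicate_keys(rows, ["sample_id", "lane"]))
```

None of these checks require scrolling through the rows, so they should behave the same on four rows or 30,000. A real sheet would need its own column names and its own rules for what counts as a "bad" entry.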
This is the lesson being referenced: http://www.datacarpentry.org/organization-genomics/01-tidiness/
I was thinking it might be nice to have a sample sheet of the kind that people submit to DNA sequencing facilities, which then has to be formatted specifically for the instrument software, e.g. bcl2fastq.
I used to do this for people at the core where I used to work and have a few examples of messy and cleaned versions of sheets, but these are from private investigators and I would need to ask permission to use them. When I taught this lesson in January, I showed some of these sheets and they seemed to be an interesting example for participants.
@olgabot was kind enough to donate an example! Messy/clean versions of a sample spreadsheet for bcl2fastq (attached). I will work on adding them in a pull request for review.
That's a great contribution. I deal with sample sheets daily, and didn't think of this. There are other things to consider too, e.g. which columns define the 'project' vs. the 'sample', and the folder structure that results.
Closing this, as it was addressed by #60. Thanks @ljcohen!
In the 01-tidiness lesson, we have an example of a spreadsheet and ask learners to find some things that are wrong with it. The example spreadsheet is field data. It would be better to have some metadata that is more like what people would be using in a genomics experiment. So, we could create a more relevant messy spreadsheet for this exercise.