datacarpentry / R-genomics

Lesson on data analysis and visualization in R for genomics
http://datacarpentry.github.io/R-genomics

Consider using a richer dataset? #46

Open naupaka opened 7 years ago

naupaka commented 7 years ago

Having a richer dataset to analyze might allow us to better showcase the abilities of dplyr and ggplot. One candidate is a CSV version of the results of the VCF pipeline. We used this approach in a recent workshop at Stanford (script: parse_vcf.R, CSV: all_vcf.csv). The R script and the CSV it outputs are available here.

tracykteal commented 7 years ago

That's a great idea! I think @JasonJWilliamsNY had also tried something like this. We definitely should do this.

naupaka commented 7 years ago

I should also mention that the Rmd template file from issue #47 has some lines of code that parse information out of the nasty-looking INFO column, so there's even more to work with. I tried a couple of different VCF libraries to get things into a workable, tidy form. There is a vcfR2tidy function in the vcfR package, but I was unable to get it to do what I wanted, hence the hackish way we ended up doing it. I think vcfR2tidy would be the most straightforward way to produce this dataset, if we could make it work.
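
For anyone following along, here is a minimal base-R sketch of the kind of INFO-column parsing mentioned above. It is not the code from parse_vcf.R or the Rmd template; the data frame, its values, and the helper name `parse_info` are all made up for illustration. It assumes INFO entries are separated by `;`, that key/value entries look like `KEY=value`, and that entries without `=` are flags.

```r
# Hypothetical miniature of a VCF-derived table; values are invented for illustration.
vcf <- data.frame(
  CHROM = c("chr1", "chr1", "chr2"),
  POS   = c(101L, 250L, 77L),
  INFO  = c("DP=14;AF=0.5;INDEL", "DP=30;AF=1", "DP=8;AF=0.25"),
  stringsAsFactors = FALSE
)

# Split one INFO string into a named character vector.
# Flag entries with no "=" (e.g. INDEL) are given the value "TRUE".
parse_info <- function(info) {
  fields <- strsplit(info, ";", fixed = TRUE)[[1]]
  keys   <- sub("=.*$", "", fields)
  values <- ifelse(grepl("=", fields, fixed = TRUE),
                   sub("^[^=]*=", "", fields),
                   "TRUE")
  setNames(values, keys)
}

parsed   <- lapply(vcf$INFO, parse_info)
all_keys <- unique(unlist(lapply(parsed, names)))

# Build one column per INFO key; rows lacking a key get NA.
info_cols <- as.data.frame(
  lapply(setNames(all_keys, all_keys), function(k) {
    vapply(parsed,
           function(p) if (k %in% names(p)) p[[k]] else NA_character_,
           character(1))
  }),
  stringsAsFactors = FALSE
)

tidy_vcf <- cbind(vcf[c("CHROM", "POS")], info_cols)

# Convert the columns we know about to useful types.
# Multi-allelic values such as "0.5,0.5" would need extra handling here.
tidy_vcf$DP    <- as.integer(tidy_vcf$DP)
tidy_vcf$AF    <- as.numeric(tidy_vcf$AF)
tidy_vcf$INDEL <- !is.na(tidy_vcf$INDEL)

print(tidy_vcf)
```

The result has one row per variant and one column per INFO key, which is the shape dplyr verbs and ggplot expect. A real pipeline would still need to deal with multi-valued fields and with keys that differ between callers.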