Open naupaka opened 7 years ago
That's a great idea! I think @JasonJWilliamsNY had also tried something like this. We definitely should do this.
I should also mention that the Rmd template file I mentioned in issue #47 also has some lines of code to parse some of the information out of the nasty-looking INFO column, so there's even more to work with. I tried a couple of different VCF libraries to get things into a workable, tidy form. There is a vcf2tidy function in the vcfR package, but I was unable to get it to work or do what I wanted, hence the hackish way we ended up doing it. I think vcf2tidy would be the most straightforward approach to produce this dataset, if we could make it work.
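For anyone following along, here is a rough base R sketch of what parsing the INFO column by hand can look like. This is not the code from the Rmd template. The toy data frame, the `extract_info` helper, and the choice of keys (DP, VDB, MQ, INDEL) are assumptions made up for illustration, and real INFO fields will vary with the variant caller.

```r
# Hypothetical toy data: three variant records with a semicolon-delimited INFO field.
vcf <- data.frame(
  POS  = c(1521L, 1612L, 9092L),
  INFO = c("DP=4;VDB=0.03;MQ=60", "DP=13;VDB=0.11;MQ=60", "INDEL;DP=6;MQ=37"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of any package): return the value stored under
# `key` in each INFO string, or NA where that key is absent.
extract_info <- function(info, key) {
  pattern <- paste0("(^|;)", key, "=([^;]*)")
  matches <- regmatches(info, regexec(pattern, info))
  # A successful match has 3 elements: full match, delimiter, value.
  vapply(matches,
         function(x) if (length(x) == 3) x[3] else NA_character_,
         character(1))
}

# One tidy column per key of interest.
vcf$DP    <- as.integer(extract_info(vcf$INFO, "DP"))
vcf$VDB   <- as.numeric(extract_info(vcf$INFO, "VDB"))
vcf$MQ    <- as.integer(extract_info(vcf$INFO, "MQ"))
# Flags such as INDEL carry no "=value", so test for their presence instead.
vcf$INDEL <- grepl("(^|;)INDEL(;|$)", vcf$INFO)

print(vcf[, c("POS", "DP", "VDB", "MQ", "INDEL")])
```

Once the keys are in their own columns, the data frame can go straight into dplyr verbs or a ggplot call, which is the point of getting it into tidy form.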
Having a richer dataset to analyze might allow us to better showcase the abilities of dplyr and ggplot. One to consider might be a csv version of the results of the VCF pipeline. We used this approach in a recent workshop at Stanford (script: parse_vcf.R, csv: all_vcf.csv). The R script and the csv it outputs are available here.