Illumina / Nirvana

The nimble & robust variant annotator
https://illumina.github.io/NirvanaDocumentation/
GNU General Public License v3.0
170 stars 44 forks source link

JSON parsing. Am I on crazy pills? #35

Open davemcg opened 4 years ago

davemcg commented 4 years ago

R

library(tidyverse)
library(RJSONIO)
anno <- RJSONIO::fromJSON('/Users/mcgaugheyd/Desktop/sample717.json.gz')

anno_tib <- anno$positions %>% bind_rows() %>% as_tibble()

anno_tib %>% dim()

This returns 152,392 variants.

bash

zcat sample717_SmallVariants.genome.vcf.gz | grep PASS | wc -l

This returns 1,811,351 variants

Hmm. The vcf is full of 0/0 and 1/. Perhaps I should remove those?

bash

zcat sample717__SmallVariants.genome.vcf.gz | grep PASS | grep -v "0/0" | grep -v "0/." | wc -l

This returns 490 variants.

OK, let's look at the json file again.

R

anno_tib %>% filter(filters == 'PASS') %>% dim()

This returns 1288 variants.


Can you give some pointers on how to line the json and vcf up?

davemcg commented 4 years ago

Is the partial answer that intergenic variants don't get annotated? When I get around to figuring out how to left_join these together I'll figure this out - but right now I'm not 100% confident the json is holding all the variants.

haochenl commented 4 years ago

Hi David,

Nirvana will skip the reference call variants from the genome VCF file. Then reason is mainly to control the size of the JSON output.

davemcg commented 4 years ago

Is there some way you recommend pulling out the format fields?

Since you didn't put them into the "outside" of the file (like a vcf would have in the header. You did this for some stuff like the gene info) the only way I can figure out how to extract (for example, the 'cosmic' fields) is to do a kind of computationally expensive extraction where I flatten the entire column and run a grep.

Example:

R

anno_tib %>% pull(variants) %>% unlist() %>% names() %>% grep('cosmic', ., ignore.case = TRUE, value = TRUE) %>% unique() %>% sort()
apeltzer commented 2 years ago

@davemcg - did you find a solution for this that worked for you? I've given the two Jupyter Notebooks that are part of the docs a try but didn't really get proper output that I could use downstream.