arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License
319 stars 120 forks source link

Gemini JSON output #748

Closed komalsrathi closed 8 years ago

komalsrathi commented 8 years ago

Hi,

I used the --format json parameter to get output in JSON format:

gemini query --format json -q "select chrom, start, end from variants" my.db | head > test.json

{"start": 17384, "end": 17385, "chrom": "chr1"}
{"start": 63734, "end": 63738, "chrom": "chr1"}
{"start": 69510, "end": 69511, "chrom": "chr1"}
{"start": 69760, "end": 69761, "chrom": "chr1"}
{"start": 137824, "end": 137825, "chrom": "chr1"}
{"start": 182685, "end": 182686, "chrom": "chr1"}
{"start": 183661, "end": 183662, "chrom": "chr1"}
{"start": 183799, "end": 183800, "chrom": "chr1"}
{"start": 186290, "end": 186291, "chrom": "chr1"}
{"start": 187101, "end": 187102, "chrom": "chr1"}

I am using Knitr with R Markdown to convert this JSON output to datatable in an HTML report.


---
title: "Gemini Output"
output: html_document

---
Table1:
```{r table1}
library(jsonlite)
document <- fromJSON(txt = 'test.json')
datatable(document, rownames = FALSE, options = list(
  ajax = list(
    serverSide = TRUE, processing = TRUE,
    dataType = 'jsonp'
  )
))```

But to use a JSON file in fromJSON, I have to format the default JSON output from gemini like this:

# added the square brackets and commas
[{"start": 17384, "end": 17385, "chrom": "chr1"},
{"start": 63734, "end": 63738, "chrom": "chr1"},
{"start": 69510, "end": 69511, "chrom": "chr1"},
{"start": 69760, "end": 69761, "chrom": "chr1"},
{"start": 137824, "end": 137825, "chrom": "chr1"},
{"start": 182685, "end": 182686, "chrom": "chr1"},
{"start": 183661, "end": 183662, "chrom": "chr1"},
{"start": 183799, "end": 183800, "chrom": "chr1"},
{"start": 186290, "end": 186291, "chrom": "chr1"},
{"start": 187101, "end": 187102, "chrom": "chr1"}]

I guess to use the file anywhere I will have to format it to look like a regular JSON file. Is that right or am I missing something?

brentp commented 8 years ago

Because the output may be large, we use line-delimited JSON

brentp commented 8 years ago

... so, yes, you'll have to do some processing, though it looks like jsonlite supports streaming ndjson with stream_in : https://cran.r-project.org/web/packages/jsonlite/jsonlite.pdf

komalsrathi commented 8 years ago

I noticed that the columns are reordered when --format json is specified. Is that on purpose?

$ gemini query -q "select variant_id,gene,chrom,start,ref,alt,dp,type,sub_type,impact from variants where rvis_score<0" --header my4.db | head
variant_id  gene    chrom   start   ref alt dp  type    sub_type    impact
105 KLHL17  chr1    961944  G   C   230 snp tv  synonymous_variant
106 KLHL17  chr1    962183  T   C   5   snp ts  5_prime_UTR_variant
107 KLHL17  chr1    962942  T   C   17  snp ts  intron_variant
108 KLHL17  chr1    964904  C   T   17  snp ts  intron_variant
109 KLHL17  chr1    964905  A   G   17  snp ts  intron_variant
110 KLHL17  chr1    965124  G   C   92  snp tv  synonymous_variant
111 KLHL17  chr1    965336  CTTAT   C   3   indel   del 3_prime_UTR_variant
112 KLHL17  chr1    965349  G   A   3   snp ts  3_prime_UTR_variant
139 ISG15   chr1    1002307 T   C   2   snp ts  intron_variant
$ gemini query -q "select variant_id,gene,chrom,start,ref,alt,dp,type,sub_type,impact from variants where rvis_score<0" --header --format json my4.db | head
{"impact": "synonymous_variant", "type": "snp", "ref": "G", "start": 961944, "variant_id": 105, "alt": "C", "gene": "KLHL17", "chrom": "chr1", "sub_type": "tv", "dp": 230}
{"impact": "5_prime_UTR_variant", "type": "snp", "ref": "T", "start": 962183, "variant_id": 106, "alt": "C", "gene": "KLHL17", "chrom": "chr1", "sub_type": "ts", "dp": 5}
{"impact": "intron_variant", "type": "snp", "ref": "T", "start": 962942, "variant_id": 107, "alt": "C", "gene": "KLHL17", "chrom": "chr1", "sub_type": "ts", "dp": 17}
{"impact": "intron_variant", "type": "snp", "ref": "C", "start": 964904, "variant_id": 108, "alt": "T", "gene": "KLHL17", "chrom": "chr1", "sub_type": "ts", "dp": 17}
{"impact": "intron_variant", "type": "snp", "ref": "A", "start": 964905, "variant_id": 109, "alt": "G", "gene": "KLHL17", "chrom": "chr1", "sub_type": "ts", "dp": 17}
{"impact": "synonymous_variant", "type": "snp", "ref": "G", "start": 965124, "variant_id": 110, "alt": "C", "gene": "KLHL17", "chrom": "chr1", "sub_type": "tv", "dp": 92}
{"impact": "3_prime_UTR_variant", "type": "indel", "ref": "CTTAT", "start": 965336, "variant_id": 111, "alt": "C", "gene": "KLHL17", "chrom": "chr1", "sub_type": "del", "dp": 3}
{"impact": "3_prime_UTR_variant", "type": "snp", "ref": "G", "start": 965349, "variant_id": 112, "alt": "A", "gene": "KLHL17", "chrom": "chr1", "sub_type": "ts", "dp": 3}
{"impact": "intron_variant", "type": "snp", "ref": "T", "start": 1002307, "variant_id": 139, "alt": "C", "gene": "ISG15", "chrom": "chr1", "sub_type": "ts", "dp": 2}
{"impact": "intron_variant", "type": "snp", "ref": "C", "start": 1004715, "variant_id": 140, "alt": "T", "gene": "ISG15", "chrom": "chr1", "sub_type": "ts", "dp": 2}
brentp commented 8 years ago

It is because it converts to a dictionary (which is unordered) before converting to json. json objects are also inherently unordered by the spec.

komalsrathi commented 8 years ago

Got it. I am probably going to reorder the columns in the dataframe after importing the JSON object. Thanks!