AnnoQR: R client for AnnoQ Variant Query

Introduction

AnnoQR is an R client for querying the AnnoQ API.

Installation

Install from GitHub. Make sure devtools is installed first:

install.packages("devtools")

Then install AnnoQR:

library(devtools)

install_github("USCbiostats/AnnoQR")

Function list

Examples

Query Variants with ANNOVAR_ensembl_Effect Annotation

library(AnnoQR)
q = init_query_js_body()
ex = exists_filter("ANNOVAR_ensembl_Effect")
q = add_query_filter(q, ex)
variants = perform_search(q)
variants

Retrieve only the ANNOVAR_ensembl_Effect column

q = add_source(q, c("ANNOVAR_ensembl_Effect"))
variants = perform_search(q)
variants

Query variants whose SnpEff_ensembl_Effect field is intergenic_region

q = init_query_js_body()
term_f = term_filter('SnpEff_ensembl_Effect', 'intergenic_region')
q = add_query_filter(q, term_f)
variants = perform_search(q)
variants

Query variants whose SnpEff_ensembl_Effect field is intergenic_region within chromosome 20

q = init_query_js_body()
term_f1 = term_filter('SnpEff_ensembl_Effect', 'intergenic_region')
term_f2 = term_filter('chr', '20')
q = add_query_filter(q, term_f1)
q = add_query_filter(q, term_f2)
#q = add_source(q, c('SnpEff_ensembl_Effect'))
variants = perform_search(q)
variants

Query variants with a 1000 Genomes allele count (1000Gp3_AC) greater than 5

q = init_query_js_body()
range_f = range_filter(key = '1000Gp3_AC', gt = 5)
q = add_query_filter(q, range_f)
variants = perform_search(q)
variants

Chromosome range query

variants = regionQuery(contig = '20', start=31710367, end=31820367)
variants

rsID query

variant = rsidQuery('rs193031179')
variant

Keyword query

keywordsQuery('protein_coding')

Guidance on Using Our Elasticsearch-based API

The AnnoQ API is built on Elasticsearch, so Elasticsearch's limits on result size determine how many matches each search function can return:

Default Behavior with perform_search(q)

perform_search(q) returns only the first page of matches. The following snippet runs a query and counts the hits it returns:

q = init_query_js_body()
ex = exists_filter("ANNOVAR_ensembl_Effect")
q = add_query_filter(q, ex)
variants = perform_search(q)
hits = variants$hits$hits
length(hits$`_index`)

Running this snippet typically results in:

10

Only the first 10 matches are returned, which matches Elasticsearch's default result size.
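To make the hit count easier to check, here is a minimal sketch of a helper that counts the hits in a parsed response. It assumes the response has the shape used above, where `hits$hits` is a data frame with an `_index` column. Both `summarize_hits` and `mock_response` are hypothetical names introduced for illustration; the mock data is not real AnnoQ output.

```r
# Hypothetical helper: report how many hits a parsed Elasticsearch-style
# response contains. Assumes response$hits$hits is a data frame with an
# `_index` column, as in the snippet above.
summarize_hits <- function(response) {
  hits <- response$hits$hits
  if (is.null(hits)) {
    return(0L)
  }
  length(hits$`_index`)
}

# Mock response for illustration only (not real AnnoQ output).
mock_response <- list(
  hits = list(
    hits = data.frame(
      `_index` = rep("example-index", 10),
      `_id` = as.character(1:10),
      check.names = FALSE,
      stringsAsFactors = FALSE
    )
  )
)

returned <- summarize_hits(mock_response)
print(returned)
```

With a live connection, the same helper could be called as `summarize_hits(perform_search(q))`.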

Retrieving All Matches with perform_search_with_count(q)

Consider this code snippet:

q = init_query_js_body()
ex = exists_filter("ANNOVAR_ensembl_Effect")
q = add_query_filter(q, ex)
variants = perform_search_with_count(q)

This may result in an error if the result set is too large:

Error in perform_search_with_count(q) : Bad Request (HTTP 400)

The error occurs because perform_search_with_count(q) tries to retrieve all matches in a single request, which exceeds Elasticsearch's maximum result size.
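If a script should keep running when such a request fails, the call can be wrapped in base R's `tryCatch`. This is only a sketch: `try_search` and `failing_search` are hypothetical names, and `failing_search` is a stand-in for `perform_search_with_count` so the example runs offline.

```r
# Hypothetical wrapper: run any search function and turn an error such as
# "Bad Request (HTTP 400)" into a NULL result plus a readable message.
try_search <- function(search_fn, q) {
  tryCatch(
    search_fn(q),
    error = function(e) {
      message("Search failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for perform_search_with_count(), so the sketch runs offline.
failing_search <- function(q) stop("Bad Request (HTTP 400)")

result <- try_search(failing_search, list())
print(is.null(result))
```

With AnnoQR loaded, the equivalent call would be `try_search(perform_search_with_count, q)`.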

Diagnosing Large Queries with perform_search_find_count(q)

To check the size of a query's result set, use:

q = init_query_js_body()
ex = exists_filter("ANNOVAR_ensembl_Effect")
q = add_query_filter(q, ex)
variants = perform_search_find_count(q)

This will generate:

Debug: Query JSON:
 {"query":{"bool":{"filter":[{"exists":{"field":"ANNOVAR_ensembl_Effect"}}]}},"size":40405505} 
Response [http://annoq.org/api/annoq-test/_search]
  Date: 
  Status: 400
  Content-Type: application/json;charset=utf-8
  Size: 1.48 kB

Here the "size" parameter is 40,405,505, which explains the error: a result set this large cannot be returned by a single Elasticsearch query.
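A simple guard can flag such queries before all matches are requested. The sketch below assumes Elasticsearch's default `index.max_result_window` of 10000; the AnnoQ server may be configured with a different value, so treat the constant as an assumption. `fits_in_one_query` is a hypothetical helper, not part of AnnoQR.

```r
# Elasticsearch's default index.max_result_window is 10000. The AnnoQ
# server may use a different setting, so this value is an assumption.
DEFAULT_MAX_RESULT_WINDOW <- 10000

# Hypothetical helper: TRUE if `count` matches could be requested in a
# single query under the assumed window.
fits_in_one_query <- function(count, max_window = DEFAULT_MAX_RESULT_WINDOW) {
  count <= max_window
}

print(fits_in_one_query(40405505))  # the size from the debug output above
print(fits_in_one_query(10))        # the default page of 10 hits
```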

Managing Large Datasets