genome reference assembly extraction from variants/reads objects?

ttriche commented 10 years ago

One of the handy defaults in Bioconductor GRanges/VRanges objects is a slot for the genome (reference assembly) of each chromosome (should all be the same for any sane object, of course). This helps prevent comparisons of (e.g.) hg18 and hg19, or hg19 and hg38, data as if on the same coordinate system (a practice which is such a terrible idea that it's essentially never worth allowing).

Unfortunately, this safeguard can't be enforced when no genome is specified.

In the course of adding a default seqlevelsStyle for *Ranges, I realized it would be nice to have the genome reference automatically specified when retrieving variants or aligned reads. This doesn't seem to be possible from the current data structure. What am I overlooking here?

pgrosu commented 10 years ago

Hi Tim,

Actually I was able to find it using the java api client using this command:

java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["10473108253681171589"]}' --pretty_print | less

Then in the the metadata section you will see listed as GRCh37 like this:

...
 "metadata" : [ {
      "key" : "FORMAT",
      "id" : "DS",
      "type" : "float",
      "number" : "1",
      "description" : "Genotype dosage from MaCH/Thunder"
    }, {
      "key" : "reference",
      "value" : "GRCh37",
      "type" : "unknownType"
    }, {
...

Hope it helps, ~p

ttriche commented 10 years ago

That is terrific; it's in there, at least.

This sort of thing may become more of an issue (generally) as "variant" calls move towards graph traversals from assembly A to variant assembly representation B. But at least there's a way to extract it and prevent some issues when using the data from R, which avoids certain types of derp.

Thank you!

--t

On Oct 29, 2014, at 7:58 PM, Paul Grosu notifications@github.com wrote:

Hi Tim,

Actually I was able to find it using the java api client using this command:

java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["10473108253681171589"]}' --pretty_print | less Then in the the metadata section you will see listed as GRCh37 like this:

... "metadata" : [ { "key" : "FORMAT", "id" : "DS", "type" : "float", "number" : "1", "description" : "Genotype dosage from MaCH/Thunder" }, { "key" : "reference", "value" : "GRCh37", "type" : "unknownType" }, { ... Hope it helps, ~p

— Reply to this email directly or view it on GitHub.

pgrosu commented 10 years ago

Yeah, using a graph representation of the variants will definitely be a different approach, and that's the way GA4GH is also going. It will take a bit of work to do it efficiently, but that's the fun part :)

Also below are the steps if you want to do what I did above in Java through R:

1) First paste the following after you perform the authenticate("client_secrets.json"):

library(httr)

getMetaData <- function(datasetIds="10473108253681171589", endpoint="https://www.googleapis.com/genomics/v1beta/",
  pageToken=NULL) {

  # Fetch data from the Genomics API
  body <- list(datasetIds=list(datasetIds), pageToken=pageToken)

  message("Fetching read data page")

  #queryParams <- list(fields="nextPageToken,reads(name,cigar,position,originalBases,flags)")
  queryParams <- list(fields="nextPageToken,variantSets(metadata)")
  queryConfig <- config()
  if (GoogleGenomics:::.authStore$use_api_key) {
    queryParams <- c(queryParams, key=GoogleGenomics:::.authStore$api_key)
  } else {
    queryConfig <- config(token=GoogleGenomics:::.authStore$google_token)
  }
  res <- POST(paste(endpoint, "variantsets/search", sep=""),
    query=queryParams,
    body=rjson::toJSON(body), queryConfig,
    add_headers("Content-Type"="application/json"))
  if("error" %in% names(httr::content(res))) {
    print(paste("ERROR:", httr::content(res)$error$message))
  }
  stop_for_status(res)

  message("Parsing read data page")
  res_content <- httr::content(res)
  res_content
}

dataset_metadata_info <- getMetaData()
dataset_metadata_info$variantSets[[1]]$metadata[[2]]$value

2) Once you run the last two commmands you should see something like this:

> dataset_metadata_info <- getMetaData()
dataset_metadata_info$variantSets[[1]]$metadata[[2]]$value
Fetching read data page
Parsing read data page
> [1] "GRCh37"

~p

deflaux commented 9 years ago

This is a great feature request and is something we should do soon.

For reads, its the referenceSetId which can then be used to look up the reference set

java -jar target/genomics-tools-client-java-v1beta2.jar getreadgroupset --id CMvnhpKTFhDnk4_9zcKO3_YB --pretty_print
...
    "id" : "ChhDTXZuaHBLVEZoRG5rNF85emNLTzNfWUIQAQ",
    "name" : "ERR016214",
    "predictedInsertSize" : 465,
    "referenceSetId" : "EOSt9JOVhp3jkwE",
    "sampleId" : "HG00261"
...

For variants, it comes from the VCF the header and we can get it from the variant set metadata.

java -jar target/genomics-tools-client-java-v1beta2.jar getvariantset --id 3049512673186936334 --pretty_print
...
 }, {
    "key" : "reference",
    "type" : "UNKNOWN_TYPE",
    "value" : "file:///illumina/development/Isis/Genomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFASTA/genome.fa"
  }, {
...

Per @calbach down the road we should have https://github.com/googlegenomics/api-client-java/issues/66

Bioconductor / GoogleGenomics

genome reference assembly extraction from variants/reads objects? #13