Bioconductor / GoogleGenomics

[DEPRECATED] An R package for Google Genomics API queries.
Apache License 2.0
44 stars 23 forks source link

seqlevelsStyle oddities #14

Closed ttriche closed 10 years ago

ttriche commented 10 years ago

This works OK:

getVariants(datasetId = "10473108253681171589", chromosome = "22", start = 16051400, end = 16051500, oneBasedCoord = FALSE, fields = NULL, converter = c)

Fetching variant page

Variants are now available

...

but this does not:

variant <- getVariants(datasetId = "10473108253681171589", chromosome = "chr22", start = 16051400, end = 16051500, oneBasedCoord = FALSE, fields = NULL, converter = c)

Fetching variant page

Variants are now available

NULL

This is OK:

getReadData(chromosome = "chr19", start = 45411941, end = 45412079, readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=", endpoint = "https://www.googleapis.com/genomics/v1beta/", pageToken = NULL)

Fetching read data page

Parsing read data page

Read data is now available

and this is not:

getReadData(chromosome = "19", start = 45411941, end = 45412079, readsetId = "CJ_ppJ-WCxDxrtDr5fGIhBA=", endpoint = "https://www.googleapis.com/genomics/v1beta/", pageToken = NULL)

This sure seems like a job for GenomeInfoDb. Am I missing something obvious? Also, specifying the reference assembly for each reference sequence (chromosome) would be nice.

pgrosu commented 10 years ago

Hi Tim,

So for that dataset, the referenceName are all numbers denoting the chromosome, except for X, Y and MT. Below is an example using the Java API:

$ java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["10473108253681171589"]}' --pretty_print | grep referenceName
      "referenceName" : "1",
      "referenceName" : "10",
      "referenceName" : "11",
      "referenceName" : "12",
      "referenceName" : "13",
      "referenceName" : "14",
      "referenceName" : "15",
      "referenceName" : "16",
      "referenceName" : "17",
      "referenceName" : "18",
      "referenceName" : "19",
      "referenceName" : "2",
      "referenceName" : "20",
      "referenceName" : "21",
      "referenceName" : "22",
      "referenceName" : "3",
      "referenceName" : "4",
      "referenceName" : "5",
      "referenceName" : "6",
      "referenceName" : "7",
      "referenceName" : "8",
      "referenceName" : "9",
      "referenceName" : "MT",
      "referenceName" : "X",
      "referenceName" : "Y",
$

~p

pgrosu commented 10 years ago

Hi Tim,

Regarding the reads, the dataset it belongs to is 383928317087. Tthe referenceName for the dataset that that ReadSet ID belongs to contains only referenceName 1, as shown below:

$ java -jar target/genomics-tools-client-java-v1beta.jar custom --custom_endpoint "variantsets/search" --custom_method POST --custom_body '{"datasetIds":["383928317087"]}' --pretty_print | grep referenceName
      "referenceName" : "1",
$

Hope it helps, Paul

deflaux commented 10 years ago

I believe this was fixed with https://github.com/googlegenomics/api-client-r/pull/15

pgrosu commented 10 years ago

Nicole, I'm not so sure about that :( Below are my tests after updating to the latest version. Tim, should I have tried my tests in a different way?

> library(GoogleGenomics)
Loading required package: GenomicAlignments
Loading required package: BiocGenerics
...
GoogleGenomics: Do not forget to authenticate.
        Use GoogleGenomics::authenticate(file="secretsFile.json").
        See method documentation on how to obtain the secretsFile.

> GoogleGenomics::authenticate(file="client_secrets.json")
[1] "Configured OAuth token."
> vars.22 <- getVariants(datasetId = "10473108253681171589", chromosome = "22", start = 16051400, end = 16051500, fields = NULL, converter = c)
Fetching variants page
Variants are now available
> vars.chr22 <- getVariants(datasetId = "10473108253681171589", chromosome = "chr22", start = 16051400, end = 16051500, fields = NULL, converter = c)
Fetching variants page
Variants are now available
> require(testthat)
Loading required package: testthat
> expect_equal(vars.22, vars.chr22)
Error: vars.22 not equal to vars.chr22
target is NULL, current is list
> reads.chr19 <- getReads(readsetId="CJ_ppJ-WCxDxrtDr5fGIhBA=", chromosome="chr19", start=45411941, end=45412079, fields=NULL, converter=c)
Fetching reads page
Reads are now available
> reads.19 <- getReads(readsetId="CJ_ppJ-WCxDxrtDr5fGIhBA=", chromosome="19",
start=45411941, end=45412079, fields=NULL, converter=c)+
Fetching reads page
[1] "ERROR: The given readsets are not aligned to any reference with sequenceName \"19\". Did you mean \"chr19\"? Wanted one of [\"chrM\" \"chr1\" \"chr2\" \"chr3\" \"chr4\" \"chr5\" \"chr6\" \"chr7\" \"chr8\" \"chr9\" \"chr10\" \"chr11\" \"chr12\" \"chr13\" \"chr14\" \"chr15\" \"chr16\" \"chr17\" \"chr18\" \"chr19\" \"chr20\" \"chr21\" \"chr22\" \"chrX\" \"chrY\" \"*\"]"
Error in getSearchPage("reads", body, fields, pageToken) :
  client error: (400) Bad Request
> expect_equal(reads.chr19, reads.19)
Error in compare(expected, actual, ...) : object 'reads.19' not found
>

Paul

ttriche commented 10 years ago

Why would you expect them to be equal? E.g. If I write SQL for

SELECT * FROM Variants WHERE Reference LIKE '22';

SELECT * FROM Reads WHERE Reference LIKE 'chr22';

You wouldn't realistically expect the database to know that the clause means the same thing. (With the variation reference standard, it will get much more exciting, too!)

Furthermore, unless I know a priori what all synonymous versions of "chr22" exist in Reference, I can't make any guarantees if the code tries to guess what the user "means". I don't, so I can't, so the code doesn't.

All seqlevelsStyle() does is to standardize the seqlevels in the returned BioC data structures. I have no control over how the genomics API stores the reference; workaround would require some tinkering on their end. (Furthermore, I'm not sure it would be such a great idea, given that GRCh37/38 builds from Genbank like to use wacky things like ch1_1 and so forth in the reference names, and it will only get wackier from here).

Anyways. The patches in #15 just try to ensure that BioC data structures have a consistent, UCSC-style reference sequence (chromosome) representation. Regardless of whether reads/variants are referenced against "ch22_22" or "22" or "chr22", everything in BioC works better by default if your seqlevels are UCSC style. So I made that the default.

Maybe I should submit the example code that I sent Cassie and Nicole, as a unit test itself. But, then the tests would require BSgenome.Hsapiens.

--t

On Nov 4, 2014, at 3:25 PM, Paul Grosu notifications@github.com wrote:

Nicole, I'm not so sure about that :( Below are my tests after updating to the latest version. Tim, should I have tried my tests in a different way?

library(GoogleGenomics) Loading required package: GenomicAlignments Loading required package: BiocGenerics ... GoogleGenomics: Do not forget to authenticate. Use GoogleGenomics::authenticate(file="secretsFile.json"). See method documentation on how to obtain the secretsFile.

GoogleGenomics::authenticate(file="client_secrets.json") [1] "Configured OAuth token." vars.22 <- getVariants(datasetId = "10473108253681171589", chromosome = "22", start = 16051400, end = 16051500, fields = NULL, converter = c) Fetching variants page Variants are now available vars.chr22 <- getVariants(datasetId = "10473108253681171589", chromosome = "chr22", start = 16051400, end = 16051500, fields = NULL, converter = c) Fetching variants page Variants are now available require(testthat) Loading required package: testthat expect_equal(vars.22, vars.chr22) Error: vars.22 not equal to vars.chr22 target is NULL, current is list reads.chr19 <- getReads(readsetId="CJ_ppJ-WCxDxrtDr5fGIhBA=", chromosome="chr19", start=45411941, end=45412079, fields=NULL, converter=c) Fetching reads page Reads are now available reads.19 <- getReads(readsetId="CJ_ppJ-WCxDxrtDr5fGIhBA=", chromosome="19", start=45411941, end=45412079, fields=NULL, converter=c)+ Fetching reads page [1] "ERROR: The given readsets are not aligned to any reference with sequenceName \"19\". Did you mean \"chr19\"? Wanted one of [\"chrM\" \"chr1\" \"chr2\" \"chr3\" \"chr4\" \"chr5\" \"chr6\" \"chr7\" \"chr8\" \"chr9\" \"chr10\" \"chr11\" \"chr12\" \"chr13\" \"chr14\" \"chr15\" \"chr16\" \"chr17\" \"chr18\" \"chr19\" \"chr20\" \"chr21\" \"chr22\" \"chrX\" \"chrY\" \"*\"]" Error in getSearchPage("reads", body, fields, pageToken) : client error: (400) Bad Request expect_equal(reads.chr19, reads.19) Error in compare(expected, actual, ...) : object 'reads.19' not found

Paul

— Reply to this email directly or view it on GitHub.

pgrosu commented 10 years ago

Tim - I completely agree, and I would have been very amazed if the code could detect and reformat to the stored referenceName, but I wanted to check with you in case I tested things incorrectly. It would be great to include the example code that performs the tests - this way things are consistent with the changes and would manage user expectations.

I am always in favor of controlled vocabulary, so I feel a remapping function would be helpful in case folks analyze across multiple datasets. But again that would be something that should be added to the core API, since that is something that the original data submitter would need to provide.

Nicole, what do you think? It's not that much work, but it will save a lot of time in the long run.

ttriche commented 10 years ago

example code posted as #21

Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science

On Tue, Nov 4, 2014 at 4:40 PM, Paul Grosu notifications@github.com wrote:

Tim - I completely agree, and I would have been very amazed if the code could detect and reformat to the stored referenceName, but I wanted to check with you in case I tested things incorrectly. It would be great to include the example code that performs the tests - this way things are consistent with the changes and would manage user expectations.

I am always in favor of controlled vocabulary, so I feel a remapping function would be helpful in case folks analyze across multiple datasets. But again that would be something that should be added to the core API, since that is something that the original data submitter would need to provide.

Nicole, what do you think? It's not that much work, but it will save a lot of time in the long run.

— Reply to this email directly or view it on GitHub https://github.com/googlegenomics/api-client-r/issues/14#issuecomment-61741653 .

pgrosu commented 10 years ago

Thanks Tim - just saw it and commented :)

deflaux commented 9 years ago

@ttriche thanks for the clarification. Your patch made the referenceName consistent upon conversion to BioConductor objects. But the remaining issue is that the Google Genomics APIs currently store the referenceName as it is found within the source data. We do not currently normalize it or support a server-side alias for queries via the Google Genomics API.

This isn't something we can address in a clean way client-side, so I'll leave this R client issue closed. But please add your feedback to the GA4GH discussions on referenceName naming conventions. Thanks again!

ttriche commented 9 years ago

Indeed. It would be nice to have a similar canonicalization mechanism in the API, but that's far beyond the scope of this particular patch. If and when a canonical mechanism for reference sequence naming is added to the GA4GH APIs, this issue could be revisited at that time. I'll keep an eye out for that topic :-)