Bioconductor / GenomicFeatures

Query the gene models of a given organism/assembly
https://bioconductor.org/packages/GenomicFeatures
26 stars 12 forks source link

Affected changes from the new UCSC Table Query interface of rtracklayer package #29

Closed sanchit-saini closed 3 years ago

sanchit-saini commented 3 years ago

Hello @hpages

GenomicFeatures package was affected in two places because of recent rtracklayer changes: 1. checkTable() which is fixed with https://github.com/Bioconductor/GenomicFeatures/commit/efd665c82b86a8f060627234f34b435a3858d445

2. UCSCFeatureDbTableSchema() uses ucscSchema() to get example data of a table then used to guess R type from the example data. As ucscSchema() now utilizes UCSC REST API that does not provide any example data, but it provides a JSON type that might be helpful to generate a mapping between R types. That might be easy compared to the SQL type as the javascript type system is relatively simple compared to SQL.

Example:

> library(rtracklayer)
> x <- ucscSchema(ucscTableQuery("hg38", table = "gold"))
> x[c("field", "JSON.type", "SQL.type")]

UCSCSchema table 'gold' with 45830 rows and 3 columns
DataFrame with 10 rows and 3 columns
         field   JSON.type         SQL.type
   <character> <character>      <character>
1          bin      number      smallint(6)
2        chrom      string     varchar(255)
3   chromStart      number int(10) unsigned
4     chromEnd      number int(10) unsigned
5           ix      number          int(11)
6         type      string          char(1)
7         frag      string     varchar(255)
8    fragStart      number int(10) unsigned
9      fragEnd      number int(10) unsigned
10      strand      string          char(1)

What are your thoughts about this approach?

hpages commented 3 years ago

Hi @sanchit-saini ,

Thanks for the PR. I think there's code in GenomicFeatures (and maybe in other Bioconductor packages) that calls UCSCFeatureDbTableSchema() and operates on its output. With the proposed change that code will probably break and will need to be fixed too.

H.

hpages commented 3 years ago

So I grep'ed all Bioconductor software package code and the good news is that UCSCFeatureDbTableSchema() is not used outside the GenomicFeatures package. This makes it a lot easier to deal with the proposed change.

sanchit-saini commented 3 years ago

Hi @hpages
I tried to generate mapping using JSON type. It is impossible to distinguish between floating-point and integer numbers from JSON number type alone. So we are only left with SQL type.

As UCSC REST API provides SQL and JSON mapping, So I looked into its source code and model the SQL to R type mapping logic on it.

API code: https://github.com/ucscGenomeBrowser/kent/blob/v410_base/src/hg/hubApi/apiUtils.c#L113-L167

If anything needs attention, I'm happy to update it.

sanchit-saini commented 3 years ago

Hi @hpages, I have implemented the fix can please review it.

hpages commented 3 years ago

Thanks @sanchit-saini.

I'm not fond of having something like sqlTypesToRTypes() in GenomicFeatures. Seems like the wrong place for this. I wish there was something like this already in the DBI stack. but I can't find it. This would be worth asking to the r-dbi folks. Also the current implementation encapsulates a lot of knowledge which is a potential maintenance hassle. The mapping between SQL types and R types must surely be encoded somewhere because packages like RMySQL and RSQLite clearly know about it. How about using their knowledge instead e.g. by doing something like this (using RSQLite here seems to be the most convenient/efficient path):

map_SQLtypes_to_Rtypes <- function(SQLtypes)
{
    stopifnot(is.character(SQLtypes))
    SQLtypes <- gsub(" *unsigned", "", SQLtypes)
    col_defs <- sprintf("col%d %s", seq_along(SQLtypes), SQLtypes)
    sql <- sprintf("CREATE TABLE dummy (%s)", paste0(col_defs, collapse=", "))
    conn <- dbConnect(SQLite())
    on.exit(dbDisconnect(conn))
    dbExecute(conn, sql)
    dummy <- dbReadTable(conn, "dummy")
    col_types <- vapply(dummy, function(col) class(col)[[1L]], character(1))
    setNames(col_types, SQLtypes)
}

Then:

SQLtypes <- c("varchar(255)",
              "int(10) unsigned",
              "double",
              "decimal",
              "char(1)",
              "tinyint(10)",
              "mediumint(10)",
              "int(10)",
              "longblob",
              "binary",
              "bit",
              "datetime",
              "tetypet",
              "smallint(5) unsigned NOT NULL")

map_SQLtypes_to_Rtypes(SQLtypes)
#         varchar(255)              int(10)               double 
#          "character"            "integer"            "numeric" 
#              decimal              char(1)          tinyint(10) 
#            "numeric"          "character"            "integer" 
#        mediumint(10)              int(10)             longblob 
#            "integer"            "integer"               "blob" 
#               binary                  bit             datetime 
#            "numeric"            "numeric"            "numeric" 
#              tetypet smallint(5) NOT NULL 
#            "numeric"            "integer" 

This will fail for some of the MySQL types not recognized by SQLite (e.g. set) but I think that in the context where map_SQLtypes_to_Rtypes() will be used (i.e. UCSCFeatureDbTableSchema()) these problematic types won't be seen so we should be good.

Two more things:

  1. Why the direct slot access (res@listData$SQL.type) to extract a column when you can use the standard data.frame API for that (e.g. res$SQL.type or res[["SQL.type"]] or res[ , "SQL.type"])?

  2. Various examples in the man page for makeFeatureDbFromUCSC() trigger the deprecation warning:

    supported_tables <- supportedUCSCFeatureDbTables(genome="mm10", track="qPCR Primers")
    # Warning message:
    # In .local(x, ...) :
    #   'track' parameter is deprecated now you go by the 'table' instead
    #                 Use ucscTables(genome, track) to retrieve the list of tables for a track
    
    fdb <- UCSCFeatureDbTableSchema(genome="mm10", track="qPCR Primers", tablename="qPcrPrimers")
    Warning messages:
    # 1: In .local(x, ...) :
    #   'track' parameter is deprecated now you go by the 'table' instead
    #                 Use ucscTables(genome, track) to retrieve the list of tables for a track
    # 2: In .local(x, ...) :
    #   'track' parameter is deprecated now you go by the 'table' instead
    #                 Use ucscTables(genome, track) to retrieve the list of tables for a track

    Please make sure to address them all.

Thanks again for your work on this.

sanchit-saini commented 3 years ago

Hi @hpages, Thanks for the detailed illustration of another approach to extract type information. I have updated the type mapping approach as you suggested, which looks more elegant. And along with address both of your concerns. Can you review it again?

hpages commented 3 years ago

Hi @sanchit-saini,

Changes look good but we still have one problem:

> fdb <- makeFeatureDbFromUCSC(genome="mm10", track="qPCR Primers", tablename="qPcrPrimers")
Download the qPcrPrimers table ... Error in handleResponseForOutputTypes(response, tableName, query) : 
  Unsupported file format :  bedDetail

This example is from makeFeatureDbFromUCSC's man page. It works in release (GenomicFeatures 1.42.1, from BioC 3.12).

Please make sure that all examples in makeFeatureDbFromUCSC's man page work as expected. Ideally you'd also want to run R CMD build followed by R CMD check on the package.

H.

sanchit-saini commented 3 years ago

Hi @hpages, It passed all the checks on my system. I investigated a bit to find what caused the failure and find out it is occurring because the binary package is not built against the latest source code in rtracklayer. And the reason for the outdated binary package might be failed checks and errors in the build. I will look into the rtracklayer. For now, you can verify it by following steps:

git clone https://git.bioconductor.org/packages/rtracklayer
R CMD INSTALL rtracklayer
R CMD build GenomicFeatures
R CMD check GenomicFeatures_1.43.3.tar.gz
hpages commented 3 years ago

I'm on Linux (Ubuntu) so everything I install gets installed from source. I installed the latest available rtracklayer with BiocInstaller::install("rtracklayer"):

> library(rtracklayer)
> sessionInfo()
R Under development (unstable) (2021-02-05 r79941)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r79941/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79941/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] rtracklayer_1.51.4   GenomicRanges_1.43.3 GenomeInfoDb_1.27.6 
[4] IRanges_2.25.6       S4Vectors_0.29.7     BiocGenerics_0.37.1 

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13             XVector_0.31.1             
 [3] zlibbioc_1.37.0             GenomicAlignments_1.27.2   
 [5] BiocParallel_1.25.4         lattice_0.20-41            
 [7] rjson_0.2.20                tools_4.1.0                
 [9] grid_4.1.0                  SummarizedExperiment_1.21.1
[11] Biobase_2.51.0              matrixStats_0.58.0         
[13] yaml_2.2.1                  crayon_1.4.1               
[15] BiocIO_1.1.2                Matrix_1.3-2               
[17] GenomeInfoDbData_1.2.4      restfulr_0.0.13            
[19] bitops_1.0-6                RCurl_1.98-1.2             
[21] DelayedArray_0.17.9         compiler_4.1.0             
[23] MatrixGenerics_1.3.1        Biostrings_2.59.2          
[25] Rsamtools_2.7.1             XML_3.99-0.5               

Maybe what happens is that rtracklayer needs a version bump? Looks like there were several important changes after the last version bump from Jan 13. This would explain why my rtracklayer 1.51.4 is not the same as your rtracklayer 1.51.4.

H.

sanchit-saini commented 3 years ago

Maybe what happens is that rtracklayer needs a version bump?

Yes, you are right. "Changes made to GIT without a corresponding version bump do not propagate to the repository visible to BiocManager::install()" from https://bioconductor.org/developers/how-to/version-numbering/ Thanks for suggesting it. I did not know about this. I need to read the Bioconductor developer resource more thoroughly.

hpages commented 3 years ago

I can confirm that after installing the latest rtracklayer from git.bioconductor.org, the makeFeatureDbFromUCSC examples work again and this patch passes R CMD check cleanly (except for one warning that is unrelated to the patch). So as soon as rtracklayer's version gets bumped, we'll be good to go. Thanks!

sanchit-saini commented 3 years ago

It is done, we are good to go now.

hpages commented 3 years ago

Thanks!