Closed sanchit-saini closed 3 years ago
Hi @sanchit-saini ,
Thanks for the PR. I think there's code in GenomicFeatures (and maybe in other Bioconductor packages) that calls UCSCFeatureDbTableSchema()
and operates on its output. With the proposed change that code will probably break and will need to be fixed too.
H.
So I grep'ed all Bioconductor software package code and the good news is that UCSCFeatureDbTableSchema()
is not used outside the GenomicFeatures package. This makes it a lot easier to deal with the proposed change.
Hi @hpages
I tried to generate mapping using JSON type. It is impossible to distinguish between floating-point and integer numbers from JSON number
type alone. So we are only left with SQL type.
As UCSC REST API provides SQL and JSON mapping, So I looked into its source code and model the SQL to R type mapping logic on it.
API code: https://github.com/ucscGenomeBrowser/kent/blob/v410_base/src/hg/hubApi/apiUtils.c#L113-L167
If anything needs attention, I'm happy to update it.
Hi @hpages, I have implemented the fix can please review it.
Thanks @sanchit-saini.
I'm not fond of having something like sqlTypesToRTypes()
in GenomicFeatures. Seems like the wrong place for this. I wish there was something like this already in the DBI stack. but I can't find it. This would be worth asking to the r-dbi folks. Also the current implementation encapsulates a lot of knowledge which is a potential maintenance hassle. The mapping between SQL types and R types must surely be encoded somewhere because packages like RMySQL and RSQLite clearly know about it. How about using their knowledge instead e.g. by doing something like this (using RSQLite here seems to be the most convenient/efficient path):
map_SQLtypes_to_Rtypes <- function(SQLtypes)
{
stopifnot(is.character(SQLtypes))
SQLtypes <- gsub(" *unsigned", "", SQLtypes)
col_defs <- sprintf("col%d %s", seq_along(SQLtypes), SQLtypes)
sql <- sprintf("CREATE TABLE dummy (%s)", paste0(col_defs, collapse=", "))
conn <- dbConnect(SQLite())
on.exit(dbDisconnect(conn))
dbExecute(conn, sql)
dummy <- dbReadTable(conn, "dummy")
col_types <- vapply(dummy, function(col) class(col)[[1L]], character(1))
setNames(col_types, SQLtypes)
}
Then:
SQLtypes <- c("varchar(255)",
"int(10) unsigned",
"double",
"decimal",
"char(1)",
"tinyint(10)",
"mediumint(10)",
"int(10)",
"longblob",
"binary",
"bit",
"datetime",
"tetypet",
"smallint(5) unsigned NOT NULL")
map_SQLtypes_to_Rtypes(SQLtypes)
# varchar(255) int(10) double
# "character" "integer" "numeric"
# decimal char(1) tinyint(10)
# "numeric" "character" "integer"
# mediumint(10) int(10) longblob
# "integer" "integer" "blob"
# binary bit datetime
# "numeric" "numeric" "numeric"
# tetypet smallint(5) NOT NULL
# "numeric" "integer"
This will fail for some of the MySQL types not recognized by SQLite (e.g. set
) but I think that in the context where map_SQLtypes_to_Rtypes()
will be used (i.e. UCSCFeatureDbTableSchema()
) these problematic types won't be seen so we should be good.
Two more things:
Why the direct slot access (res@listData$SQL.type
) to extract a column when you can use the standard data.frame API for that (e.g. res$SQL.type
or res[["SQL.type"]]
or res[ , "SQL.type"]
)?
Various examples in the man page for makeFeatureDbFromUCSC()
trigger the deprecation warning:
supported_tables <- supportedUCSCFeatureDbTables(genome="mm10", track="qPCR Primers")
# Warning message:
# In .local(x, ...) :
# 'track' parameter is deprecated now you go by the 'table' instead
# Use ucscTables(genome, track) to retrieve the list of tables for a track
fdb <- UCSCFeatureDbTableSchema(genome="mm10", track="qPCR Primers", tablename="qPcrPrimers")
Warning messages:
# 1: In .local(x, ...) :
# 'track' parameter is deprecated now you go by the 'table' instead
# Use ucscTables(genome, track) to retrieve the list of tables for a track
# 2: In .local(x, ...) :
# 'track' parameter is deprecated now you go by the 'table' instead
# Use ucscTables(genome, track) to retrieve the list of tables for a track
Please make sure to address them all.
Thanks again for your work on this.
Hi @hpages, Thanks for the detailed illustration of another approach to extract type information. I have updated the type mapping approach as you suggested, which looks more elegant. And along with address both of your concerns. Can you review it again?
Hi @sanchit-saini,
Changes look good but we still have one problem:
> fdb <- makeFeatureDbFromUCSC(genome="mm10", track="qPCR Primers", tablename="qPcrPrimers")
Download the qPcrPrimers table ... Error in handleResponseForOutputTypes(response, tableName, query) :
Unsupported file format : bedDetail
This example is from makeFeatureDbFromUCSC
's man page. It works in release (GenomicFeatures 1.42.1, from BioC 3.12).
Please make sure that all examples in makeFeatureDbFromUCSC
's man page work as expected. Ideally you'd also want to run R CMD build
followed by R CMD check
on the package.
H.
Hi @hpages, It passed all the checks on my system. I investigated a bit to find what caused the failure and find out it is occurring because the binary package is not built against the latest source code in rtracklayer. And the reason for the outdated binary package might be failed checks and errors in the build. I will look into the rtracklayer. For now, you can verify it by following steps:
git clone https://git.bioconductor.org/packages/rtracklayer
R CMD INSTALL rtracklayer
R CMD build GenomicFeatures
R CMD check GenomicFeatures_1.43.3.tar.gz
I'm on Linux (Ubuntu) so everything I install gets installed from source. I installed the latest available rtracklayer with BiocInstaller::install("rtracklayer")
:
> library(rtracklayer)
> sessionInfo()
R Under development (unstable) (2021-02-05 r79941)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /home/hpages/R/R-4.1.r79941/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79941/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] rtracklayer_1.51.4 GenomicRanges_1.43.3 GenomeInfoDb_1.27.6
[4] IRanges_2.25.6 S4Vectors_0.29.7 BiocGenerics_0.37.1
loaded via a namespace (and not attached):
[1] rstudioapi_0.13 XVector_0.31.1
[3] zlibbioc_1.37.0 GenomicAlignments_1.27.2
[5] BiocParallel_1.25.4 lattice_0.20-41
[7] rjson_0.2.20 tools_4.1.0
[9] grid_4.1.0 SummarizedExperiment_1.21.1
[11] Biobase_2.51.0 matrixStats_0.58.0
[13] yaml_2.2.1 crayon_1.4.1
[15] BiocIO_1.1.2 Matrix_1.3-2
[17] GenomeInfoDbData_1.2.4 restfulr_0.0.13
[19] bitops_1.0-6 RCurl_1.98-1.2
[21] DelayedArray_0.17.9 compiler_4.1.0
[23] MatrixGenerics_1.3.1 Biostrings_2.59.2
[25] Rsamtools_2.7.1 XML_3.99-0.5
Maybe what happens is that rtracklayer needs a version bump? Looks like there were several important changes after the last version bump from Jan 13. This would explain why my rtracklayer 1.51.4 is not the same as your rtracklayer 1.51.4.
H.
Maybe what happens is that rtracklayer needs a version bump?
Yes, you are right. "Changes made to GIT without a corresponding version bump do not propagate to the repository visible to BiocManager::install()" from https://bioconductor.org/developers/how-to/version-numbering/ Thanks for suggesting it. I did not know about this. I need to read the Bioconductor developer resource more thoroughly.
I can confirm that after installing the latest rtracklayer from git.bioconductor.org, the makeFeatureDbFromUCSC
examples work again and this patch passes R CMD check
cleanly (except for one warning that is unrelated to the patch). So as soon as rtracklayer's version gets bumped, we'll be good to go. Thanks!
It is done, we are good to go now.
Thanks!
Hello @hpages
GenomicFeatures package was affected in two places because of recent rtracklayer changes: 1.
checkTable()
which is fixed with https://github.com/Bioconductor/GenomicFeatures/commit/efd665c82b86a8f060627234f34b435a3858d4452.
UCSCFeatureDbTableSchema()
usesucscSchema()
to get example data of a table then used to guess R type from the example data. AsucscSchema()
now utilizes UCSC REST API that does not provide any example data, but it provides a JSON type that might be helpful to generate a mapping between R types. That might be easy compared to the SQL type as the javascript type system is relatively simple compared to SQL.Example:
What are your thoughts about this approach?