PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Integer64 and the RPostgres package #2818

Open bcow opened 3 years ago

bcow commented 3 years ago

Over two years ago, @ashiklom opened PR https://github.com/PecanProject/pecan/pull/2318, which raised the issue that database connections can return values of class integer64 instead of integer. Without code to catch and convert them, the returned values become meaningless and break downstream code. For example, until they are converted to integer or numeric, they cannot be coerced to character, which breaks logger messages and further database queries. As of now, the only way I can find to consistently ensure bigints are handled correctly is to use:

con <- dbConnect(
  RPostgres::Postgres(),  # specifically use Postgres
  bigint = "integer",     # might be optional, but I argue should be set explicitly
  ... # all other settings unchanged
)
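
A quick way to see the effect (table and column names here are just hypothetical) is to check the class of a returned ID column:

ids <- DBI::dbGetQuery(con, "SELECT id FROM runs LIMIT 5")
class(ids$id)
# With the RPostgres default (bigint = "integer64") this is "integer64";
# with bigint = "integer" it comes back as a plain "integer".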

Right now, the betyConnect function sidesteps this issue by hardcoding the driver to "Postgres" https://github.com/PecanProject/pecan/blob/ee38544e1c995f47073019bd31b3179e7e64991a/base/db/R/query.dplyr.R#L26 and then calling db.open, but that isn't where the problem should be addressed: multiple functions call db.open directly (for example, many of the db.query calls made in met.process, which is one reason it's breaking!).

Furthermore, in db.open, if no driver is specified, the driver defaults to PostgreSQL, not Postgres: https://github.com/PecanProject/pecan/blob/ee38544e1c995f47073019bd31b3179e7e64991a/base/db/R/utils_db.R#L203 You will also see that many of the pecan.xml files generated through the browser interface have PostgreSQL set as the driver under settings$database$bety.

The bigint = "integer" argument does not apparently work with PostgreSQL, so an alternative approach would have to be taken to make sure all integer64 were converted to integer. I did write a simple conversion function, which I can add as a PR, but that's adding functions (and would also have to include the use of an additional package, the bit64 package.)

What are people's thoughts on how to proceed? Should Postgres be the default in db.open? I don't fully understand the difference between the two packages and that's where I need other people's input.

ashiklom commented 3 years ago

Thanks for the report!

I would argue that Postgres should be the default. RPostgres is actively maintained and fully DBI compliant, so it keeps up with all the dplyr database tooling and, more importantly, has important features like the ability to run parameterized queries. RPostgreSQL has far fewer features and, AFAICT, hasn't been updated in the last 4 years.

I tried to address some of these integer64 issues a while ago in https://github.com/PecanProject/pecan/pull/2318, but it seems I may have missed some.

they also cannot be converted to character and break logger messages

The way I've always addressed the string part of this (as in the PR above) is by wrapping ID fields in a call to format. I think that works because it dispatches to something like a format.integer64 (or maybe format.bit64) method that understands long integers.
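
To make that concrete, a small sketch (the ID value and table name are made up):

id <- bit64::as.integer64("1000000021126")            # hypothetical BETY-style ID
msg <- paste("Processing run", format(id))            # format() keeps the digits intact
sql <- sprintf("SELECT * FROM runs WHERE id = %s", format(id))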

...and further database queries

Where we have to use string interpolation, we should just make sure to use format. That said, we should really use parameterized queries / prepared statements instead -- they're safer, more robust to type mismatches like this, I think more performant, and more elegant. Our db.query function already has support for this.
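
As a rough illustration of the parameterized form (table and column names are hypothetical, and db.query's own interface may differ), with RPostgres the placeholders are $1, $2, ... and values are passed via params:

workflow_id <- 99000000001  # hypothetical ID
res <- DBI::dbGetQuery(
  con,
  "SELECT * FROM runs WHERE workflow_id = $1",
  params = list(workflow_id)
)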

All the higher-order dplyr-based queries should be immune to these issues.

bcow commented 3 years ago

Great, I'll also put in my vote for making Postgres the default. I'll wait for more input, but once we come to a consensus I can take on making these changes. As I clean up the benchmarking code and the changes I've made to met.process, I'll check the effects on all related database interactions. While I was finishing my dissertation I essentially hardcoded everything to Postgres in db.open, so I've unofficially already done that. Also, as I've been going along I've been translating to dplyr / dbplyr whenever possible, so once the integer64 problem is fixed there should be some much more reliable sections of code.

robkooper commented 3 years ago

Some thoughts: I think we should pick one or the other, whichever is best supported.

We should set the correct driver as the default at https://github.com/PecanProject/pecan/blob/develop/base/db/R/utils_db.R#L205 and, if the other driver is specified, print a warning and switch to the correct one.
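
Something along these lines (just a sketch of the idea, not the actual db.open code; params stands in for whatever the connection settings object is called):

if (is.null(params$driver)) {
  params$driver <- "Postgres"
} else if (params$driver == "PostgreSQL") {
  warning("Driver 'PostgreSQL' (RPostgreSQL) is deprecated; using 'Postgres' (RPostgres) instead.")
  params$driver <- "Postgres"
}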

Once that's done, we should remove the unused package so it is no longer installed in the PEcAn Docker image.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 365 days with no activity.

jcpayne commented 1 year ago

I just ran into this problem when executing a SQL command that counted rows in a large table. I'm no expert, but on Stack Overflow, IRTFM commented that "The integer64 class is created by the bit64 package. Need to recover the data without mangling by using functions from the package that created the integer64 object." That's very clunky, though, and adding bigint = "integer" seems to work, although it's also a bit clunky.
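
For what it's worth, here is the pattern that worked for me, sketched out (connection details elided, table name hypothetical):

con <- DBI::dbConnect(
  RPostgres::Postgres(),
  bigint = "integer"  # count() and other bigint results come back as plain integers
  # plus dbname, host, user, password as usual
)
n <- DBI::dbGetQuery(con, "SELECT count(*) AS n FROM traits")$n
class(n)  # "integer" rather than "integer64"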