OHDSI / ResultModelManager

RMM is an R package designed to handle common ohdsi results data management functions by providing a common API for data model migrations and definitions
https://ohdsi.github.io/ResultModelManager/
Apache License 2.0
3 stars 3 forks source link

Using readr coltypes on input #66

Closed azimov closed 3 weeks ago

azimov commented 1 month ago

Reading coulmn types has created a number of downstream issues for inserting data that may be resolved using the column sepecifications. Therefore, It has been proposed that we use the readr::col_ convertors to do this, which may resolve issues. When experimenting with this in the past I found that a number of packages are inconsistent with how they export the fields (most notably CohortDiagnostics, where a lot of the input cleanups come from).

The ideal approach would be to fix the packages, but the current insert handles a surprising number of edge cases that have arisen over the years resulting in a messy but largely resilient solution.

Proposed solutions

A test branch will be created that implements the parser from readr:

The documentation is fairly striaghtforward.

col_types
One of NULL, a cols() specification, or a string. See vignette("readr") for more details.
If NULL, all column types will be inferred from guess_max rows of the input, interspersed throughout the file. This is convenient (and fast), but not robust. If the guessed types are wrong, you'll need to increase guess_max or supply the correct types yourself.
Column specifications created by list() or cols() must contain one column specification for each column. If you only want to read a subset of the columns, use cols_only().
Alternatively, you can use a compact string representation where each character represents one column:
c = character
i = integer
n = number
d = double
l = logical
f = factor
D = date
T = date time
t = time
? = guess
_ or - = skip

Solution 1 use a string mapping

To do this we we will need a conversion from the database types to the native r types (e.g. varchar = c, integer = i, date = D, datetime = t... etc)

The problem here is that there are some cases where the data model is wrong (e.g. integer vs bigint) and it really doesn't matter because the database is largely type safe (e.g. inserting 22.0 into an int field is fine on most platforms and will only error if you insert 22.0001).

I'm also hesitant to do anything with dates or datetimes because the r conversion tends to screw up.

I see no reason to use logical or double even if the db fields match these as numeric is fine. Factors will also be ignored.

solution 2 if its a varchar it a char, otherwise guess

This is less messy but will likely solve edge cases around numbers. The main issue here is with empty result sets, but these will be skipped on insert anyway.