Decouple geom and attr spec logic away from R, towards generic JSON

OHDSI / GIS

https://ohdsi.github.io/GIS

Apache License 2.0

Decouple geom and attr spec logic away from R, towards generic JSON #285

Open kzollove opened 12 months ago

kzollove commented 12 months ago

[ ] Determine generic JSON specification definition
[ ] Create R functionality to translate
[ ] Create crosswalk between JSON specs and language specific (R, Py, Bash, SQL)
[ ] Update gaiaR functionality
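
A minimal sketch of what the first and third checklist items might look like together: the spec is plain JSON, and a small crosswalk maps each generic operation to a language-specific template. The spec keys (`target`, `op`, `args`), the operation name `lpad`, and the templates are all hypothetical and only illustrate the idea; none of them is an agreed format.

```python
import json

# Hypothetical generic spec entry: one target column described by one operation.
SPEC_JSON = """
{
  "target": "geom_join_column",
  "op": "lpad",
  "args": {"column": "State Code", "width": 2, "pad": "0"}
}
"""

# Hypothetical crosswalk: one template per (operation, language) pair.
CROSSWALK = {
    "lpad": {
        "sql": "LPAD(CAST(\"{column}\" AS TEXT), {width}, '{pad}')",
        "r": "stringr::str_pad(`{column}`, width={width}, pad='{pad}')",
    },
}


def render(spec: dict, language: str) -> str:
    """Render one spec entry as an expression in the requested language."""
    template = CROSSWALK[spec["op"]][language]
    return template.format(**spec["args"])


spec = json.loads(SPEC_JSON)
print(render(spec, "sql"))
print(render(spec, "r"))
```

Keeping the templates in a lookup table rather than in code means a new target language is only a new column in the crosswalk, which is roughly what the third checklist item asks for.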

rtmill commented 7 months ago

TODO:

Elaboration of SQL approach

Robert Miller (Guest), can you update this, and use this issue as necessary? I would suggest a high-level approach, but use that ticket however you'd like to organize around this effort.

Example common complexities to be addressed:

Data Source:

geom_local_epsg=sf::st_crs(staged, parameters=TRUE)$epsg
geom_name=dplyr::select(sf::st_drop_geometry(staged), n = if('NAME' %in% colnames(staged)) 'NAME' else 'NAMELSAD')$n
geom_local_value=sf::st_as_binary(sf::st_as_sf(staged, coords=c('Latitude', 'Longitude'))$geometry)
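
To make the geometry complexities above concrete, here is a rough sketch of how the column-name fallback (`NAME` else `NAMELSAD`) and the point construction could be declared generically and rendered as PostGIS SQL. The spec keys (`name_candidates`, `point`, `srid`), the function names, and the table name `staged` are hypothetical. The fixed SRID of 4326 is also an assumption, since the R version reads the EPSG code from the data itself.

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def first_available(candidates, available):
    """Return the first candidate column that exists in the staged data."""
    for candidate in candidates:
        if candidate in available:
            return candidate
    raise KeyError(f"none of {candidates} found in staged columns")


def geom_select_sql(spec: dict, staged_columns: list, table: str) -> str:
    """Build a SELECT producing geom_name, geom_local_value and geom_local_epsg."""
    name_col = first_available(spec["name_candidates"], staged_columns)
    parts = [f"{quote_ident(name_col)} AS geom_name"]
    point = spec.get("point")
    if point is not None:
        # PostGIS ST_MakePoint takes x (longitude) first, then y (latitude).
        x = quote_ident(point["x"])
        y = quote_ident(point["y"])
        srid = int(point["srid"])  # assumed to be declared in the spec
        parts.append(
            f"ST_SetSRID(ST_MakePoint({x}, {y}), {srid}) AS geom_local_value"
        )
        parts.append(f"{srid} AS geom_local_epsg")
    return f"SELECT {', '.join(parts)} FROM {quote_ident(table)}"


GEOM_SPEC = {
    "name_candidates": ["NAME", "NAMELSAD"],
    "point": {"x": "Longitude", "y": "Latitude", "srid": 4326},
}

sql = geom_select_sql(GEOM_SPEC, ["NAMELSAD", "Latitude", "Longitude"], "staged")
print(sql)
```

The fallback is resolved against the staged table's actual columns at translation time, so the generated SQL stays a plain `SELECT` with no conditional logic in it.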

rtmill commented 7 months ago

Similar to the above, example complexities for attributes.

Variable source (same example, in two pieces):

1) "dplyr::filter(staged,`Defining Parameter`=='Ozone')",
2) "dplyr::mutate(staged,geom_join_column=paste0(stringr::str_pad(`State Code`,width=2,pad=0),stringr::str_pad(`County Code`,width=3,pad=0)),

Another (handling hard coding in general): ... mutate(staged,geom_join_column=FIPS, attr_concept_id=2000000001, attr_start_date=as.Date('2018-01-01'),attr_end_date=as.Date('2018-12-31'),
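
As a sketch of how these attribute cases might translate, the snippet below combines the filter, the zero-padded join column, and the hard-coded constants from the examples above into one hypothetical spec and renders it as SQL. The spec layout (`filter`, `join_parts`, `constants`, `cast`) and the helper names are assumptions made for illustration. Merging the two examples into a single statement is also only for brevity.

```python
def quote_ident(name: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def sql_literal(value) -> str:
    """Render a Python int/float/str as a SQL literal."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


# Hypothetical attribute spec mirroring the R examples.
ATTR_SPEC = {
    "filter": {"column": "Defining Parameter", "equals": "Ozone"},
    "join_parts": [
        {"column": "State Code", "width": 2},
        {"column": "County Code", "width": 3},
    ],
    "constants": [
        {"name": "attr_concept_id", "value": 2000000001},
        {"name": "attr_start_date", "value": "2018-01-01", "cast": "DATE"},
        {"name": "attr_end_date", "value": "2018-12-31", "cast": "DATE"},
    ],
}


def attr_select_sql(spec: dict, table: str) -> str:
    """Build a SELECT with a padded join column, constants and a filter."""
    padded = [
        f"LPAD(CAST({quote_ident(p['column'])} AS TEXT), {int(p['width'])}, '0')"
        for p in spec["join_parts"]
    ]
    columns = [" || ".join(padded) + " AS geom_join_column"]
    for const in spec["constants"]:
        literal = sql_literal(const["value"])
        if "cast" in const:
            literal = f"CAST({literal} AS {const['cast']})"
        columns.append(f"{literal} AS {const['name']}")
    flt = spec["filter"]
    where = f"{quote_ident(flt['column'])} = {sql_literal(flt['equals'])}"
    return f"SELECT {', '.join(columns)} FROM {quote_ident(table)} WHERE {where}"


attr_sql = attr_select_sql(ATTR_SPEC, "staged")
print(attr_sql)
```

Hard-coded values become ordinary typed entries in the spec, so they need no special handling beyond an optional cast.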

rtmill commented 7 months ago

Third item to specify:

Lay out the implications of staging source data in a database (PostGIS) rather than the current approach of keeping it in memory:

- General approach to staging tables; (?) add a parameter to specify whether staged source data is persisted or wiped afterwards
- Pros and cons of requiring that source data already be in a database table; the translation processes' dependency on this
  - Which parameters need to be routinely provided to ogr to ensure data is ingested using consistent data types? How complex a task is this?
  - Significance when pulling data incrementally from a larger source, e.g. from APIs, and how to account for that in our staging design
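
For the ogr question, a rough sketch of the parameters that might be pinned on every ingestion call is shown below. It only builds the `ogr2ogr` argument list and does not run it. The function names, the defaults, and the `persist` parameter are hypothetical. The flags themselves (`-f`, `-nln`, `-lco GEOMETRY_NAME`, `-t_srs`, `-overwrite`, `-append`, and the CSV open option `AUTODETECT_TYPE`) are standard GDAL options, but which of them the project should fix is exactly the open question.

```python
import shlex
from typing import List, Optional


def build_ogr2ogr_args(
    source_path: str,
    pg_conn: str,
    staging_table: str,
    target_srs: str = "EPSG:4326",
    incremental: bool = False,
    csv_autodetect_types: Optional[bool] = None,
) -> List[str]:
    """Build an ogr2ogr command line that loads a source into a staging table."""
    args = [
        "ogr2ogr",
        "-f", "PostgreSQL",
        "-nln", staging_table,           # fixed staging table name
        "-lco", "GEOMETRY_NAME=geom",    # consistent geometry column name
        "-t_srs", target_srs,            # reproject to one agreed CRS
        # Incremental pulls (e.g. paged API results) append; otherwise replace.
        "-append" if incremental else "-overwrite",
    ]
    if csv_autodetect_types is not None:
        # CSV-only open option: NO keeps every field as text (preserves
        # leading zeros in codes such as FIPS); YES lets GDAL guess types.
        flag = "YES" if csv_autodetect_types else "NO"
        args += ["-oo", f"AUTODETECT_TYPE={flag}"]
    args += [f"PG:{pg_conn}", source_path]
    return args


def cleanup_sql(staging_table: str, persist: bool) -> Optional[str]:
    """Return a DROP statement unless the staged data should be kept."""
    if persist:
        return None
    quoted = '"' + staging_table.replace('"', '""') + '"'
    return f"DROP TABLE IF EXISTS {quoted}"


cmd = build_ogr2ogr_args(
    "ozone.csv", "dbname=gaia", "staged_ozone", csv_autodetect_types=False
)
print(shlex.join(cmd))
print(cleanup_sql("staged_ozone", persist=False))
```

Treating append versus overwrite as a per-call choice is one way to cover incremental pulls, while the `persist` flag covers the persisted-or-wiped question from the first bullet.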

Fourth item: adding clarity on the "phases" that were mentioned and how the specific functionality falls under each:

1) Ingestion (CLI, ogr, others)
2) Translation (can we do this comprehensively in SQL?)
3) Extraction/population of exposure occurrence (arguably out of scope for this conversation)