geomarker-io / riseup_geomarker_pipeline

A Multi-Modal Geomarker Pipeline for Assessing the Impact of Social, Economic, and Environmental Factors on Pediatric Hospitalization
1 stars 0 forks source link

RISEUP Geomarker Pipeline

This is a pipeline for appending place- and date- based geomarker data from multiple data sources at multiple geographic and temporal resolutions, with a specific focus on Hamilton County, Ohio:

%%{init: { "fontFamily": "arial" } }%%

graph TD

classDef id fill:#fff,stroke:#000,stroke-width:1px;
classDef input fill:#fff,stroke:#000,stroke-width:1px;
classDef data fill:#e8e8e8,stroke:#000,stroke-width:1px;

hosp[(hospitalizations)]:::input
hosp --> date(date related to address):::id
hosp -->  address(cleaned and parsed \naddress components):::input

daily("weather, air quality, pollen/mold, \nseasonality, instructional days"):::data

date --join by date --> daily

address --street range \ngeocoding--> geocode(geocoded \ncoordinates):::id
address --probabilistic \nmatching--> parcel(parcel \nidentifier):::id

pd(point-level \ndata resources, e.g.,\n greenspace, traffic,\n air pollution, \nhospital access):::data

tract_data(tract-level data resources, e.g.,\nChild Opportunity Index,\n Material Deprivation Index):::data
geocode --spatial intersection \nwith census tract --> tract_data
geocode -- geomarker \nassessment library --> pd

parcel_data(tax auditor databases,\nhousing code violations,\nhousing health pedigree):::data
parcel --join by parcel identifier--> parcel_data

Data

See the metadata for information about specific data elements and sources.

Notes:

Time Data Availability Notes

Data Min Date Max Date Frequency Data Availability Lag
AQI 2015-01-01 2022-10-01 daily 6 months
weather 2015-01-01 2022-09-30 daily 6 months
pollen and mold 2021-02-17 2022-12-09 random (skips weekends, some random days missing) unknown
shotspotter (only for Avondale, E. Price Hill, and W. Price Hill) 2017-08-16 2023-01-03 daily daily

Running & Developing

  1. Clone github repository to destination; manually move input health data into place (DR1767_r2.csv)
  2. Install pak (install.packages("pak", repos = sprintf("https://r-lib.github.io/p/pak/stable/%s/%s/%s", .Platform$pkgType, R.Version()$os, R.Version()$arch)))
  3. Install all packages from DESCRIPTION file by running pak::pak() or remotes::install_deps() in the project root from R. (If you are on a linux machine, speed up installation to use binaries hosted by Posit, by setting options("repos" = c("CRAN" = "https://packagemanager.rstudio.com/all/__linux__/jammy/latest")), substituting jammy for your specific linux version.)
  4. Install required python libraries. (Use reticulate::py_config() to check on available python environments):
    reticulate::py_install("usaddress", pip = TRUE)
    reticulate::py_install("dedupe", pip = TRUE)
    reticulate::py_install("dedupe-variable-address", pip = TRUE)
  5. Use make to create targets defined in Makefile or make tdr to create the final output as a tabular data resource. docker is required to run the geocode and geomark targets.

To specify the python executable to use for the pipeline without using R, set an environment variable in the shell being used to call make:

export RETICULATE_PYTHON=~/.virtualenvs/r-parcel/bin/python
make all

Notes: