deepayan / nhanes-postgres

A docker container for NHANES data inside a PostgreSQL DBMS
0 stars 0 forks source link

NHANES in Postgres

This repository contains R code and a Docker image definition that facilitate pulling the CDC's NHANES data files into a SQL DBMS. It is inspired by the this project, retaining its design but using a different implentation.

The image is derived from rocker/tidyverse, which is an Ubuntu GNU/Linux image containing R and RStudio Server, and adds a PostgreSQL database containing data from NHANES.

Quick start

To run the current pre-built image, install docker and run

docker run --rm --name nhanes-pg -d \
    -p 8787:8787 \
    -p 2200:22 \
    -p 5432:5432 \
    -e 'CONTAINER_USER_USERNAME=test' \
    -e 'CONTAINER_USER_PASSWORD=test' \
    deepayansarkar/nhanes-postgresql:0.10.1

To map a local directory so that it becomes accessible within the container, add

         -v <LOCAL_DIRECTORY>:/HostData \

This is the easiest way to get started. The various -p flags exposes useful ports in the container:

Building the docker image

Outline

In principle, we can build the docker image using the following, but for the recommended approach, see the version using time below.

docker build --progress plain --shm-size=1024M --platform=linux/amd64 --tag nhanes-postgres -f Container/Dockerfile .

In principle, one could also build on ARM64 (typically on an Apple Silicon machine) using the following,

docker build --progress plain --shm-size=1024M --platform=linux/arm64 --tag nhanes-postgres -f Container/Dockerfile .

but this is not currently possible due to limitations of the parent image (rocker/tidyverse).

Specifying data source

A major part of the build process is to collect NHANES data (specifically, the raw SAS format tables, the HTML docs, and the codebook other metadata extracted from these docs) and insert them into a database. Although these could be obtained directly from the NHANES website maintained by CDC, we prefer to use an intermediate snapshot with a well-defined collection date. This avoids any ambiguity regarding which datasets (and which versions thereof) are included in the docker image.

From the perspective of the build process, the contents of this snapshot are expected to be available through a web server. This serves two purposes: (a) In principle, the docker build can work without external dependencies as long as a reliable internet connection is available. (b) Accessing web resources is easy during the docker build process, whereas accessing local files in the build machine is not as easy.

However, depending on web access has the disadvantage that it slows down the build process. To avoid this, we can check out the snapshot repository and serve it locally by running an appropriate local server. One implementation, using the R httpuv package, is available here.

The code below currently assumes that a local server has been initiated using this code.

FIXME: Allow this to be controlled via an environment variable.

Versioning

Before starting, make sure to edit the files COLLECTION_DATE and CONTAINER_VERSION if necessary. The COLLECTION_DATE should be a copy of the corresponding file in the snapshot being used to populate the database.

Building

It may be useful to redirect the console output to a log file, as it will contain informative messages about failures. This may be done using something like

export CDATE=`cat COLLECTION_DATE`
export CVERSION=`cat CONTAINER_VERSION`
time docker build --progress plain --shm-size=2048M --platform=linux/amd64 --tag nhanes-postgres --build-arg COLLECTION_DATE=${CDATE} --build-arg CONTAINER_VERSION=${CVERSION} -f Container/Dockerfile . &> build.log

To upload to docker hub:

echo ${CVERSION}
docker tag nhanes-postgres <user>/nhanes-postgresql:${CVERSION}
docker push <user>/nhanes-postgresql:${CVERSION}

Running

Run the locally built image as before, replacing the image name:

docker run --rm --name nhanes-pg -d \
         -p 8787:8787 \
         -p 2200:22 \
         -p 5432:5432 \
         -e 'CONTAINER_USER_USERNAME=test' \
         -e 'CONTAINER_USER_PASSWORD=test' \
         nhanes-postgres

As before, to map a local directory, add

         -v <LOCAL_DIRECTORY>:/HostData \

Developer notes

Salient features:

Changes

The DB schema and other interface details are largely similar to the SQL Server implementation. However, Postgres has one big difference: unquoted table and column names in SQL statements are automatically converted to lowercase, which means we need to change how queries are constructed.

This is mainly an issue with the hard-coded SQL queries in nhanesA. Work on this is being done in the postgres-backend branch of https://github.com/cjendres1/nhanes.

Postgres vs MS SQL Server

Use RPostgres to establish connections, with port 5432 by default. For example:

DBI::dbConnect(RPostgres::Postgres(),
               dbname = "NhanesLandingZone",
               host = "localhost",
               port = 5432L,
               password = "NHAN35",
               user = "sa")

There is no need to install separate drivers, as in the case of Microsoft SQL Server.

Efficient DB operations

See https://www.postgresql.org/docs/current/populate.html.

We follow two of these recommendations explicitly:

We should explore others as well.

Installing pgAdmin

See https://www.pgadmin.org/download/

Other Thoughts