gowthamrao commented 5 years ago

Geospatial functionality in Atlas: integration of Aegis

Authors: Gowtham Rao, Seng Chan You, Jaehyeong Cho, Andrew Williams, Robert Miller, Pavel Grafkin, Gregory Klebanov

Motivation: During the 2018 OHDSI Symposium, Washington DC USA - J Cho, SC You, K Kim, Y Soh, D Kim, RW Park - presented a software demonstration called 'Application for Epidemiological Geographic Information System (AEGIS) - An open source spatial analysis tool based on CDM' . See AEGIS under related work.

AEGIS, was built to support 5.x version of the OMOP CDM. This version did not have latitude and longitude in location table. AEGIS developers used observation and fact_relationship table to design AEGIS using CDM 5.x. The OMOP CDM 6+ (released October 2018) has two location tables (location and location_history). The location table has fields for latitude and longitude. These new fields may be used to represent precise location of persons, providers or care_sites.

During the presentation, a decision was made to upgrade AEGIS to support CDM 6.x with new location table, and to evaluate if it was possible to integrate AEGIS like functionality, and the work of the OHDSI GIS workgroup into ATLAS.

Background: Spatial epidemiology is the description, analysis or surveillance of a populations health related factors such as medical service, diseases, in relation to other person level or area level factors like demographic, environmental exposure, behavioral determinants, socio-economic indicators, genetic and infectious risk factors. Two types of spatial epidemiology are discussed below.

Descriptive mapping, widely used in spatial epidemiology, is useful for establishing initial hypotheses about the patterns of incidence/prevalence in an area, or the correlation between exposure to specific factors and disease.

Cluster detection is a more advanced statistical method that may reveal geographic clusters, based on patterns and spatial correlation.

AEGIS is described here

gowthamrao commented 5 years ago

Use-cases:

gowthamrao commented 5 years ago

Requirement 1: Area shape selection:

We need a way to select geographic areas (polygon) e.g. country, state, county, postal code area, city, census tract, block etc. This will be the unit of analysis.
These administrative areas maybe predefined/precordinated as standardized vocabulary in omop CDM.
Predefined administrative areas may be related to each other thru hierarchy e.g. area polygon fully within another area polygon, area polygon partly within another area polygon.
Source of shape files: For standardized areas, we need a shapefile source that gives sufficient detail of administrative units, and is open source/free with liberal license. Current version of AEGIS uses shapefiles from GADM https://gadm.org/ GADM supports 3 levels of hierarchy, but has a restrictive license term - that prevents use by the broader community. Alternatives under consideration: TIGER census files (US only), Open Street Map.
In advanced version, custom areas may be defined by the analyst. We need a way to support user defined shape polygons.

gowthamrao commented 5 years ago

Requirement 2: Map tile visualization as background for shapes:

Shapes with background tile: area shapefiles may have background tiles like roads, areas of interest, points of interest, labels, roads etc. e.g. Google maps tiles, Openstreetmap tiles.
Zoom in and Zoom out of areas - with increasing or decreasing detail depending on zoom level.
These background tiles should be local, needing no internet calls. This is because in most instances internet calls are not allowed because of security and privacy rules. For those that allow internet calls, support just be enabled for internet based API calls like Google maps API.

gowthamrao commented 5 years ago

Requirement 3: Cohort definition using location table:

We would like circe-be to be enhanced to support location based criteria. Be able to select administrative areas as a criteria in circe-be + Atlas.
1. Use administrative areas as part of cohort entry event or cohort inclusion criteria.
2. Administrative areas must be derived from latitude and longitude in location table using look-up (point in polygon function, point is person's lat/long - polygon is the administrative area).
3. Administrative area criteria may be specified thru text string search of the field city, state, zip, county, country of location table. This is similar to searching for LOT_NUMBER in drug_exposure as part of circe-be.
4. We would like to calculate the distance of person.location_id from provider.location_id or care_site.location_id as part of cohort definition, e.g. We would like to build a cohort of subject_id who live more than 30 miles/kms from the provider.

Example in Atlas/circe-be: that should be supported at both cohort entry events (qualified events) and inclusion criteria (inclusion events) are

where the care_site.location_id belonged to administrative area e.g. Lat/long was in the selected geographic area (in statement)
where the person.location_id belonged to administrative area e.g. Lat/long was in the selected geographic area (in statement)
where the care_site.location_id was within x units of distance from person.location_id (eucledian distance between two points defined by lat/long)

In all cases above, the location_id is inferred by using the earliest valid location_id for the person_id/care_site_id/provider_id based on interval period of the qualified/inclusion event thru look-up to location_history

gowthamrao commented 5 years ago

Requirement 4: Cohort profiles summary (Visualization)

Once a cohort is constructed in Atlas, it would be nice to visualize the entire cohort on a map under Cohort Definition -> Generation -> View reports.
Two forms of visualization are requested.

Marker/dot distribution map: Each subject_id is a marker on a map (e.g. one-to-one dot distribution map) with marker clustering to (e.g. one-to-many dot density distribution map) we would like the option of clustering the markers using a visualization detailed here. This type of clustering is a form of reactive map visualization, and not a true cluster geospatial analysis.
1. When cluster spread to reveal one single subject, we would like to hover over the marker and get a tool-tip that shows some information about the person such as person_id, age, gender.
2. Click/double-click/right click on that marker should open a new tab that reveals person-profile of that subject. See person profile below.
Density map: This is the process of creating contours on a map based on 2D representation of dots on a map, where each dot is a subject_id. This visualization technique is described Binning is performed similar to histogram, but in a 2D space. Bins may be square or hexagon (hexbin), with color of each bin determined by either count (or for continuous feature per person - sum of the features). Contours are drawn to represent boundaries of similar distributions. D3 examples are here.

gowthamrao commented 5 years ago

Requirement 5: Person profile summary (Visualization)

In person profiles page of Atlas, add functionality of rendering map of geographic location of the person_id being profiled. All we need here is one marker for the person being profiled on a map.

gowthamrao commented 5 years ago

Requirement 6: Enhance cohort characterization using location table/domain:

We would like to add location domain to the tables and graphs in Atlas cohort characterization, i.e. in addition to characterizing a cohorts demographics, condition, procedure, drug etc. we would like to characterize the administrative areas. Example: Table 1 (with count of people by administrative area)

Demographics
	Male	340 (34%)
	Female	660 (66%)
Age
	00 - 10	100 (10%)
	10 - 20	200 (20%)
	20 - 30	140 (14%)
Geographic location (Stage)
	California	100 (10%)
	Illinois	200 (20%)
	South Carolina	300 (30%)

We would also like to characterize the euclidean distance between a subjects loction_id and their providers location_id in miles/kms as continuous value in cohort characterization. E.g. average/distribution distance from provider for each area.

gowthamrao commented 5 years ago

Requirement 7: Cohort + area characterization:

We need to perform cohort characterization by dividing a cohort into multiple strata based on the area the subjects in the cohort belong to, i.e. every cohort + area combination is treated as a separate cohort and cohort characterization is performed.
To achieve this we need a logic to assign geographic area per subject_id using person's latitude and longitude - point inside a polygon logic, where polygon is the administrative area.
Create Choropleth plots for each feature/characteristic - one map for each characteristic. Colors are assigned based on the feature of interest.
1. Binary feature (Yes/No) - One map per feature, yes are reported by area.
2. Categorical/ordinal feature (> 1 level) - number of maps = number of levels in categorical feature.
3. Continuous feature (distribution average) - each areas average is used.

gowthamrao commented 5 years ago

Requirement 8: Choropleth map

We would like to be render characteristics about a cohort, rendered as a Choropleth map. This type of visualization is described here. Its different from heat-map because, heat-map does not use geographical boundaries. D3 examples of Choropleth are hereand here.
We would like to do both cohort characteristics and cohort comparison.
1. If only one cohort is selected: Choropleth is created for the one cohort for each selected characteristic/feature.
2. If two cohorts are selected:
  1. We will use the incident rate analysis framework to specify one cohort as target cohort, and other as outcome cohort - specify an at risk window/study window specification.
  2. Choropleth is created for each of the selected cohorts that meets the incident rate analysis specification.
  3. Choropleth is created for the relationship between the two cohort for the chosen characteristic Percent: (summary characteristic in outcome cohort/summary characteristic in target cohort)*100
Normalization: When one map represents characteristics of a single cohort, we would like the option of normalizing any count measure to area dimensions of the shape/polygon.

gowthamrao commented 5 years ago

Requirement 9: Clustering

We would like to know if in the various geographic/administrative area for subjects in a cohort, if there are subjects in close proximity to each other based on some variable of interest. It is a form of supervised learning.

Additional reading

pavgra commented 5 years ago

One of the core technical questions which we need to answer to implement the functionality described above is geometry processing across all databases supported by SqlRender. There are two options - deploy additional tool next to a source database to process queries like does the geometry contains point? (we cannot deploy such tool next to WebAPI because it will mean transferring of data out of the database which may break security constraints) but then we face a need for some orchestration component which complicates the environment a lot. Otherwise, we need to use GIS modules of the databases. Here research is required to prove that such modules are available for all of the DBs. Initial check shows following:

Microsoft SQL Server

Supports - https://docs.microsoft.com/en-us/sql/t-sql/spatial-geometry/stcontains-geometry-data-type?view=sql-server-2017
Oracle

Supports - https://docs.oracle.com/database/121/SPATL/sdo_contains.htm#SPATL1024
PostgreSQL

PostGIS is amazing!
Amazon RedShift

Should be possible with custom UDFs - http://skipperkongen.dk/2016/01/19/how-to-work-with-spatial-data-in-amazon-redshift/ . POC required.
Impala

Should be possible with custom UDFs:
- https://esri.github.io/gis-tools-for-hadoop/
- https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_udf.html
POC required.
IBM Netezza

Supports - https://www.ibm.com/developerworks/community/files/form/anonymous/api/library/1b6a2624-dc86-4856-b4ed-cdda6bfdecda/document/c17c1321-d9af-4f3f-91ed-286af78b9a6f/media/Netezza_Spatial_Package_Reference_Guide-3.0.0.pdf
Google BigQuery

Supports - https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_contains
Microsoft PDW

Does NOT suport. Reasearch required whether it is possible to implement smth via stored procedures. Tagging @chrisknoll here

gklebanov commented 5 years ago

As one of the options for technical implementation, instead of trying to build a solution that relies on 8+ databases and their support for Geospatial, why don't we consider the "ACHILLES" approach - e.g. pre-calculate all values post-ETL but pre-PRODUCTION use. Faster during analytical query time, consistent across all databases, less effort from implementation and support perspective.

Another thought, I am looking at 2.6 release. I like that we separated the cohort analysis (e.g. "characterization") from cohort design. And we should continue with this model and not embed analysis results back into cohort designer. Instead, we should sit down and discuss which of the above could fit well into characterization and which might even deserve a new dedicated analysis tab - "Geospatial reports"? I do not think that try to shove everything into characterization might be a good idea

pavgra commented 5 years ago

@gklebanov, such option with pre-calculated links between a person and all possible area levels would work (and is good!) for Cohort criteria and stratification in CC, but we still have a case of clustering. Clustering doesn't rely on a person's association with certain area, but is based on persons' locations and their interrelation. Here we still need spatial capabilities, and most possibly not only those but also stored procedures. So the question is whether such clustering is really required from the applied perspective. If we can skip this req, we are in really good shape.

gklebanov commented 5 years ago

Clustering doesn't rely on a person's association with certain area, but is based on persons' locations and their interrelation

I agree and feel that we need to divide and conquer. Let's solve 90% of the problem and then there is always 10% that is left. I feel that Clustering might that 10%.

Not being an expert in R, but I believe there are clustering R modules that exist - can we use those? And I am sure we can find other libraries, and not necessarily R, that we could plug into for clustering? Let's do some research on this?

pavgra commented 5 years ago

Again, most possibly R means that we need either a separate component and orchestration or transfer of a cohort data to middle-tier (than we actually do not need R but can proceed with Java), so that's a problem. But if community is OK with the "first - 90%, then - somehow 10%", we can roll our sleeves up and get to the work.

pavgra commented 5 years ago

So, let me summarize ideas for approaching the reqs from tech perspective:

for Cohort calculation (Requirement 3), Characterization (Requirement 6) and Stratification (Requirement 7) we just need to know relations of a subject to administrative areas. Those are static and therefore can be pre-calculated and stored in many-to-many table (subject_id, administrative_area_id). After that we just add necessary joins using FKs in SQL
for Clustering (Requirements 4, 8, 9) we cannot deal without a separate processing component since not all DBs support spatial functions. Having separate component means extra cost for processing requests and trying to do the clusters calculation in real-time might not provide the best user experience. Therefore, the proposed solution is to calculate clusters for all zoom levels (up to 20 if we look at how Google does it) once and then just query the results to provide better UX. And as we go for a separate component, R might be the best option for doing clustering. So here comes suggestion not to re-invent the wheel and use polished Arachne Execution Engine for R processing.

chandryou commented 5 years ago

As @pavgra suggested, we're thinking about changing GIS backbone system from GADM to OpenStreetMap in AEGIS. I'll start to make FeatureExtraction function for location based on this.

By using longitude and lattitude, the information about the location will be extracted from OpenStreetMap. And then according to the user-defined level, the latest location information will be extracted from CDM. It would be helpful for the future work, such as Cohort Characterization, calculation, and Stratification.

gowthamrao commented 5 years ago

@chandryou @pavgra @gklebanov before we do development work, best practice says we need to complete a proper design. Once design is done, we can divide work.

I'll start to make FeatureExtraction function for location based on this.

@chandryou we will also need to integrate with the work done here that allows us to use Atlas GUI to define features https://github.com/OHDSI/WebAPI/issues/495

By using longitude and lattitude, the information about the location will be extracted from OpenStreetMap. And then according to the user-defined level, the latest location information will be extracted from CDM.

We need to standardize the way we derive a person's, care_site or providers location using temporal information e.g. we may need the location_id of person in the 180 days prior to cohort start date. This will require a standard SQL using location_history table. So we will have to agree on definitions and conventions.

It would be helpful for the future work, such as Cohort Characterization, calculation, and Stratification. Absolutely - and we should be aligned with https://github.com/OHDSI/WebAPI/issues/495

pavgra commented 5 years ago

@gowthamrao , that's right and that's how we built all previous major features. So I would kindly ask @chandryou to hold on a bit. I am currently researching some remaining tech questions and as soon as I have clear answers for those, I'll start assembling UML diagrams

pavgra commented 5 years ago

Tech validation of the idea

Before opening my UML designer and laying out dev plan into diagrams, I've worked on checking that it is technically possible to extract administrative areas from OpenStreetMaps and find associations between a given geopoint (lat & lon of a person) and the areas using both SQL and R. Results are below.

Get administrative areas polygons

Download *.osm.pbf files for required areas via geofabrik
Install Osmosis

Clean up PBF data to leave only administrative areas and get rid of roads, places of interest, houses, etc:

"osmosis-latest/bin/osmosis" --read-pbf-fast workers=8 "pennsylvania-latest.osm.pbf" --tf accept-relations boundary=administrative --write-pbf file="admin_areas.osm.pbf"
"osmosis-latest/bin/osmosis" --read-pbf-fast workers=8 "admin_areas.osm.pbf" --used-way --write-pbf file="admin_areas2.osm.pbf"
"osmosis-latest/bin/osmosis" --read-pbf-fast workers=8 "admin_areas2.osm.pbf" --used-node --write-pbf file="admin_areas3.osm.pbf"

(e.g. original pennsylvania-latest.osm.pbf is 148mb; after cleaning it we remain with admin_areas3 which is only 2mb in size)

Query data using SQL

Install Postgis

Load PBF data via osm2pgsql:

osm2pgsql -d osm -U postgres -W --hstore admin_areas3.osm.pbf
# Osm2pgsql took 6s

Query data:

-- Query all counties

SELECT name, way
FROM planet_osm_polygon
WHERE boundary = 'administrative' AND admin_level = '6' -- County
ORDER BY name

-- Query administrative areas containing a given point (lat-lon, which can be retrieved from a person data)

SELECT name
FROM planet_osm_polygon
WHERE ST_Intersects(
    ST_GeomFromText('POINT(-74.925398 40.217084)', 4326),
    ST_Transform(way, 4326)
)
AND boundary = 'administrative'
ORDER BY admin_level

Query data using R

Install and load rgdal:

install.packages('rgdal')
library(rgdal)

Check build:

c('SQLite', 'OSM') %in% ogrDrivers()$name # should be TRUE

Load OSM data:

areas = rgdal::readOGR('admin_areas3.osm.pbf', 'multipolygons')

Define coordinate system:

crswgs84 <- CRS("+proj=longlat +datum=WGS84")

Transform areas:

transformedAreas <- spTransform(areas, crswgs84)

Define a person's location:

lat <- 40.217084
lng <- -74.925398
p <- SpatialPoints(list(lng,lat), proj4string=crswgs84)

Plot data to see it visiually:

plot(transformedAreas, col="lightgrey", axes=TRUE)
points(p, col = "red", cex = 5, lwd=2)

Query for areas containing the given point:

over(p, transformedAreas, returnList = TRUE)$`1`[c("osm_id", "name", "admin_level")]

fdefalco commented 5 years ago

As this feature is introduced I would like to suggest that we include the features in the data sources tab and provide database level geographic characterization in addition to the requested cohort level capabilities.

pavgra commented 5 years ago

Solution design

Use-cases summary

R1: Area loading, browsing and view. Calculation of relations between person and areas / care sites
R3: Cohort definition using location & area relations
R2/R4/R5/R8/R9: Cohort reporting (profiles summary):
- Google-like clustering of a cohort on a map, including
  - individual pins per person on the map at the max zoom level
  - ability to enter person’s ID and highlight the person on map
  - ability to navigate from a person’s marker to an appropriate Profile page
- Statistical area-based clustering (R9, out of scope for the first iteration)
- Density map
- Choropleth map
  - regular Choropleth map
  - normalized Choropleth map (normalizing count measure to area dimensions)
R6/R7: Cohort Characterization
- characterization of areas
- stratification by areas
- Choropleth map for geo features (R7.3)

Technical challenges

R1. Area representation and browsing

We need some master data which would represent geo hierarchies in CDM, could be browsed and sliced-and-diced, and would be used by geo criteria in Cohort Definitions / stratas in CC / etc. OMOP Vocabularies fit for storage of such data well since for geo ontologies we have concept names, concept relations and domains in the same way as we do for regular medical vocabs. Moreover, we could then browse the geo ontologies via Atlas Vocabulary tab, as well as build concept sets to be used in Cohort Definitions. Therefore, we are going to introduce OMOP Geo Vocabularies. First ones based on OSM and US Census data.

Also, for calculation of relations between people and areas / care sites, same as for highlighting administrative areas over map, we need to store polygons for the areas, which is going to be done in a table of WebAPI’s results schema (since that data is application specific).

R3. Cohort definition, Geo criteria

Here are some use cases which describe tight coupling of location with other criteria in cohort definition:

Cohort of people with heart failure who do not have a heart failure specialist within 20 miles
- condition_occurrence -> person -> location [condition_occurrence.date is between start and end dates of person’s location] -> distance from the location to nearest care site / provider > 20 miles
Cohort with entry dates equal to those when a person has started living in a given area / certain distance range from a care site / provider
- index date depends on location (equals to location.date)

Why cannot we use spatial SQL?

Therefore, the most obvious solution would be to use spatial SQL functions to calculate required geo relations, but Spatial (GIS) functionality is not supported by all DBs used in OHDSI:

Amazon RedShift theoretically could support it but with additional UDFs
- http://skipperkongen.dk/2016/01/19/how-to-work-with-spatial-data-in-amazon-redshift/
Impala theoretically could support it but with additional UDFs and Java / Scala packages deployed to Hive
- https://esri.github.io/gis-tools-for-hadoop/
- https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_udf.html
PDW does NOT support spatial functionality

So, we could try to use an external component to calculate the Geo Criteria in a separate step e.g. using R.

Limitations of geo criteria executed in R

If we imagine execution of Geo criteria in R, the following issues come out:

Geo criteria can be placed only into "Inclusion criteria" section since it should be processed separately (by R).
Geo criteria cannot be nested or cannot nest other criteria since it should be processed separately (by R).
Since geo criteria require processing by separate component, we cannot really embed them into circe expression - any change in circe expression (cohort JSON) would require reflection in Exported SQL.

These limitations are not acceptable.

Pre-calculation approach

Since the use-cases dictate a need to embed geo criteria into cohort definition, but this embedding cannot be done if we use an external geo calculation component, the only remaining option is pre-calculation of all geo relations required for cohort build process and their usage in SQL:

Location’s region pre-calculation Associations of locations with regions (point in polygon) can be pre-calculated and stored in many-to-many relationship table:

LOCATION_AREA

location_id 1 1 1

lowest_area_concept_id 2 (OSM) 3 (ISO) 4 (Hospital catchment areas)
Distance between person and care sites locations pre-calculation

LOCATION_DISTANCE

person_location_id 1

care_site_location_id 1

distance 10

unit_concept_id 1 (miles)

LOCATION_AREA
location_id	1	1	1
lowest_area_concept_id	2 (OSM)	3 (ISO)	4 (Hospital catchment areas)

LOCATION_DISTANCE
person_location_id	1
care_site_location_id	1
distance	10
unit_concept_id	1 (miles)

The tables for storing pre-calculated relations will be put into Derived tables section of OMOP. After we have pre-calculated the relations above, we just add necessary joins using FKs into Cohort Definition SQL.

The pre-calculation approach has already been discussed with community and appropriate tables will appear in next iteration of OMOP CDM: https://github.com/OHDSI/CommonDataModel/issues/220

R6 / R7. Cohort Characterization involving Geo relations

For Cohort Characterization the pre-calculated relations should also be enough and nothing additional is required.

R4. Use-cases with dynamic location relations

In case of dynamic relations, which occur when we talk about Clustering and Density maps for cohorts (there can be enormous amount of permutations of peoples and their locations between cohorts), we are going to use a separate processing component. The choice is because here we cannot pre-calculate things but although are not limited with tight-coupling to any SQL based calculations. As we go for a separate component, R might be the best option for calculating geo relations / distances and doing clustering (we can also call Java from R). So here comes the suggestion not to reinvent the wheel and use polished Arachne Execution Engine for R processing.

Even with non-statistical clustering (but Google-like) we need to calculate clusters and store them at back-end due to big amounts of data and inability of client-side to handle it (e.g. clustering of 1m patients at UI makes things unresponsive even at beefy PC).

R2 / R4 / R5. Visualization on map

To display things on a map on UI we are in a need for map tiles. Basically, we could go for usage of any public API (Google, Bing, OSM), but we are limited by:

ability to use only open-source solution
need to use the functionality in isolated environments, so need to deploy the tile server on premise

the most reasonable option seems to be OpenStreetMaps as a tile data source and mapnik + carto + mod_tile stack to serve the tiles (following bundle was tested - https://github.com/odysseusinc/osm-tiles-mapnik).

(GeoServer was evaluated as an alternative for mapnik + carto + mod_tile but there are big issues with tiles styling).

Polygon storage

It is proposed to use GeoJSON as a way to store complex geo data structures in unified way in DB (+ Java - https://github.com/opendatalab-de/geojson-jackson)

UI libraries

OpenLayers library was tested and chosen as a package providing a convenient way to work with multiple layers and supporting multiple tiles adapters out-of-the-box.

GeoJSON.js was chosen to support work with GeoJSON data format.

Licenses

Following are the new components and their licenses:

Component	License type
OpenStreetMaps data	“If you're writing software that uses OSM data, you may choose any license you like, including proprietary”https://wiki.openstreetmap.org/wiki/Legal_FAQ
mapnik	Mapnik software is free and is released under the LGPL v2.1 (GNU Lesser General Public License, version 2.1)https://github.com/mapnik/mapnik
openstreetmap-carto	CC0 Public Domain Dedicationhttps://github.com/gravitystorm/openstreetmap-carto/blob/master/LICENSE.txt
mod_tile	GNU General Public License v2.0https://github.com/openstreetmap/mod_tile/blob/master/COPYING

Technical design

Deployment diagram

The architecture is very flexible since all new components can be deployed on-demand: it is possible to use public OSM tile server instead of deploying a one on-premise. Same if you don’t need clustering or density map analyses, there is no need to deploy EE.

Data model (WebAPI + CDM)

Location criteria (circe-be)

The new Criteria to be introduced into circe:

Person location is (not) in area
- Input:
  - area concept set
  - date range
  - should the person be inside the given areas or vice-versa?
Person location is (not) within range from a care site
- Input:
  - care site concept set
  - date range
  - GeoRange

CC geo analyses (FeatureExtraction)

Need to extend FeatureExtraction package to include following Preset Feature analyses:

Demographic by state (current user location)

Demographics / By state

California 340 (34%)

Illinois 660 (66%)
Distribution analysis characterizing the euclidean distance between a subjects loction_id and their providers location_id in miles/kms as continuous value in cohort characterization. E.g. average/distribution distance from provider for each area.

Demographics / By state
California	340 (34%)
Illinois	660 (66%)

WebAPI class diagram

CC stratification

Criteria group to be used as a discriminator for stratas.

Execution engine package

Following features should be handled via OhdsiGIS package executed in EE:

populate derived tables (location_area & location_distance)
clustering of people in a cohort
density map calculation

Mockups

Below are mockups of new screens and UI elements.

Cohort reporting (visualizations over map)

Dot distribution map

There should be an ability to type in a person’s ID and highlight it on map
There should be an ability to navigate from a person’s marker to person’s profile

Geo Cohort characterization

Strata design

Stratified results

Choropleth map

There should be a map next to table with geo analysis / analysis with stratification by geo criteria, which will present Choropleth map of analysis results (in case of Distribution analyses - using average as area value)

Development plan

Task list:

Convert SynPUF to CDM v6 to have latitudes & longitudes presented in test data
Build geo vocabulary + polygon list based on OSM & US Census
Create (temporary) PostGIS SQL to populate LOCATION_AREA & LOCATION_DISTANCE tables
Create sample SQLs for cohort definitions with geo criteria to validate approach
Extend circe-be and embed new geo criteria into Cohort Definition
Area based feature analysis for Cohort Characterization (characterization of areas; new FeatureExtraction analyses)
Module for pre-calculation of geo relations
Stratification support for CC
Marker clustering + visualization
Density map analysis + visualization
Choropleth map + visualization
Choropleth visualizations for geo analyses in CC

pavgra commented 5 years ago

Crossing out Item 8 as completed

pavgra commented 5 years ago

Test data

Load US boundaries data

For development stage geo data was loaded from community provided service, which serves shapefiles with admin areas boundaries only (this is what we need; original OSM files contain much more data, e.g. roads, points of interest, etc). Specifically, I've loaded all US administrative boundaries: https://wambachers-osm.website/boundaries/exportBoundaries?cliVersion=1.0&cliKey=f3258920-2495-4da3-8255-747407dca341&exportFormat=shp&exportLayout=levels&exportAreas=Land&union=false&from_AL=2&to_AL=12&selected=148838

Geo capabilities in Postgres are provided by PostGIS:

CREATE EXTENSION postgis;

The data was loaded using shp2pgsql tool:

shp2pgsql -s 4326 "United States_AL2.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL4.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL5.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL6.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL7.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL8.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL9.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL10.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL11.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z

Indexing for the field with type geometry:

CREATE INDEX us_geo_geom ON geo.us USING GIST (geom);

Create test CDM data

Create LOCATION table in new format:

ALTER TABLE five_three_plus.location RENAME TO location_v5;

CREATE TABLE five_three_plus.location
(
  location_id           BIGINT       NOT NULL,
  address_1             VARCHAR(50)  NULL,
  address_2             VARCHAR(50)  NULL,
  city                  VARCHAR(50)  NULL,
  state                 VARCHAR(2)   NULL,
  zip                   VARCHAR(9)   NULL,
  county                VARCHAR(20)  NULL,
  country               VARCHAR(100) NULL,
  location_source_value VARCHAR(50)  NULL,
  latitude              NUMERIC      NULL,
  longitude             NUMERIC      NULL
);

-- Field introduced by https://github.com/OHDSI/CommonDataModel/issues/220
ALTER TABLE five_three_plus.location ADD region_concept_id INTEGER;

Populate new LOCATION table with randomly picked lat-lon pairs from specified counties and with specified frequency:

WITH
list AS (
  -- PA
  -- https://us-places.com/Pennsylvania/population-by-County.htm
  SELECT 417845 AS id, 1.5 AS count -- Philadelphia County
  UNION ALL
  SELECT 2563472 AS id, 1.23 AS count -- Allegheny County
  UNION ALL
  SELECT 417844 AS id, 0.808 AS count -- Montgomery County
  UNION ALL
  SELECT 417843 AS id, 0.626 AS count -- Bucks County
  UNION ALL
  SELECT 416694 AS id, 0.115 AS count -- Mercer County
  UNION ALL

  -- NJ
  -- https://us-places.com/New-Jersey/population-by-County.htm
  SELECT 958930 AS id, 0.919 AS count -- Bergen County
  UNION ALL
  SELECT 961153 AS id, 0.823 AS count -- Middlesex County
  UNION ALL
  SELECT 957235 AS id, 0.786 AS count -- Essex County
), multipoints AS (
  SELECT ST_GeneratePoints(ST_Transform(geom, 4326), l.count * 1000) g
  FROM geo.us areas
    JOIN list l  ON areas.id = l.id
)
, points AS (
  SELECT (ST_Dump(g)).geom p
  FROM multipoints
)
INSERT INTO five_three_plus.location
SELECT
  ROW_NUMBER() OVER () AS location_id,
  NULL AS address_1,
  NULL AS address_2,
  city.name AS city,
  split_part(state.iso3166_2, '-', 2) AS state,
  NULL AS zip,
  county.name AS county,
  'USA' AS country,
  NULL AS location_source_value,
  ST_Y(p) AS latitude,
  ST_X(p) AS longitude
FROM points p
  JOIN LATERAL (
    SELECT *
    FROM geo.us area
    WHERE ST_CONTAINS(area.geom, ST_SetSRID(ST_MakePoint(ST_X(p), ST_Y(p)), 4326))
      AND adminlevel = 4
    LIMIT 1
  ) state ON true
  JOIN LATERAL (
    SELECT *
    FROM geo.us area
    WHERE ST_CONTAINS(area.geom, ST_SetSRID(ST_MakePoint(ST_X(p), ST_Y(p)), 4326))
      AND adminlevel = 6
    LIMIT 1
  ) county ON true
  LEFT JOIN LATERAL (
    SELECT *
    FROM geo.us area
    WHERE ST_CONTAINS(area.geom, ST_SetSRID(ST_MakePoint(ST_X(p), ST_Y(p)), 4326))
      AND adminlevel = 8
    LIMIT 1
  ) city ON true;

Create new LOCATION_HISTORY table:

CREATE TABLE five_three_plus.location_history
(
  location_history_id          BIGINT      NOT NULL,
  location_id                  BIGINT      NOT NULL,
  relationship_type_concept_id INTEGER     NOT NULL,
  domain_id                    VARCHAR(50) NOT NULL,
  entity_id                    BIGINT      NOT NULL,
  start_date                   DATE        NOT NULL,
  end_date                     DATE        NULL
);

Drop NOT NULL constraint on location_history.relationship_type_concept_id until there is an answer regarding concepts to be used in the field: http://forums.ohdsi.org/t/concepts-for-location-history-relationship-type-concept-id/5525

ALTER TABLE five_three_plus.location_history ALTER relationship_type_concept_id DROP NOT NULL;

Populate location_history for person and care_site:

WITH loc_history AS (
  SELECT
    ROW_NUMBER() OVER () AS location_history_id,
    l.location_id,
    CAST(NULL AS INTEGER) AS relationship_type_concept_id,
    CASE WHEN l.location_id % 2 = 0
      THEN CAST('1990-01-01' AS DATE)
      ELSE CAST('2000-01-01' AS DATE)
    END AS start_date,
    CASE WHEN l.location_id % 2 = 0
      THEN CAST('1999-12-31' AS DATE)
      ELSE NULL
    END AS end_date
  FROM five_three_plus.location l
)
, care_site_loc_history_raw AS (
  SELECT
    location_history_id,
    location_id,
    relationship_type_concept_id,
     -- OMOP generated. Care site
    57 AS domain_id,
    start_date,
    end_date
  FROM loc_history
  ORDER BY random()
  LIMIT 100
)
, care_site_loc_history AS (
  SELECT
    *,
    ROW_NUMBER() OVER () AS entity_id
  FROM care_site_loc_history_raw
)
, person_loc_history AS (
  SELECT
    loc_history.*,
    -- OMOP generated   Person
    56 AS domain_id,
    ROW_NUMBER() OVER () AS entity_id
  FROM loc_history
  WHERE loc_history.location_history_id NOT IN (SELECT location_history_id FROM care_site_loc_history)
)
INSERT INTO five_three_plus.location_history
SELECT *
FROM (
  SELECT
    location_history_id,
    location_id,
    relationship_type_concept_id,
    domain_id,
    entity_id,
    start_date,
    end_date
  FROM care_site_loc_history

  UNION ALL

  SELECT
    location_history_id,
    location_id,
    relationship_type_concept_id,
    domain_id,
    entity_id,
    start_date,
    end_date
  FROM person_loc_history
) l;

Create test Geo Vocabulary

Create OMOP vocabulary for OSM:

INSERT INTO five_three_plus.vocabulary (vocabulary_id, vocabulary_name)
VALUES ('OSM', 'OpenStreetMap');

Create OSM concepts:

DELETE FROM five_three_plus.concept
WHERE concept_id > 2100000000;

WITH deduped_data AS (
  SELECT MIN(id) id
  FROM geo.us
  GROUP BY
    name,
    adminlevel,
    array_remove(
      CAST(regexp_split_to_array(rpath, ',') AS INT []),
      id
    )
)
INSERT INTO five_three_plus.concept
SELECT
  ROW_NUMBER() OVER () + 2100000000 AS concept_id,
  area.name as concept_name,
  'Geo' as domain_id,
  'OSM' as vocabulary_id,
  -- https://wiki.openstreetmap.org/wiki/Tag:boundary=administrative
  CASE CAST(area.adminlevel AS INT)
    WHEN 2 THEN 'Country'
    WHEN 4 THEN 'State'
    WHEN 5 THEN 'Agglomerate'
    WHEN 6 THEN 'County'
    WHEN 7 THEN 'Civil township'
    WHEN 8 THEN 'Municipality'
    WHEN 9 THEN 'Ward'
    WHEN 10 THEN 'Neighborhood'
    ELSE 'Unknown geo area'
  END as concept_class_id,
  'S' AS standard_concept,
  area.id as concept_code,
  CAST('1970-01-01' AS DATE) as valid_start_date,
  CAST('2099-12-31' AS DATE) as valid_end_date,
  NULL as invalid_reason
FROM deduped_data d
  JOIN geo.us area ON area.id = d.id;

Populate concept_ancestor:

DELETE FROM five_three_plus.concept_ancestor
WHERE ancestor_concept_id > 2100000000;

CREATE OR REPLACE FUNCTION array_uniq_stable(anyarray) RETURNS anyarray AS
$body$
SELECT
    array_agg(distinct_value ORDER BY first_index)
FROM
    (SELECT
        value AS distinct_value,
        min(index) AS first_index
    FROM
        unnest($1) WITH ORDINALITY AS input(value, index)
    GROUP BY
        value
    ) AS unique_input
;
$body$
LANGUAGE 'sql' IMMUTABLE STRICT;

WITH area_with_path_array AS (
    SELECT array_remove(
        CAST(regexp_split_to_array(
                 -- To transform exclude children from paths e.g. transform rpath = 5908799,5908815,251075,270623,165466,148838,0 for id=5908815 into rpath=5908815,251075,270623,165466,148838,0
                 SUBSTR(rpath, POSITION(CAST(id AS VARCHAR) IN rpath))
         , ',') AS INT []),
        0
    ) path_array,
    *
  FROM geo.us
)
, area AS (
    SELECT
      ancestor_concept.concept_id   AS ancestor_concept_id,
      descendant_concept.concept_id AS descendant_concept_id,
      level - 1                        min_levels_of_separation,
      level - 1                        max_levels_of_separation
    FROM area_with_path_array area
      JOIN LATERAL unnest(array_uniq_stable(area.path_array))
             WITH ORDINALITY AS a(ancestor_concept_code, level) ON TRUE
      JOIN five_three_plus.concept ancestor_concept
        ON ancestor_concept.concept_code = CAST(ancestor_concept_code AS VARCHAR) AND
           ancestor_concept.vocabulary_id = 'OSM'
      JOIN five_three_plus.concept descendant_concept
        ON descendant_concept.concept_code = CAST(area.id AS VARCHAR) AND descendant_concept.vocabulary_id = 'OSM'
)
INSERT INTO five_three_plus.concept_ancestor
SELECT *
FROM area;

Populate concept relationships:

INSERT INTO five_three_plus.concept_relationship
SELECT
  c.concept_id AS concept_id_1,
  c.concept_id AS concept_id_2,
  'Maps to' AS relationship_id,
  CAST('1970-01-01' AS DATE) AS valid_start_date,
  CAST('2099-12-31' AS DATE) AS valid_end_date
FROM five_three_plus.concept_ancestor ca
  JOIN five_three_plus.concept c ON c.concept_id = ca.ancestor_concept_id AND c.vocabulary_id = 'OSM'
WHERE min_levels_of_separation = 0 AND max_levels_of_separation = 0;

Create new Units:

-- Unit. Meters
INSERT INTO five_three_plus.concept (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason)
VALUES (2110000000, 'Meters', 'Geo', 'Geo unit', 'Unit', 'S', 'OMOP generated', CAST('1970-01-01' AS DATE), CAST('2099-12-31' AS DATE), NULL );

Prepare results schema

Load geometries into results schema:

CREATE TABLE five_three_plus_results.area_polygon AS
SELECT c.concept_id, ST_AsGeoJSON(area.geom) polygon
FROM geo.us area
  JOIN five_three_plus.concept c ON c.concept_code = CAST(id AS VARCHAR) AND c.vocabulary_id = 'OSM';

CDM Derived tables

Create new derived tables:

CREATE TABLE five_three_plus.location_distance (
  person_location_id INTEGER,
  care_site_location_id INTEGER,
  distance NUMERIC,
  unit_concept_id INTEGER
);

Precalculate relations between LOCATION and lowest level areas:

-- TODO: how to handle points which are not in lowest level areas?

CREATE INDEX area_polygon_geom ON five_three_plus_results.area_polygon USING GIST (ST_SetSrid(ST_GeomFromGeoJSON(area_polygon.polygon), 4326));
CREATE INDEX location_point ON five_three_plus.location USING GIST (ST_SetSrid(ST_MakePoint(longitude, latitude), 4326));

WITH lowest_level_areas AS (
  SELECT area.*
  FROM five_three_plus_results.area_polygon area
  WHERE NOT EXISTS (SELECT * FROM five_three_plus.concept_ancestor WHERE ancestor_concept_id = area.concept_id AND min_levels_of_separation > 0)
)
, loc_area AS (
  SELECT l.location_id, area.concept_id
  FROM five_three_plus.location l
    JOIN lowest_level_areas area ON ST_CONTAINS(ST_SetSrid(ST_GeomFromGeoJSON(area.polygon), 4326), ST_SetSrid(ST_MakePoint(l.longitude, l.latitude), 4326))
)
UPDATE five_three_plus.location
SET region_concept_id = la.concept_id
FROM loc_area la
WHERE location.location_id = la.location_id;

Precalculate distance between care_site and person:

WITH care_site_person_locations AS (
  SELECT DISTINCT cs.care_site_id, cs_lh.location_id as care_site_location_id, p.person_id, p.location_id AS person_location_id
  FROM
    five_three_plus.visit_occurrence v
    JOIN five_three_plus.care_site cs ON cs.care_site_id = v.care_site_id
    JOIN five_three_plus.location_history cs_lh ON  cs_lh.domain_id = /* OMOP generated. Care site */ CAST(57 AS VARCHAR) AND cs.care_site_id = cs_lh.entity_id
    JOIN five_three_plus.location cs_l ON cs_l.location_id = cs_lh.location_id
    JOIN five_three_plus.person p ON p.person_id = v.person_id
    JOIN five_three_plus.location_history p_lh ON  p_lh.domain_id = /* OMOP generated. Person */ CAST(56 AS VARCHAR) AND p.person_id = p_lh.entity_id
    JOIN five_three_plus.location p_l ON p_l.location_id = p_lh.location_id
)
INSERT INTO five_three_plus.location_distance
SELECT
  -- person_id,
  person_location_id,
  -- person_location.longitude, person_location.latitude,
  -- care_site_id,
  care_site_location_id,
  -- care_site_location.longitude, care_site_location.latitude,
  ST_DISTANCE(
      ST_GeographyFromText('POINT(' || person_location.longitude || ' ' || person_location.latitude || ')'),
      ST_GeographyFromText('POINT(' || care_site_location.longitude || ' ' || care_site_location.latitude || ')')
  ),
  2110000000
FROM care_site_person_locations
  JOIN five_three_plus.location person_location ON person_location.location_id = person_location_id
  JOIN five_three_plus.location care_site_location ON care_site_location.location_id = care_site_location_id;

Example queries with geo criteria

Select all person who are currently living somewhere in Bucks county:

SELECT *
FROM five_three_plus.person p
  JOIN five_three_plus.location_history lh
    ON lh.domain_id = /* OMOP generated. Person */ CAST(56 AS VARCHAR) AND p.person_id = lh.entity_id
  JOIN five_three_plus.location l
    ON l.location_id = lh.location_id
  JOIN five_three_plus.concept_ancestor ca ON ca.descendant_concept_id = l.region_concept_id
WHERE
  -- Current location
  lh.end_date IS NULL
  AND
  -- Somewhere in Bucks County
  ca.ancestor_concept_id = 2100014076;

pavgra commented 5 years ago

Challenges with constructing geo hierarchy

Although, the dataset used above to build Geo vocabulary contained pre-built hierarchy paths, the original OSM data doesn't contain those, same as there is no guarantee that another datasets (e.g. US Census) will contain the paths.

Therefore, there is a need to be able to calculate paths based on given polygons (boundaries). I've examined some approaches and haven't succeeded yet, but wanted to share the experience.

ST_CONTAINS hasn't finished for the set of all US boundaries after 11 hours
Operator && which compares bounding boxes of given geometries (and uses indexes which makes it pretty fast) is not accurate enough for us - areas can be interpenetrating. Due to the same reasons ST_CONTAINS(parent_geom, ST_CENTROID(child_geom)) is not working properly

Calculation via ST_INTERSECTS is pretty slow (~1 hour for the dataset) and gives incorrect results:

SELECT *
FROM
/* NYC state */ (SELECT * FROM geo.us WHERE id = 61320) nyc JOIN /* Country in Vermont state */ (SELECT * FROM geo.us WHERE id = 60803) benn
ON ST_INTERSECTS(nyc.geom, benn.geom)

pavgra commented 5 years ago

Area hierarchy viewer

cgreich commented 5 years ago

Friends: Dug around a little in the OSM world. Talked to the folks at the Wambachers website mentioned above. Couple of things:

Polygon information is not consistently available. The coverage varies country by country.
In the US, we can probably only rely on states and counties. If they exist. I learned myself that of all states Massachusetts abolished half of its counties when they went bankrupt in the 90ies. We either substitute them with towns, or we won't have them.
Zip and Postcodes codes are not available
US Census polygons are not available
We must not make any changes to the content, otherwise we violate the ODbL license. I don't see a danger of that.

pavgra commented 5 years ago

Requirements 3, 6 and 7 were completed

StandardizedAnalysisAPI: https://github.com/OHDSI/StandardizedAnalysisAPI/pull/22 https://github.com/OHDSI/StandardizedAnalysisAPI/pull/24 https://github.com/OHDSI/StandardizedAnalysisAPI/pull/25 https://github.com/OHDSI/StandardizedAnalysisAPI/pull/26

circe-be: https://github.com/OHDSI/circe-be/pull/75 https://github.com/OHDSI/circe-be/pull/76 https://github.com/OHDSI/circe-be/pull/80 https://github.com/OHDSI/circe-be/pull/81

WebAPI: https://github.com/OHDSI/WebAPI/pull/1050/files https://github.com/OHDSI/WebAPI/pull/1061

Atlas: https://github.com/OHDSI/Atlas/pull/1669 https://github.com/OHDSI/Atlas/pull/1697

WebAPI-extensions: https://github.com/odysseusinc/WebAPI-extensions

pavgra commented 5 years ago

Requirements 2, 4, 5 related PRs (work in progress, to be updated):

Atlas:

WebAPI:

Arachne Commons:

Execution Engine:

WebPlatform:

https://github.com/odysseusinc/WebPlatform/pull/1

GisService:

OHDSI / WebAPI

Geospaital functionality in Atlas: integration of AEGIS #649

Geospatial functionality in Atlas: integration of Aegis

Tech validation of the idea

Get administrative areas polygons

Query data using SQL

Query data using R

Solution design

Use-cases summary

Technical challenges

R1. Area representation and browsing

R3. Cohort definition, Geo criteria

Why cannot we use spatial SQL?

Limitations of geo criteria executed in R

Pre-calculation approach

R6 / R7. Cohort Characterization involving Geo relations

R4. Use-cases with dynamic location relations

R2 / R4 / R5. Visualization on map

Polygon storage

UI libraries

Licenses

Technical design

Deployment diagram

Data model (WebAPI + CDM)

Location criteria (circe-be)

CC geo analyses (FeatureExtraction)

WebAPI class diagram

CC stratification

Execution engine package

Mockups

Cohort reporting (visualizations over map)

Dot distribution map

Geo Cohort characterization

Strata design

Stratified results

Choropleth map

Development plan

Test data

Load US boundaries data

Create test CDM data

Create test Geo Vocabulary

Prepare results schema

CDM Derived tables

Example queries with geo criteria

Challenges with constructing geo hierarchy

Area hierarchy viewer