Open gowthamrao opened 5 years ago
Use-cases:
Requirement 1: Area shape selection:
Requirement 2: Map tile visualization as background for shapes:
Requirement 3: Cohort definition using location table:
Example in Atlas/circe-be: that should be supported at both cohort entry events (qualified events) and inclusion criteria (inclusion events) are
In all cases above, the location_id is inferred by using the earliest valid location_id for the person_id/care_site_id/provider_id based on interval period of the qualified/inclusion event thru look-up to location_history
Requirement 4: Cohort profiles summary (Visualization)
Marker/dot distribution map: Each subject_id is a marker on a map (e.g. one-to-one dot distribution map) with marker clustering to (e.g. one-to-many dot density distribution map) we would like the option of clustering the markers using a visualization detailed here. This type of clustering is a form of reactive map visualization, and not a true cluster geospatial analysis.
Density map: This is the process of creating contours on a map based on 2D representation of dots on a map, where each dot is a subject_id. This visualization technique is described Binning is performed similar to histogram, but in a 2D space. Bins may be square or hexagon (hexbin), with color of each bin determined by either count (or for continuous feature per person - sum of the features). Contours are drawn to represent boundaries of similar distributions. D3 examples are here.
Requirement 5: Person profile summary (Visualization)
Requirement 6: Enhance cohort characterization using location table/domain:
Demographics | ||
---|---|---|
Male | 340 (34%) | |
Female | 660 (66%) | |
Age | ||
00 - 10 | 100 (10%) | |
10 - 20 | 200 (20%) | |
20 - 30 | 140 (14%) | |
Geographic location (Stage) | ||
California | 100 (10%) | |
Illinois | 200 (20%) | |
South Carolina | 300 (30%) |
Requirement 7: Cohort + area characterization:
Requirement 8: Choropleth map
Requirement 9: Clustering
Additional reading
One of the core technical questions which we need to answer to implement the functionality described above is geometry processing across all databases supported by SqlRender. There are two options - deploy additional tool next to a source database to process queries like does the geometry contains point?
(we cannot deploy such tool next to WebAPI because it will mean transferring of data out of the database which may break security constraints) but then we face a need for some orchestration component which complicates the environment a lot. Otherwise, we need to use GIS modules of the databases. Here research is required to prove that such modules are available for all of the DBs. Initial check shows following:
Microsoft SQL Server
Oracle
Supports - https://docs.oracle.com/database/121/SPATL/sdo_contains.htm#SPATL1024
PostgreSQL
PostGIS is amazing!
Amazon RedShift
Should be possible with custom UDFs - http://skipperkongen.dk/2016/01/19/how-to-work-with-spatial-data-in-amazon-redshift/ . POC required.
Impala
Should be possible with custom UDFs:
POC required.
IBM Netezza
Google BigQuery
Supports - https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_contains
Microsoft PDW
Does NOT suport. Reasearch required whether it is possible to implement smth via stored procedures. Tagging @chrisknoll here
As one of the options for technical implementation, instead of trying to build a solution that relies on 8+ databases and their support for Geospatial, why don't we consider the "ACHILLES" approach - e.g. pre-calculate all values post-ETL but pre-PRODUCTION use. Faster during analytical query time, consistent across all databases, less effort from implementation and support perspective.
Another thought, I am looking at 2.6 release. I like that we separated the cohort analysis (e.g. "characterization") from cohort design. And we should continue with this model and not embed analysis results back into cohort designer. Instead, we should sit down and discuss which of the above could fit well into characterization and which might even deserve a new dedicated analysis tab - "Geospatial reports"? I do not think that try to shove everything into characterization might be a good idea
@gklebanov, such option with pre-calculated links between a person and all possible area levels would work (and is good!) for Cohort criteria and stratification in CC, but we still have a case of clustering. Clustering doesn't rely on a person's association with certain area, but is based on persons' locations and their interrelation. Here we still need spatial capabilities, and most possibly not only those but also stored procedures. So the question is whether such clustering is really required from the applied perspective. If we can skip this req, we are in really good shape.
Clustering doesn't rely on a person's association with certain area, but is based on persons' locations and their interrelation
I agree and feel that we need to divide and conquer. Let's solve 90% of the problem and then there is always 10% that is left. I feel that Clustering might that 10%.
Not being an expert in R, but I believe there are clustering R modules that exist - can we use those? And I am sure we can find other libraries, and not necessarily R, that we could plug into for clustering? Let's do some research on this?
Again, most possibly R means that we need either a separate component and orchestration or transfer of a cohort data to middle-tier (than we actually do not need R but can proceed with Java), so that's a problem. But if community is OK with the "first - 90%, then - somehow 10%", we can roll our sleeves up and get to the work.
So, let me summarize ideas for approaching the reqs from tech perspective:
As @pavgra suggested, we're thinking about changing GIS backbone system from GADM to OpenStreetMap in AEGIS. I'll start to make FeatureExtraction function for location based on this.
By using longitude and lattitude, the information about the location will be extracted from OpenStreetMap. And then according to the user-defined level, the latest location information will be extracted from CDM. It would be helpful for the future work, such as Cohort Characterization, calculation, and Stratification.
@chandryou @pavgra @gklebanov before we do development work, best practice says we need to complete a proper design. Once design is done, we can divide work.
I'll start to make FeatureExtraction function for location based on this.
@chandryou we will also need to integrate with the work done here that allows us to use Atlas GUI to define features https://github.com/OHDSI/WebAPI/issues/495
By using longitude and lattitude, the information about the location will be extracted from OpenStreetMap. And then according to the user-defined level, the latest location information will be extracted from CDM.
We need to standardize the way we derive a person's, care_site or providers location using temporal information e.g. we may need the location_id of person in the 180 days prior to cohort start date. This will require a standard SQL using location_history table. So we will have to agree on definitions and conventions.
It would be helpful for the future work, such as Cohort Characterization, calculation, and Stratification. Absolutely - and we should be aligned with https://github.com/OHDSI/WebAPI/issues/495
@gowthamrao , that's right and that's how we built all previous major features. So I would kindly ask @chandryou to hold on a bit. I am currently researching some remaining tech questions and as soon as I have clear answers for those, I'll start assembling UML diagrams
Before opening my UML designer and laying out dev plan into diagrams, I've worked on checking that it is technically possible to extract administrative areas from OpenStreetMaps and find associations between a given geopoint (lat & lon of a person) and the areas using both SQL and R. Results are below.
"osmosis-latest/bin/osmosis" --read-pbf-fast workers=8 "pennsylvania-latest.osm.pbf" --tf accept-relations boundary=administrative --write-pbf file="admin_areas.osm.pbf"
"osmosis-latest/bin/osmosis" --read-pbf-fast workers=8 "admin_areas.osm.pbf" --used-way --write-pbf file="admin_areas2.osm.pbf"
"osmosis-latest/bin/osmosis" --read-pbf-fast workers=8 "admin_areas2.osm.pbf" --used-node --write-pbf file="admin_areas3.osm.pbf"
(e.g. original pennsylvania-latest.osm.pbf is 148mb; after cleaning it we remain with admin_areas3
which is only 2mb in size)
osm2pgsql -d osm -U postgres -W --hstore admin_areas3.osm.pbf
# Osm2pgsql took 6s
Query data:
-- Query all counties
SELECT name, way
FROM planet_osm_polygon
WHERE boundary = 'administrative' AND admin_level = '6' -- County
ORDER BY name
-- Query administrative areas containing a given point (lat-lon, which can be retrieved from a person data)
SELECT name
FROM planet_osm_polygon
WHERE ST_Intersects(
ST_GeomFromText('POINT(-74.925398 40.217084)', 4326),
ST_Transform(way, 4326)
)
AND boundary = 'administrative'
ORDER BY admin_level
Install and load rgdal
:
install.packages('rgdal')
library(rgdal)
Check build:
c('SQLite', 'OSM') %in% ogrDrivers()$name # should be TRUE
Load OSM data:
areas = rgdal::readOGR('admin_areas3.osm.pbf', 'multipolygons')
Define coordinate system:
crswgs84 <- CRS("+proj=longlat +datum=WGS84")
Transform areas:
transformedAreas <- spTransform(areas, crswgs84)
Define a person's location:
lat <- 40.217084
lng <- -74.925398
p <- SpatialPoints(list(lng,lat), proj4string=crswgs84)
Plot data to see it visiually:
plot(transformedAreas, col="lightgrey", axes=TRUE)
points(p, col = "red", cex = 5, lwd=2)
Query for areas containing the given point:
over(p, transformedAreas, returnList = TRUE)$`1`[c("osm_id", "name", "admin_level")]
As this feature is introduced I would like to suggest that we include the features in the data sources tab and provide database level geographic characterization in addition to the requested cohort level capabilities.
We need some master data which would represent geo hierarchies in CDM, could be browsed and sliced-and-diced, and would be used by geo criteria in Cohort Definitions / stratas in CC / etc. OMOP Vocabularies fit for storage of such data well since for geo ontologies we have concept names, concept relations and domains in the same way as we do for regular medical vocabs. Moreover, we could then browse the geo ontologies via Atlas Vocabulary tab, as well as build concept sets to be used in Cohort Definitions. Therefore, we are going to introduce OMOP Geo Vocabularies. First ones based on OSM and US Census data.
Also, for calculation of relations between people and areas / care sites, same as for highlighting administrative areas over map, we need to store polygons for the areas, which is going to be done in a table of WebAPI’s results schema (since that data is application specific).
Here are some use cases which describe tight coupling of location with other criteria in cohort definition:
Therefore, the most obvious solution would be to use spatial SQL functions to calculate required geo relations, but Spatial (GIS) functionality is not supported by all DBs used in OHDSI:
So, we could try to use an external component to calculate the Geo Criteria in a separate step e.g. using R.
If we imagine execution of Geo criteria in R, the following issues come out:
These limitations are not acceptable.
Since the use-cases dictate a need to embed geo criteria into cohort definition, but this embedding cannot be done if we use an external geo calculation component, the only remaining option is pre-calculation of all geo relations required for cohort build process and their usage in SQL:
Location’s region pre-calculation Associations of locations with regions (point in polygon) can be pre-calculated and stored in many-to-many relationship table:
LOCATION_AREA | |||
---|---|---|---|
location_id | 1 | 1 | 1 |
lowest_area_concept_id | 2 (OSM) | 3 (ISO) | 4 (Hospital catchment areas) |
Distance between person and care sites locations pre-calculation
LOCATION_DISTANCE | |
---|---|
person_location_id | 1 |
care_site_location_id | 1 |
distance | 10 |
unit_concept_id | 1 (miles) |
The tables for storing pre-calculated relations will be put into Derived tables section of OMOP. After we have pre-calculated the relations above, we just add necessary joins using FKs into Cohort Definition SQL.
The pre-calculation approach has already been discussed with community and appropriate tables will appear in next iteration of OMOP CDM: https://github.com/OHDSI/CommonDataModel/issues/220
For Cohort Characterization the pre-calculated relations should also be enough and nothing additional is required.
In case of dynamic relations, which occur when we talk about Clustering and Density maps for cohorts (there can be enormous amount of permutations of peoples and their locations between cohorts), we are going to use a separate processing component. The choice is because here we cannot pre-calculate things but although are not limited with tight-coupling to any SQL based calculations. As we go for a separate component, R might be the best option for calculating geo relations / distances and doing clustering (we can also call Java from R). So here comes the suggestion not to reinvent the wheel and use polished Arachne Execution Engine for R processing.
Even with non-statistical clustering (but Google-like) we need to calculate clusters and store them at back-end due to big amounts of data and inability of client-side to handle it (e.g. clustering of 1m patients at UI makes things unresponsive even at beefy PC).
To display things on a map on UI we are in a need for map tiles. Basically, we could go for usage of any public API (Google, Bing, OSM), but we are limited by:
the most reasonable option seems to be OpenStreetMaps as a tile data source and mapnik + carto + mod_tile stack to serve the tiles (following bundle was tested - https://github.com/odysseusinc/osm-tiles-mapnik).
(GeoServer was evaluated as an alternative for mapnik + carto + mod_tile but there are big issues with tiles styling).
It is proposed to use GeoJSON as a way to store complex geo data structures in unified way in DB (+ Java - https://github.com/opendatalab-de/geojson-jackson)
OpenLayers library was tested and chosen as a package providing a convenient way to work with multiple layers and supporting multiple tiles adapters out-of-the-box.
GeoJSON.js was chosen to support work with GeoJSON data format.
Following are the new components and their licenses:
Component | License type |
---|---|
OpenStreetMaps data | “If you're writing software that uses OSM data, you may choose any license you like, including proprietary”https://wiki.openstreetmap.org/wiki/Legal_FAQ |
mapnik | Mapnik software is free and is released under the LGPL v2.1 (GNU Lesser General Public License, version 2.1)https://github.com/mapnik/mapnik |
openstreetmap-carto | CC0 Public Domain Dedicationhttps://github.com/gravitystorm/openstreetmap-carto/blob/master/LICENSE.txt |
mod_tile | GNU General Public License v2.0https://github.com/openstreetmap/mod_tile/blob/master/COPYING |
The architecture is very flexible since all new components can be deployed on-demand: it is possible to use public OSM tile server instead of deploying a one on-premise. Same if you don’t need clustering or density map analyses, there is no need to deploy EE.
The new Criteria to be introduced into circe:
Need to extend FeatureExtraction package to include following Preset Feature analyses:
Demographic by state (current user location)
Demographics / By state | |
---|---|
California | 340 (34%) |
Illinois | 660 (66%) |
Distribution analysis characterizing the euclidean distance between a subjects loction_id and their providers location_id in miles/kms as continuous value in cohort characterization. E.g. average/distribution distance from provider for each area.
Criteria group to be used as a discriminator for stratas.
Following features should be handled via OhdsiGIS package executed in EE:
Below are mockups of new screens and UI elements.
There should be a map next to table with geo analysis / analysis with stratification by geo criteria, which will present Choropleth map of analysis results (in case of Distribution analyses - using average as area value)
Task list:
Crossing out Item 8 as completed
For development stage geo data was loaded from community provided service, which serves shapefiles with admin areas boundaries only (this is what we need; original OSM files contain much more data, e.g. roads, points of interest, etc). Specifically, I've loaded all US administrative boundaries: https://wambachers-osm.website/boundaries/exportBoundaries?cliVersion=1.0&cliKey=f3258920-2495-4da3-8255-747407dca341&exportFormat=shp&exportLayout=levels&exportAreas=Land&union=false&from_AL=2&to_AL=12&selected=148838
Geo capabilities in Postgres are provided by PostGIS:
CREATE EXTENSION postgis;
The data was loaded using shp2pgsql
tool:
shp2pgsql -s 4326 "United States_AL2.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL4.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL5.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL6.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL7.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL8.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL9.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL10.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
shp2pgsql -s 4326 -a "United States_AL11.shp" geo.us | psql -h localhost -d synpuf_110k_cu -U ohdsi -z
Indexing for the field with type geometry
:
CREATE INDEX us_geo_geom ON geo.us USING GIST (geom);
Create LOCATION table in new format:
ALTER TABLE five_three_plus.location RENAME TO location_v5;
CREATE TABLE five_three_plus.location
(
location_id BIGINT NOT NULL,
address_1 VARCHAR(50) NULL,
address_2 VARCHAR(50) NULL,
city VARCHAR(50) NULL,
state VARCHAR(2) NULL,
zip VARCHAR(9) NULL,
county VARCHAR(20) NULL,
country VARCHAR(100) NULL,
location_source_value VARCHAR(50) NULL,
latitude NUMERIC NULL,
longitude NUMERIC NULL
);
-- Field introduced by https://github.com/OHDSI/CommonDataModel/issues/220
ALTER TABLE five_three_plus.location ADD region_concept_id INTEGER;
Populate new LOCATION table with randomly picked lat-lon pairs from specified counties and with specified frequency:
WITH
list AS (
-- PA
-- https://us-places.com/Pennsylvania/population-by-County.htm
SELECT 417845 AS id, 1.5 AS count -- Philadelphia County
UNION ALL
SELECT 2563472 AS id, 1.23 AS count -- Allegheny County
UNION ALL
SELECT 417844 AS id, 0.808 AS count -- Montgomery County
UNION ALL
SELECT 417843 AS id, 0.626 AS count -- Bucks County
UNION ALL
SELECT 416694 AS id, 0.115 AS count -- Mercer County
UNION ALL
-- NJ
-- https://us-places.com/New-Jersey/population-by-County.htm
SELECT 958930 AS id, 0.919 AS count -- Bergen County
UNION ALL
SELECT 961153 AS id, 0.823 AS count -- Middlesex County
UNION ALL
SELECT 957235 AS id, 0.786 AS count -- Essex County
), multipoints AS (
SELECT ST_GeneratePoints(ST_Transform(geom, 4326), l.count * 1000) g
FROM geo.us areas
JOIN list l ON areas.id = l.id
)
, points AS (
SELECT (ST_Dump(g)).geom p
FROM multipoints
)
INSERT INTO five_three_plus.location
SELECT
ROW_NUMBER() OVER () AS location_id,
NULL AS address_1,
NULL AS address_2,
city.name AS city,
split_part(state.iso3166_2, '-', 2) AS state,
NULL AS zip,
county.name AS county,
'USA' AS country,
NULL AS location_source_value,
ST_Y(p) AS latitude,
ST_X(p) AS longitude
FROM points p
JOIN LATERAL (
SELECT *
FROM geo.us area
WHERE ST_CONTAINS(area.geom, ST_SetSRID(ST_MakePoint(ST_X(p), ST_Y(p)), 4326))
AND adminlevel = 4
LIMIT 1
) state ON true
JOIN LATERAL (
SELECT *
FROM geo.us area
WHERE ST_CONTAINS(area.geom, ST_SetSRID(ST_MakePoint(ST_X(p), ST_Y(p)), 4326))
AND adminlevel = 6
LIMIT 1
) county ON true
LEFT JOIN LATERAL (
SELECT *
FROM geo.us area
WHERE ST_CONTAINS(area.geom, ST_SetSRID(ST_MakePoint(ST_X(p), ST_Y(p)), 4326))
AND adminlevel = 8
LIMIT 1
) city ON true;
Create new LOCATION_HISTORY table:
CREATE TABLE five_three_plus.location_history
(
location_history_id BIGINT NOT NULL,
location_id BIGINT NOT NULL,
relationship_type_concept_id INTEGER NOT NULL,
domain_id VARCHAR(50) NOT NULL,
entity_id BIGINT NOT NULL,
start_date DATE NOT NULL,
end_date DATE NULL
);
Drop NOT NULL constraint
on location_history.relationship_type_concept_id until there is an answer regarding concepts to be used in the field: http://forums.ohdsi.org/t/concepts-for-location-history-relationship-type-concept-id/5525
ALTER TABLE five_three_plus.location_history ALTER relationship_type_concept_id DROP NOT NULL;
Populate location_history
for person
and care_site
:
WITH loc_history AS (
SELECT
ROW_NUMBER() OVER () AS location_history_id,
l.location_id,
CAST(NULL AS INTEGER) AS relationship_type_concept_id,
CASE WHEN l.location_id % 2 = 0
THEN CAST('1990-01-01' AS DATE)
ELSE CAST('2000-01-01' AS DATE)
END AS start_date,
CASE WHEN l.location_id % 2 = 0
THEN CAST('1999-12-31' AS DATE)
ELSE NULL
END AS end_date
FROM five_three_plus.location l
)
, care_site_loc_history_raw AS (
SELECT
location_history_id,
location_id,
relationship_type_concept_id,
-- OMOP generated. Care site
57 AS domain_id,
start_date,
end_date
FROM loc_history
ORDER BY random()
LIMIT 100
)
, care_site_loc_history AS (
SELECT
*,
ROW_NUMBER() OVER () AS entity_id
FROM care_site_loc_history_raw
)
, person_loc_history AS (
SELECT
loc_history.*,
-- OMOP generated Person
56 AS domain_id,
ROW_NUMBER() OVER () AS entity_id
FROM loc_history
WHERE loc_history.location_history_id NOT IN (SELECT location_history_id FROM care_site_loc_history)
)
INSERT INTO five_three_plus.location_history
SELECT *
FROM (
SELECT
location_history_id,
location_id,
relationship_type_concept_id,
domain_id,
entity_id,
start_date,
end_date
FROM care_site_loc_history
UNION ALL
SELECT
location_history_id,
location_id,
relationship_type_concept_id,
domain_id,
entity_id,
start_date,
end_date
FROM person_loc_history
) l;
Create OMOP vocabulary for OSM:
INSERT INTO five_three_plus.vocabulary (vocabulary_id, vocabulary_name)
VALUES ('OSM', 'OpenStreetMap');
Create OSM concepts:
DELETE FROM five_three_plus.concept
WHERE concept_id > 2100000000;
WITH deduped_data AS (
SELECT MIN(id) id
FROM geo.us
GROUP BY
name,
adminlevel,
array_remove(
CAST(regexp_split_to_array(rpath, ',') AS INT []),
id
)
)
INSERT INTO five_three_plus.concept
SELECT
ROW_NUMBER() OVER () + 2100000000 AS concept_id,
area.name as concept_name,
'Geo' as domain_id,
'OSM' as vocabulary_id,
-- https://wiki.openstreetmap.org/wiki/Tag:boundary=administrative
CASE CAST(area.adminlevel AS INT)
WHEN 2 THEN 'Country'
WHEN 4 THEN 'State'
WHEN 5 THEN 'Agglomerate'
WHEN 6 THEN 'County'
WHEN 7 THEN 'Civil township'
WHEN 8 THEN 'Municipality'
WHEN 9 THEN 'Ward'
WHEN 10 THEN 'Neighborhood'
ELSE 'Unknown geo area'
END as concept_class_id,
'S' AS standard_concept,
area.id as concept_code,
CAST('1970-01-01' AS DATE) as valid_start_date,
CAST('2099-12-31' AS DATE) as valid_end_date,
NULL as invalid_reason
FROM deduped_data d
JOIN geo.us area ON area.id = d.id;
Populate concept_ancestor
:
DELETE FROM five_three_plus.concept_ancestor
WHERE ancestor_concept_id > 2100000000;
CREATE OR REPLACE FUNCTION array_uniq_stable(anyarray) RETURNS anyarray AS
$body$
SELECT
array_agg(distinct_value ORDER BY first_index)
FROM
(SELECT
value AS distinct_value,
min(index) AS first_index
FROM
unnest($1) WITH ORDINALITY AS input(value, index)
GROUP BY
value
) AS unique_input
;
$body$
LANGUAGE 'sql' IMMUTABLE STRICT;
WITH area_with_path_array AS (
SELECT array_remove(
CAST(regexp_split_to_array(
-- To transform exclude children from paths e.g. transform rpath = 5908799,5908815,251075,270623,165466,148838,0 for id=5908815 into rpath=5908815,251075,270623,165466,148838,0
SUBSTR(rpath, POSITION(CAST(id AS VARCHAR) IN rpath))
, ',') AS INT []),
0
) path_array,
*
FROM geo.us
)
, area AS (
SELECT
ancestor_concept.concept_id AS ancestor_concept_id,
descendant_concept.concept_id AS descendant_concept_id,
level - 1 min_levels_of_separation,
level - 1 max_levels_of_separation
FROM area_with_path_array area
JOIN LATERAL unnest(array_uniq_stable(area.path_array))
WITH ORDINALITY AS a(ancestor_concept_code, level) ON TRUE
JOIN five_three_plus.concept ancestor_concept
ON ancestor_concept.concept_code = CAST(ancestor_concept_code AS VARCHAR) AND
ancestor_concept.vocabulary_id = 'OSM'
JOIN five_three_plus.concept descendant_concept
ON descendant_concept.concept_code = CAST(area.id AS VARCHAR) AND descendant_concept.vocabulary_id = 'OSM'
)
INSERT INTO five_three_plus.concept_ancestor
SELECT *
FROM area;
Populate concept relationships:
INSERT INTO five_three_plus.concept_relationship
SELECT
c.concept_id AS concept_id_1,
c.concept_id AS concept_id_2,
'Maps to' AS relationship_id,
CAST('1970-01-01' AS DATE) AS valid_start_date,
CAST('2099-12-31' AS DATE) AS valid_end_date
FROM five_three_plus.concept_ancestor ca
JOIN five_three_plus.concept c ON c.concept_id = ca.ancestor_concept_id AND c.vocabulary_id = 'OSM'
WHERE min_levels_of_separation = 0 AND max_levels_of_separation = 0;
Create new Units:
-- Unit. Meters
INSERT INTO five_three_plus.concept (concept_id, concept_name, domain_id, vocabulary_id, concept_class_id, standard_concept, concept_code, valid_start_date, valid_end_date, invalid_reason)
VALUES (2110000000, 'Meters', 'Geo', 'Geo unit', 'Unit', 'S', 'OMOP generated', CAST('1970-01-01' AS DATE), CAST('2099-12-31' AS DATE), NULL );
Load geometries into results schema:
CREATE TABLE five_three_plus_results.area_polygon AS
SELECT c.concept_id, ST_AsGeoJSON(area.geom) polygon
FROM geo.us area
JOIN five_three_plus.concept c ON c.concept_code = CAST(id AS VARCHAR) AND c.vocabulary_id = 'OSM';
Create new derived tables:
CREATE TABLE five_three_plus.location_distance (
person_location_id INTEGER,
care_site_location_id INTEGER,
distance NUMERIC,
unit_concept_id INTEGER
);
Precalculate relations between LOCATION and lowest level areas:
-- TODO: how to handle points which are not in lowest level areas?
CREATE INDEX area_polygon_geom ON five_three_plus_results.area_polygon USING GIST (ST_SetSrid(ST_GeomFromGeoJSON(area_polygon.polygon), 4326));
CREATE INDEX location_point ON five_three_plus.location USING GIST (ST_SetSrid(ST_MakePoint(longitude, latitude), 4326));
WITH lowest_level_areas AS (
SELECT area.*
FROM five_three_plus_results.area_polygon area
WHERE NOT EXISTS (SELECT * FROM five_three_plus.concept_ancestor WHERE ancestor_concept_id = area.concept_id AND min_levels_of_separation > 0)
)
, loc_area AS (
SELECT l.location_id, area.concept_id
FROM five_three_plus.location l
JOIN lowest_level_areas area ON ST_CONTAINS(ST_SetSrid(ST_GeomFromGeoJSON(area.polygon), 4326), ST_SetSrid(ST_MakePoint(l.longitude, l.latitude), 4326))
)
UPDATE five_three_plus.location
SET region_concept_id = la.concept_id
FROM loc_area la
WHERE location.location_id = la.location_id;
Precalculate distance between care_site
and person
:
WITH care_site_person_locations AS (
SELECT DISTINCT cs.care_site_id, cs_lh.location_id as care_site_location_id, p.person_id, p.location_id AS person_location_id
FROM
five_three_plus.visit_occurrence v
JOIN five_three_plus.care_site cs ON cs.care_site_id = v.care_site_id
JOIN five_three_plus.location_history cs_lh ON cs_lh.domain_id = /* OMOP generated. Care site */ CAST(57 AS VARCHAR) AND cs.care_site_id = cs_lh.entity_id
JOIN five_three_plus.location cs_l ON cs_l.location_id = cs_lh.location_id
JOIN five_three_plus.person p ON p.person_id = v.person_id
JOIN five_three_plus.location_history p_lh ON p_lh.domain_id = /* OMOP generated. Person */ CAST(56 AS VARCHAR) AND p.person_id = p_lh.entity_id
JOIN five_three_plus.location p_l ON p_l.location_id = p_lh.location_id
)
INSERT INTO five_three_plus.location_distance
SELECT
-- person_id,
person_location_id,
-- person_location.longitude, person_location.latitude,
-- care_site_id,
care_site_location_id,
-- care_site_location.longitude, care_site_location.latitude,
ST_DISTANCE(
ST_GeographyFromText('POINT(' || person_location.longitude || ' ' || person_location.latitude || ')'),
ST_GeographyFromText('POINT(' || care_site_location.longitude || ' ' || care_site_location.latitude || ')')
),
2110000000
FROM care_site_person_locations
JOIN five_three_plus.location person_location ON person_location.location_id = person_location_id
JOIN five_three_plus.location care_site_location ON care_site_location.location_id = care_site_location_id;
Select all person who are currently living somewhere in Bucks county:
SELECT *
FROM five_three_plus.person p
JOIN five_three_plus.location_history lh
ON lh.domain_id = /* OMOP generated. Person */ CAST(56 AS VARCHAR) AND p.person_id = lh.entity_id
JOIN five_three_plus.location l
ON l.location_id = lh.location_id
JOIN five_three_plus.concept_ancestor ca ON ca.descendant_concept_id = l.region_concept_id
WHERE
-- Current location
lh.end_date IS NULL
AND
-- Somewhere in Bucks County
ca.ancestor_concept_id = 2100014076;
Although, the dataset used above to build Geo vocabulary contained pre-built hierarchy paths, the original OSM data doesn't contain those, same as there is no guarantee that another datasets (e.g. US Census) will contain the paths.
Therefore, there is a need to be able to calculate paths based on given polygons (boundaries). I've examined some approaches and haven't succeeded yet, but wanted to share the experience.
ST_CONTAINS
hasn't finished for the set of all US boundaries after 11 hours&&
which compares bounding boxes of given geometries (and uses indexes which makes it pretty fast) is not accurate enough for us - areas can be interpenetrating. Due to the same reasons ST_CONTAINS(parent_geom, ST_CENTROID(child_geom))
is not working properlyST_INTERSECTS
is pretty slow (~1 hour for the dataset) and gives incorrect results:
SELECT *
FROM
/* NYC state */ (SELECT * FROM geo.us WHERE id = 61320) nyc JOIN /* Country in Vermont state */ (SELECT * FROM geo.us WHERE id = 60803) benn
ON ST_INTERSECTS(nyc.geom, benn.geom)
Friends: Dug around a little in the OSM world. Talked to the folks at the Wambachers website mentioned above. Couple of things:
Requirements 3, 6 and 7 were completed
StandardizedAnalysisAPI: https://github.com/OHDSI/StandardizedAnalysisAPI/pull/22 https://github.com/OHDSI/StandardizedAnalysisAPI/pull/24 https://github.com/OHDSI/StandardizedAnalysisAPI/pull/25 https://github.com/OHDSI/StandardizedAnalysisAPI/pull/26
circe-be: https://github.com/OHDSI/circe-be/pull/75 https://github.com/OHDSI/circe-be/pull/76 https://github.com/OHDSI/circe-be/pull/80 https://github.com/OHDSI/circe-be/pull/81
WebAPI: https://github.com/OHDSI/WebAPI/pull/1050/files https://github.com/OHDSI/WebAPI/pull/1061
Atlas: https://github.com/OHDSI/Atlas/pull/1669 https://github.com/OHDSI/Atlas/pull/1697
WebAPI-extensions: https://github.com/odysseusinc/WebAPI-extensions
Requirements 2, 4, 5 related PRs (work in progress, to be updated):
Atlas:
WebAPI:
Arachne Commons:
Execution Engine:
WebPlatform:
GisService:
Geospatial functionality in Atlas: integration of Aegis
Authors: Gowtham Rao, Seng Chan You, Jaehyeong Cho, Andrew Williams, Robert Miller, Pavel Grafkin, Gregory Klebanov
Motivation: During the 2018 OHDSI Symposium, Washington DC USA - J Cho, SC You, K Kim, Y Soh, D Kim, RW Park - presented a software demonstration called 'Application for Epidemiological Geographic Information System (AEGIS) - An open source spatial analysis tool based on CDM' . See AEGIS under related work.
AEGIS, was built to support 5.x version of the OMOP CDM. This version did not have latitude and longitude in location table. AEGIS developers used observation and fact_relationship table to design AEGIS using CDM 5.x. The OMOP CDM 6+ (released October 2018) has two location tables (location and location_history). The location table has fields for latitude and longitude. These new fields may be used to represent precise location of persons, providers or care_sites.
During the presentation, a decision was made to upgrade AEGIS to support CDM 6.x with new location table, and to evaluate if it was possible to integrate AEGIS like functionality, and the work of the OHDSI GIS workgroup into ATLAS.
Background: Spatial epidemiology is the description, analysis or surveillance of a populations health related factors such as medical service, diseases, in relation to other person level or area level factors like demographic, environmental exposure, behavioral determinants, socio-economic indicators, genetic and infectious risk factors. Two types of spatial epidemiology are discussed below.
Descriptive mapping, widely used in spatial epidemiology, is useful for establishing initial hypotheses about the patterns of incidence/prevalence in an area, or the correlation between exposure to specific factors and disease.
Cluster detection is a more advanced statistical method that may reveal geographic clusters, based on patterns and spatial correlation.
AEGIS is described here