OHDSI / GIS

https://ohdsi.github.io/GIS
Apache License 2.0

Define set of relationships necessary for Symposium use cases #225

Closed kzollove closed 11 months ago

kzollove commented 1 year ago

The relationships we need to represent are those between point locations (addresses of patients or care sites) and the locations of features. Features are typically polygons, lines, or points: for example, a county or state as a polygon, a highway as a line, or a monitoring station, care site, or other discrete location as a point.

Example: "point is within polygon" is a single relationship we will need to represent. This concept would be used to represent relationships such as "person lived within a county where the percentile of the poverty percentage estimate is greater than 0.75" or "person lived within a county where PM2.5 was greater than 10".
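
To make "point is within polygon" concrete, here is a minimal sketch of the test itself, using plain ray casting on longitude/latitude treated as planar coordinates. The function name and the rectangular "county" are hypothetical. A real deployment would more likely rely on a spatial database function (for example PostGIS `ST_Within`) than on hand-rolled code, and points lying exactly on the boundary are not handled specially here.

```python
from typing import List, Tuple

Ring = List[Tuple[float, float]]  # (longitude, latitude) vertices, not closed


def point_in_polygon(x: float, y: float, ring: Ring) -> bool:
    """Ray casting: count how many polygon edges a ray going in +x from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "county"; not a real boundary.
county = [(-123.0, 37.0), (-122.0, 37.0), (-122.0, 38.0), (-123.0, 38.0)]
home = (-122.425891, 37.774929)
print(point_in_polygon(*home, county))
```

An odd number of crossings means the point is inside, which is all the "within" relationship needs for a single, hole-free polygon.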

A second layer to this ask is temporal relationships: "person lived within X county for at least 5 years" and "person lived within X county where Y variable was greater than Z for at least 5 years".
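
As a rough illustration of the temporal layer, the sketch below checks "lived within X county for at least N years" against a list of residence periods. It assumes the spatial join has already assigned a county to each period and that only a single uninterrupted stay counts. The `Stay` type, its field names, and both function names are hypothetical and not part of any existing table definition.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Stay(NamedTuple):
    person_id: int
    county_id: str
    start_date: date
    end_date: Optional[date]  # None means the person still lives there


def add_years(d: date, years: int) -> date:
    """Shift a date by whole calendar years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, month=2, day=28)


def lived_at_least(stays: List[Stay], county_id: str, years: int, as_of: date) -> bool:
    """True if any single stay in the county spans at least `years` calendar years."""
    for stay in stays:
        if stay.county_id != county_id:
            continue
        end = stay.end_date or as_of
        if end >= add_years(stay.start_date, years):
            return True
    return False


history = [Stay(12345, "X", date(2018, 8, 21), date(2023, 8, 21))]
print(lived_at_least(history, "X", 5, as_of=date(2024, 1, 1)))
```

Comparing calendar dates rather than dividing a day count by 365.25 avoids a stay of exactly five years being rejected because of leap days. Whether separate stays in the same county should be summed is a design decision this sketch leaves open.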

This may be a rabbit hole of a problem. For the time being, focus on defining spatial relationships that are pertinent to the October Symposium use cases.

p-talapova commented 1 year ago

To represent such relationships we need:

  1. Location and location_history tables to store information about the locations, including longitude and latitude coordinates. The location_history table must capture multiple addresses for each person_id, noting the start and end date for each residence. This ensures that even if someone relocates, all their prior locations and associated exposures are traceable.

  2. Geography table to store information about geospatial features like polygons (counties or states), lines (highways), or points (monitoring stations, care sites, or other discrete locations). Each feature should possess a unique identifier, geometry data, and rich metadata (e.g. JSON). Also, we need to add all features as a separate OMOP vocabulary.

| geography_id | geometry | metadata | valid_start_date | valid_end_date |
|---|---|---|---|---|
| 101 | POINT(-122.425891 37.774929) | {"description": "Air Quality Monitoring Station", "station_name": "A"} | 2018-01-01 | 2023-01-01 |
| 102 | POINT(-121.425891 38.774929) | {"description": "Air Quality Monitoring Station", "station_name": "B"} | 2017-05-15 | 2022-05-15 |
| 103 | POLYGON(...) | {"description": "Region", "region_name": "A"} | 2015-06-10 | 2025-06-10 |
| 104 | POLYGON(...) | {"description": "Region", "region_name": "B"} | 2016-09-01 | 2026-09-01 |

Consider adding valid_start_date and valid_end_date time fields, especially for features that are temporal in nature, like mobile monitoring stations.

  3. Geom Relationship table to define the spatial relationships between geometries, specifically between point locations (like addresses) and geographic features (like regions or monitoring stations). An extended metadata field, (JSON or plain text), would be beneficial here. In addition to "Within", expand the vocabulary of geom_relationship_type to include "Adjacent to", "Intersects", "Overlaps", and other relevant types.

| geom_relationship_id | location_id | geography_id | geom_relationship_type | metadata |
|---|---|---|---|---|
| 201 | 1001 | 101 | Within | {"relationship_description": "Patient's residence", "feature": "monitoring station", "feature_name": "A"} |
| 202 | 1002 | 102 | Within | {"relationship_description": "Patient's residence", "feature": "monitoring station", "feature_name": "B"} |
| 203 | 1003 | 103 | Within | {"relationship_description": "Patient's residence", "feature": "region", "region_name": "A"} |
| 204 | 1004 | 104 | Within | {"relationship_description": "Patient's residence", "feature": "region", "region_name": "B"} |
| 205 | 1005 | 101 | Adjacent to | {"relationship_description": "Patient's property line", "feature": "monitoring station", "feature_name": "A"} |
| 207 | 1007 | 103 | Intersects | {"relationship_description": "Patient's commute route", "feature": "region", "region_name": "A"} |
| 209 | 1009 | 101 | Overlaps | {"relationship_description": "Patient's recreational area", "feature": "monitoring station", "feature_name": "A"} |

Additionally, we need the list of all possible geom_relationship_types to be added as a separate vocabulary.

  4. Exposure Table to detail exposure information, considering both spatial and temporal dimensions.

Source concept template: "Person lived within X county where Y variable was greater than Z for at least 5 years"

| Field | Example Value | Description |
|---|---|---|
| geom_relationship_id | 203 | A unique key given to a unique Geometry |
| location_id | 54321 | A unique key given to a unique Location, in this case County X |
| person_id | 12345 | A unique key given to the individual person who lived in County X |
| exposure_concept_id | XXXXXX (int) | A unique key given to the exposure record for the person/location |
| omop_concept_id | XXXXXX (int) | A unique key given to the standard OMOP equivalent exposure concept |
| exposure_start_date | 2018-08-21 | The start date of the exposure, for example, when the person started living in County X |
| exposure_end_date | 2023-08-21 | The end date of the exposure, for example, when the person moved out of County X or the analysis end date |
| exposure_type_concept_id | XXXXXX (int) | Identifies the origin of the exposure record (e.g. Census, EHR, Environmental data, Geospatial data, Satellite imagery, GIS mapping, Sensor network, Mobile device geolocation, LiDAR) |
| exposure_source_concept_id | XXXXXX (varchar) | A unique identifier for the exposure source value |
| exposure_source_value | 'Person lived within X county where Y variable was greater than Z for at least 5 years' | A verbatim value from the source data representing the exposure |
| threshold_value | Z (float) | A numeric value that serves as the threshold for the exposure |
| operator_concept_id | 4172704 | A standard OMOP concept_id specifying the comparison or operation applied to threshold_value |
| threshold_reference | 'EPA PM2.5 Limit' | Context for threshold_value, detailing whether the threshold arises from specific regulations or standards |
| value_as_number | M (float) | The numeric value of the exposure provided by the source |
| value_as_concept_id | NULL (int) | A standard OMOP equivalent of an exposure source value |
| value_as_string | NULL (varchar) | A categorical value of the exposure provided by the source |
| unit_concept_id | NULL | A standard concept ID for the unit of the value (in this case NULL) |
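
To show how threshold_value, operator_concept_id and value_as_number might work together, here is a minimal sketch. It assumes operator_concept_id 4172704 means "greater than", as the example in this comment pairs it with "greater than Z". Any other operator concept would have to be looked up in the vocabulary and is deliberately left out. The `ExposureRecord` class and `exceeds_threshold` function are hypothetical names and carry only the fields needed for the comparison.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Assumption: 4172704 is the "greater than" operator concept, as implied by the
# example in this thread. Other concept_ids are intentionally not guessed here.
OPERATORS = {4172704: operator.gt}


@dataclass
class ExposureRecord:
    person_id: int
    value_as_number: Optional[float]  # M in the table above
    threshold_value: float            # Z in the table above
    operator_concept_id: int


def exceeds_threshold(rec: ExposureRecord) -> bool:
    """Apply the record's operator to (value_as_number, threshold_value)."""
    compare = OPERATORS.get(rec.operator_concept_id)
    if compare is None:
        raise KeyError(f"unmapped operator_concept_id: {rec.operator_concept_id}")
    if rec.value_as_number is None:
        return False  # no measured value, so the condition cannot be confirmed
    return compare(rec.value_as_number, rec.threshold_value)


# "PM2.5 was greater than 10", with a made-up measured value of 12.3
rec = ExposureRecord(person_id=12345, value_as_number=12.3,
                     threshold_value=10.0, operator_concept_id=4172704)
print(exceeds_threshold(rec))
```

Keeping the operator as a concept rather than hard-coding ">" lets the same record shape express "less than" or "equal to" exposures once those concepts are mapped.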

Please share your thoughts; they are greatly appreciated.

cgreich commented 1 year ago

Friends:

This is very nice and thoughtful. But I would put a caveat out there:

The location of a patient, or the history of locations, is data. The polygon-related information in the GEOGRAPHY and GEOM_RELATIONSHIP tables are not data. The fact that my address is located in the polygon "City of Boston" or "State of Massachusetts" is completely independent from me living here. Rather, it is reference data. We don't store reference data in the OMOP CDM, except the vocabulary tables CONCEPT and CONCEPT_RELATIONSHIP. Those actually have that information today in the OMP vocabulary. Have you looked at that?

If you want to create standardized tables to make it easier for your use cases you would have to submit an OMOP Expansion. Those are the tables that fall out of the CDM (and its Closed-World assumption). That's what I would do.

The EXPOSURE table (I would call it ENVIRONMENTAL_EXPOSURE) is a different beast. It could store for each person the actual exposure over time. As far as the fields go:

kzollove commented 1 year ago

Thanks for thinking about this and putting it all together @p-talapova !

Some thoughts:

> 1. Location and location_history tables to store information about the locations, including longitude and latitude coordinates. The location_history table must capture multiple addresses for each person_id, noting the start and end date for each residence. This ensures that even if someone relocates, all their prior locations and associated exposures are traceable.

a. 100% location_history should be in a CDM. If it's not, should the GIS tools still work, but with no temporal awareness? As I am building them now, they can function without location_history.

b. Latitude and longitude coordinates. As the tools are being built now, there need to be geocoded addresses in the gaiadb. We have a mechanism for geocoding (which is just the degauss tool wrapped with an R function) that lets a user get lat/lon for each address in the location table and builds a new table of POINT geometries in gaiadb. If the CDM contains lat/lon, these values are used instead of geocoding the string address. That's all to say it is not a requirement (as of now) that sites put lat/lon in the CDM.




> 2. Geography table to store information about geospatial features like polygons (counties or states), lines (highways), or points (monitoring stations, care sites, or other discrete locations). Each feature should possess a unique identifier, geometry data, and rich metadata (e.g. JSON). Also, we need to add all features as a separate OMOP vocabulary.

This is different from how we've been storing spatial information in gaiadb. Notably, metadata (mostly) stays in the data catalog and functional information goes into geom_* tables, broken into tables by the time period for which they are valid (e.g. geom_us_county_2011 contains data for 2011-01-01 through 2011-12-31; these dates are specified in the geom_spec, which is JSON in the data_source table).

image

and for reference, the data catalog (data_source) schema:

image

I've always been a little uncomfortable with which attributes end up in the data catalog vs the functional geom_* tables and am happy to discuss or argue how it should be changed.

For anyone's reference here are some diagrams to hopefully explain the relationships between the data catalog and geom_* tables. Here is the full table reference.
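
For readers without access to the images, the idea of time-sliced geom_* tables described above can be sketched as a lookup from a date to the table that is valid on that date. This is only an illustration: the JSON keys `valid_start` and `valid_end`, the `geom_table` field, the `geom_us_county_2012` row, and the function name are all hypothetical and do not reflect the actual geom_spec layout.

```python
import json
from datetime import date
from typing import Dict, List, Optional

# Hypothetical stand-in for rows of the data_source catalog. Only the table name
# geom_us_county_2011 and its 2011 date range come from the comment above.
DATA_SOURCE: List[Dict[str, str]] = [
    {"geom_table": "geom_us_county_2011",
     "geom_spec": json.dumps({"valid_start": "2011-01-01", "valid_end": "2011-12-31"})},
    {"geom_table": "geom_us_county_2012",
     "geom_spec": json.dumps({"valid_start": "2012-01-01", "valid_end": "2012-12-31"})},
]


def geom_table_for(day: date, catalog: List[Dict[str, str]] = DATA_SOURCE) -> Optional[str]:
    """Return the first catalog table whose validity window contains `day`."""
    for row in catalog:
        spec = json.loads(row["geom_spec"])
        start = date.fromisoformat(spec["valid_start"])
        end = date.fromisoformat(spec["valid_end"])
        if start <= day <= end:
            return row["geom_table"]
    return None  # no geometry is catalogued for that date


print(geom_table_for(date(2011, 6, 1)))
```

A spatiotemporal join would call something like this once per residence period to decide which boundary set to join against.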




> 3. Geom Relationship table to define the spatial relationships between geometries, specifically between point locations (like addresses) and geographic features (like regions or monitoring stations). An extended metadata field, (JSON or plain text), would be beneficial here. In addition to "Within", expand the vocabulary of geom_relationship_type to include "Adjacent to", "Intersects", "Overlaps", and other relevant types.

Hm. I haven't thought about the possible need to have a table for this. I think that the relationship type alone (which goes into the exposure_occurrence table) is the important piece.

My thinking is we don't care that patient A lives in Boston and patient B lives in Miami. We care that patient A lives in an area where percentage of households under the poverty line is 75% and patient B lives in an area where percentage of households under the poverty line is 25%. The latter gets represented in exposure_occurrence if we have the correct geom_relationship_type concept.

Maybe it is worth knowing that "an area" refers to a city in the above example, vs a state or a county. I still think that information could go into a concept and that we don't need a separate record for every possible combination of point A in polygon B.

> Additionally, we need the list of all possible geom_relationship_types to be added as a separate vocabulary.

Definitely.




> 4. Exposure Table to detail exposure information, considering both spatial and temporal dimensions.
>
> Source concept template: "Person lived within X county where Y variable was greater than Z for at least 5 years"

My working exp_occurrence table is slightly different than yours:

image

But fortunately it looks like we are very much on the same page!

kzollove commented 1 year ago

Thanks for weighing in @cgreich !

> The location of a patient, or the history of locations, is data. The polygon-related information in the GEOGRAPHY and GEOM_RELATIONSHIP tables are not data. The fact that my address is located in the polygon "City of Boston" or "State of Massachusetts" is completely independent from me living here. Rather, it is reference data. We don't store reference data in the OMOP CDM, except the vocabulary tables CONCEPT and CONCEPT_RELATIONSHIP. Those actually have that information today in the OMP vocabulary. Have you looked at that?

I admittedly have not looked much into the geography concepts in the vocab, but I think our general gripe with them is that they don't cover everything. Part of our work is to create a data catalog and standardized representation for staging tables. This way, a user can catalog polygon geometries for counties, states, census tracts, tribal tracts, etc, or point geometries (such as environmental monitoring stations) and then stage them and use them in spatiotemporal joins with attribute data and geocoded addresses. We haven't started working with any popular international boundary types, but our representation is extensible so we're hoping it really won't be more difficult than adding a vocabulary word for the general boundary type class, not every existing instance.

> If you want to create standardized tables to make it easier for your use cases you would have to submit an OMOP Expansion. Those are the tables that fall out of the CDM (and its Closed-World assumption). That's what I would do.

Yep, that's what we're thinking for the exposure_occurrence table

jaygee-on-github commented 1 year ago

If we want to describe the relationships between geometries in general, has anyone looked at DE-9IM?

I came across this standard through schema.org which implements it here.
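
For anyone unfamiliar with DE-9IM: it describes the relationship between two geometries as a 9-character matrix (interior/boundary/exterior of A against interior/boundary/exterior of B), and named predicates such as "within" are patterns over that matrix. The sketch below is a small pattern matcher. The function name is made up, and the example matrices are the ones a point strictly inside or strictly outside a polygon would produce, written out by hand rather than computed by a GIS library.

```python
# Standard DE-9IM patterns for a few named predicates.
WITHIN = "T*F**F***"
CONTAINS = "T*****FF*"
DISJOINT = "FF*FF****"


def matches_de9im(matrix: str, pattern: str) -> bool:
    """Check a 9-character DE-9IM matrix against a pattern.

    In the pattern, 'T' accepts any non-empty intersection (dimension 0, 1 or 2),
    'F' requires an empty intersection, '*' accepts anything, and a digit
    requires exactly that dimension.
    """
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("DE-9IM matrices and patterns have exactly 9 characters")
    for cell, want in zip(matrix.upper(), pattern.upper()):
        if want == "*":
            continue
        if want == "T":
            if cell not in ("0", "1", "2"):
                return False
        elif cell != want:
            return False
    return True


point_inside_polygon = "0FFFFF212"   # hand-written matrix, point strictly inside
point_outside_polygon = "FF0FFF212"  # hand-written matrix, point strictly outside
print(matches_de9im(point_inside_polygon, WITHIN))
print(matches_de9im(point_outside_polygon, DISJOINT))
```

One appeal for a geom_relationship_type vocabulary is that each relationship concept could carry its pattern, so "Within", "Intersects" and similar terms would have an unambiguous definition.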

There is a set of temporal relationships which are largely analogous called Allen's Interval Algebra.
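
Allen's Interval Algebra names thirteen possible relations between two time intervals. A minimal sketch of classifying a pair of intervals is below. The function name and the tuple representation are my own choices, and the example compares a residence period with a feature's validity window using dates that appear earlier in this thread.

```python
from datetime import date
from typing import Tuple

Interval = Tuple[date, date]  # (start, end) with start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation that holds from interval a to interval b."""
    (a_start, a_end), (b_start, b_end) = a, b
    if not (a_start < a_end and b_start < b_end):
        raise ValueError("intervals must have start < end")
    # No shared interior: a lies entirely on one side of b.
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    # From here on the two intervals share some interior.
    if a_start == b_start:
        if a_end == b_end:
            return "equals"
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if a_start < b_start:
        return "overlaps" if a_end < b_end else "contains"
    return "during" if a_end < b_end else "overlapped-by"


residence = (date(2018, 8, 21), date(2023, 8, 21))
region_valid = (date(2015, 6, 10), date(2025, 6, 10))
print(allen_relation(residence, region_valid))
```

The same comparison works on numbers or timestamps, since it relies only on ordering.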