Closed kzollove closed 11 months ago
To represent such relationships we need:
Location and location_history tables to store information about the locations, including longitude and latitude coordinates. The location_history table must capture multiple addresses for each person_id, noting the start and end date for each residence. This ensures that even if someone relocates, all their prior locations and associated exposures are traceable.
Geography table to store information about geospatial features like polygons (counties or states), lines (highways), or points (monitoring stations, care sites, or other discrete locations). Each feature should possess a unique identifier, geometry data, and rich metadata (e.g. JSON). Also, we need to add all features as a separate OMOP vocabulary.
geography_id | geometry | metadata | valid_start_date | valid_end_date |
---|---|---|---|---|
101 | POINT(-122.425891 37.774929) | {"description": "Air Quality Monitoring Station", "station_name": "A"} | 2018-01-01 | 2023-01-01 |
102 | POINT(-121.425891 38.774929) | {"description": "Air Quality Monitoring Station", "station_name": "B"} | 2017-05-15 | 2022-05-15 |
103 | POLYGON(...) | {"description": "Region", "region_name": "A"} | 2015-06-10 | 2025-06-10 |
104 | POLYGON(...) | {"description": "Region", "region_name": "B"} | 2016-09-01 | 2026-09-01 |
Consider adding valid_start_date and valid_end_date time fields, especially for features that are temporal in nature, like mobile monitoring stations.
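As a rough illustration of how the validity fields could be used, here is a minimal Python sketch that filters features by date. The `GeographyFeature` class, the `valid_on` helper, the inclusive date bounds, and the convention that a missing end date means "still valid" are all assumptions made for this example, not part of the proposal. Coordinates are written in standard WKT, which separates longitude and latitude with a space.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GeographyFeature:
    """Hypothetical in-memory view of one row of the proposed geography table."""
    geography_id: int
    geometry_wkt: str
    valid_start_date: date
    valid_end_date: Optional[date]  # assumption: None means "still valid"


def valid_on(feature: GeographyFeature, day: date) -> bool:
    """Return True if the feature is valid on `day` (inclusive bounds assumed)."""
    if day < feature.valid_start_date:
        return False
    return feature.valid_end_date is None or day <= feature.valid_end_date


features = [
    GeographyFeature(101, "POINT(-122.425891 37.774929)", date(2018, 1, 1), date(2023, 1, 1)),
    GeographyFeature(102, "POINT(-121.425891 38.774929)", date(2017, 5, 15), date(2022, 5, 15)),
]

# Only station 102 had started operating by the end of 2017.
active = [f.geography_id for f in features if valid_on(f, date(2017, 12, 31))]
print(active)  # [102]
```

A mobile monitoring station could then be represented as several rows sharing a station name in the metadata, each with its own geometry and validity window.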
geom_relationship_id | location_id | geography_id | geom_relationship_type | metadata |
---|---|---|---|---|
201 | 1001 | 101 | Within | {"relationship_description": "Patient's residence", "feature": "monitoring station", "feature_name": "A"} |
202 | 1002 | 102 | Within | {"relationship_description": "Patient's residence", "feature": "monitoring station", "feature_name": "B"} |
203 | 1003 | 103 | Within | {"relationship_description": "Patient's residence", "feature": "region", "region_name": "A"} |
204 | 1004 | 104 | Within | {"relationship_description": "Patient's residence", "feature": "region", "region_name": "B"} |
205 | 1005 | 101 | Adjacent to | {"relationship_description": "Patient's property line", "feature": "monitoring station", "feature_name": "A"} |
207 | 1007 | 103 | Intersects | {"relationship_description": "Patient's commute route", "feature": "region", "region_name": "A"} |
209 | 1009 | 101 | Overlaps | {"relationship_description": "Patient's recreational area", "feature": "monitoring station", "feature_name": "A"} |
Additionally, we need the list of all possible geom_relationship_types to be added as a separate vocabulary.
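To make the "Within" relationship concrete, the sketch below derives it with a plain ray-casting point-in-polygon test. It is an illustration only: the square region ring is invented, coordinates are treated as planar, points exactly on the boundary are not handled rigorously, and a real implementation would more likely rely on a spatial database or geometry library. The names `point_in_polygon` and `relationship_type` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray-casting test: count how many polygon edges a ray going east crosses."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def relationship_type(point: Point, ring: List[Point]) -> Optional[str]:
    """Return "Within" if the point falls inside the ring, otherwise None."""
    return "Within" if point_in_polygon(point, ring) else None


# Invented square region used only for this example.
region_ring = [(-123.0, 37.0), (-122.0, 37.0), (-122.0, 38.0), (-123.0, 38.0)]

residence_a = (-122.425891, 37.774929)
residence_b = (-121.425891, 38.774929)

print(relationship_type(residence_a, region_ring))  # Within
print(relationship_type(residence_b, region_ring))  # None
```

Relationship types such as "Adjacent to" or "Intersects" need more than a point test, which is one reason a controlled vocabulary of relationship types is useful.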
Source concept template: "Person lived within X county where Y variable was greater than Z for at least 5 years"
Field | Example Value | Description |
---|---|---|
geom_relationship_id | 203 | A unique key given to a unique Geometry |
location_id | 54321 | A unique key given to a unique Location, in this case County X |
person_id | 12345 | A unique key given to the individual person who lived in County X |
exposure_concept_id | XXXXXX (int) | A unique key given to the exposure record for the person/location |
omop_concept_id | XXXXXX (int) | A unique key given to the standard OMOP equivalent exposure concept |
exposure_start_date | 2018-08-21 | The start date of the exposure, for example, when the person started living in County X |
exposure_end_date | 2023-08-21 | The end date of the exposure, for example, when the person moved out of County X or the analysis end date |
exposure_type_concept_id | XXXXXX (int) | Identifies the origin of the exposure record (e.g. Census, EHR, Environmental data, Geospatial data, Satellite imagery, GIS mapping, Sensor network, Mobile device geolocation, LiDAR) |
exposure_source_concept_id | XXXXXX (varchar) | A unique identifier for the exposure source value |
exposure_source_value | 'Person lived within X county where Y variable was greater than Z for at least 5 years' | A verbatim value from the source data representing the exposure |
threshold_value | Z (float) | A numeric value that serves as a threshold for the exposure |
operator_concept_id | 4172704 | A standard OMOP concept_id specifying the type of comparison or operation applied to threshold_value |
threshold_reference | 'EPA PM2.5 Limit' | Provides context for threshold_value, detailing whether the threshold arises from specific regulations or standards |
value_as_number | M (float) | The numeric value of the exposure as provided by the source |
value_as_concept_id | NULL (int) | A standard OMOP equivalent of an exposure source value |
value_as_string | NULL (varchar) | A categorical value of the exposure provided by the source |
unit_concept_id | NULL | A standard concept ID for the unit of the value (in this case NULL) |
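The following Python sketch shows how threshold_value, operator_concept_id, and value_as_number might work together when evaluating a record. It is a sketch under stated assumptions: the `ExposureRecord` class covers only a subset of the proposed fields, the numeric values 10.0 and 12.5 are invented stand-ins for Z and M, and the mapping of concept 4172704 to "greater than" is assumed from the source concept template and should be checked against the vocabulary in use.

```python
import operator
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed mapping from operator_concept_id to a comparison function.
# Only the concept used in the example table is listed; verify before relying on it.
OPERATORS = {
    4172704: operator.gt,  # assumed to mean "greater than"
}


@dataclass
class ExposureRecord:
    """Hypothetical subset of the proposed exposure table."""
    person_id: int
    location_id: int
    exposure_start_date: date
    exposure_end_date: date
    threshold_value: float
    operator_concept_id: int
    value_as_number: Optional[float]


def meets_threshold(record: ExposureRecord) -> bool:
    """Apply the record's operator to (value_as_number, threshold_value)."""
    if record.value_as_number is None:
        return False
    compare = OPERATORS[record.operator_concept_id]
    return compare(record.value_as_number, record.threshold_value)


record = ExposureRecord(
    person_id=12345,
    location_id=54321,
    exposure_start_date=date(2018, 8, 21),
    exposure_end_date=date(2023, 8, 21),
    threshold_value=10.0,      # invented stand-in for Z
    operator_concept_id=4172704,
    value_as_number=12.5,      # invented stand-in for M
)

print(meets_threshold(record))  # True
```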
Please share your thoughts, they are greatly appreciated.
Friends:
This is very nice and thoughtful. But I would put a caveat out there:
The location of a patient, or the history of locations, is data. The polygon-related information in the GEOGRAPHY and GEOM_RELATIONSHIP tables is not data. The fact that my address is located in the polygon "City of Boston" or "State of Massachusetts" is completely independent from me living here. Rather, it is reference data. We don't store reference data in the OMOP CDM, except the vocabulary tables CONCEPT and CONCEPT_RELATIONSHIP. Those actually have that information today in the OMOP vocabulary. Have you looked at that?
If you want to create standardized tables to make it easier for your use cases you would have to submit an OMOP Expansion. Those are the tables that fall out of the CDM (and its Closed-World assumption). That's what I would do.
The EXPOSURE table (I would call it ENVIRONMENTAL_EXPOSURE) is a different beast. It could store for each person the actual exposure over time. As far as the fields go:
Thanks for thinking about this and putting it all together @p-talapova !
Some thoughts:
- Location and location_history tables to store information about the locations, including longitude and latitude coordinates. The location_history table must capture multiple addresses for each person_id, noting the start and end date for each residence. This ensures that even if someone relocates, all their prior locations and associated exposures are traceable.
a. 100% location_history should be in a CDM. If it's not, should the GIS tools still work, but with no temporal awareness? As I am building them now, they can function w/o location_history.
b. Latitude and longitude coordinates. As the tools are being built now, there need to be geocoded addresses in the gaiadb. We have a mechanism for geocoding (which is just the degauss tool wrapped with an R function) which allows a user to get lat/lon for each address in the location table and builds a new table of POINT geometries in gaiadb. If the CDM contains lat/lon, these values are used instead of geocoding the string address. That's all to say it is not a requirement (as of now) that sites put lat/lon in the CDM.
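The "use lat/lon from the CDM if present, otherwise geocode" behaviour described in (b) can be sketched roughly as follows. This is not the gaiadb or DeGAUSS code: the `resolve_coordinates` function, the dictionary keys, and the `stub_geocode` stand-in are all hypothetical and exist only to show the fallback order.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def resolve_coordinates(
    location: Dict[str, object],
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Prefer coordinates already stored on the location row; geocode otherwise."""
    lat = location.get("latitude")
    lon = location.get("longitude")
    if lat is not None and lon is not None:
        return (float(lat), float(lon))
    # No stored coordinates: fall back to geocoding the address string.
    return geocode(str(location["address"]))


def to_wkt_point(coords: Coordinates) -> str:
    """WKT puts x (longitude) before y (latitude), separated by a space."""
    lat, lon = coords
    return f"POINT({lon} {lat})"


def stub_geocode(address: str) -> Optional[Coordinates]:
    """Stand-in for a real geocoder; returns a fixed, made-up coordinate."""
    return (42.0, -71.0)


with_coords = {"latitude": 37.774929, "longitude": -122.425891, "address": "unused"}
without_coords = {"latitude": None, "longitude": None, "address": "1 Example St"}

print(to_wkt_point(resolve_coordinates(with_coords, stub_geocode)))
print(to_wkt_point(resolve_coordinates(without_coords, stub_geocode)))
```

Passing the geocoder in as a function keeps the fallback logic testable without calling an external service.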
- Geography table to store information about geospatial features like polygons (counties or states), lines (highways), or points (monitoring stations, care sites, or other discrete locations). Each feature should possess a unique identifier, geometry data, and rich metadata (e.g. JSON). Also, we need to add all features as a separate OMOP vocabulary.
This is different than how we've been storing spatial information in gaiadb. Notably, metadata (mostly) stays in the data catalog and functional information goes into geom_* tables, broken into tables by the time period for which they are valid (e.g. geom_us_county_2011 contains data for 2011-01-01 through 2011-12-31; these dates are specified in the geom_spec, which is JSON in the data_source table).
and for reference, the data catalog (data_source) schema:
I've always been a little uncomfortable with which attributes end up in the data catalog vs the functional geom_* tables and am happy to discuss or argue how it should be changed.
For anyone's reference here are some diagrams to hopefully explain the relationships between the data catalog and geom_* tables. Here is the full table reference.
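To illustrate the period-partitioned layout described above, where each geom_* table covers a validity window recorded in the catalog, here is a small Python sketch that picks the table for a given date. The key names `valid_start` and `valid_end`, the `table_for` helper, and the 2012 entry are invented for the example; the actual geom_spec structure is not shown in this thread.

```python
import json
from datetime import date
from typing import Dict, List, Optional

# Hypothetical catalog rows: a table name plus a geom_spec JSON string.
catalog: List[Dict[str, str]] = [
    {
        "table": "geom_us_county_2011",
        "geom_spec": json.dumps({"valid_start": "2011-01-01", "valid_end": "2011-12-31"}),
    },
    {
        "table": "geom_us_county_2012",  # invented second entry
        "geom_spec": json.dumps({"valid_start": "2012-01-01", "valid_end": "2012-12-31"}),
    },
]


def table_for(day: date, entries: List[Dict[str, str]]) -> Optional[str]:
    """Return the first table whose validity window contains `day`, else None."""
    for entry in entries:
        spec = json.loads(entry["geom_spec"])
        start = date.fromisoformat(spec["valid_start"])
        end = date.fromisoformat(spec["valid_end"])
        if start <= day <= end:
            return entry["table"]
    return None


print(table_for(date(2011, 6, 15), catalog))  # geom_us_county_2011
print(table_for(date(2013, 1, 1), catalog))   # None
```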
- Geom Relationship table to define the spatial relationships between geometries, specifically between point locations (like addresses) and geographic features (like regions or monitoring stations). An extended metadata field, (JSON or plain text), would be beneficial here. In addition to "Within", expand the vocabulary of geom_relationship_type to include "Adjacent to", "Intersects", "Overlaps", and other relevant types.
Hm. I haven't thought about the possible need to have a table for this. I think that the relationship type alone (which goes into the exposure_occurrence table) is the important piece.
My thinking is we don't care that patient A lives in Boston and patient B lives in Miami. We care that patient A lives in an area where percentage of households under the poverty line is 75% and patient B lives in an area where percentage of households under the poverty line is 25%. The latter gets represented in exposure_occurrence if we have the correct geom_relationship_type concept.
Maybe it is worth it to know that "an area" refers to a city in the above example, vs a state or a county. I still think that information could go into a concept and that we don't need a separate record to record every possible combination of point A in polygon B
Additionally, we need the list of all possible geom_relationship_types to be added as a separate vocabulary.
Definitely.
- Exposure Table to detail exposure information, considering both spatial and temporal dimensions.
Source concept template: "Person lived within X county where Y variable was greater than Z for at least 5 years"
My working exp_occurrence table is slightly different than yours:
But fortunately it looks like we are very much on the same page!
Thanks for weighing in @cgreich !
The location of a patient, or the history of locations, is data. The polygon-related information in the GEOGRAPHY and GEOM_RELATIONSHIP tables is not data. The fact that my address is located in the polygon "City of Boston" or "State of Massachusetts" is completely independent from me living here. Rather, it is reference data. We don't store reference data in the OMOP CDM, except the vocabulary tables CONCEPT and CONCEPT_RELATIONSHIP. Those actually have that information today in the OMOP vocabulary. Have you looked at that?
I admittedly have not looked much into the geography concepts in the vocab, but I think our general gripe with them is that they don't cover everything. Part of our work is to create a data catalog and standardized representation for staging tables. This way, a user can catalog polygon geometries for counties, states, census tracts, tribal tracts, etc, or point geometries (such as environmental monitoring stations) and then stage them and use them in spatiotemporal joins with attribute data and geocoded addresses. We haven't started working with any popular international boundary types, but our representation is extensible so we're hoping it really won't be more difficult than adding a vocabulary word for the general boundary type class, not every existing instance.
If you want to create standardized tables to make it easier for your use cases you would have to submit an OMOP Expansion. Those are the tables that fall out of the CDM (and its Closed-World assumption). That's what I would do.
Yep, that's what we're thinking for the exposure_occurrence table
If we want to describe the relationships between geometries in general, has anyone looked at DE-9IM?
I came across this standard through schema.org which implements it here.
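For readers unfamiliar with DE-9IM: it describes the relationship between two geometries as a 9-character matrix (interior, boundary, exterior of one against the other), and named predicates are patterns over that matrix. The Python sketch below matches a matrix string against a few of the standard patterns. The function names are made up for this example, only a subset of predicates is covered, and computing the matrix itself from geometries is left to a spatial library.

```python
from typing import Dict, List

# Pattern cells: "T" = any non-empty intersection, "F" = empty, "*" = don't care.
PATTERNS: Dict[str, List[str]] = {
    "Within": ["T*F**F***"],
    "Contains": ["T*****FF*"],
    "Disjoint": ["FF*FF****"],
    "Touches": ["FT*******", "F**T*****", "F***T****"],
}


def matches(matrix: str, pattern: str) -> bool:
    """Check a 9-character DE-9IM matrix against a 9-character pattern."""
    if len(matrix) != 9 or len(pattern) != 9:
        raise ValueError("DE-9IM matrices and patterns have exactly 9 cells")
    for cell, want in zip(matrix, pattern):
        if want == "*":
            continue
        if want == "T":
            if cell == "F":
                return False
        elif cell != want:
            return False
    return True


def named_relationships(matrix: str) -> List[str]:
    """Return the sorted names of all covered predicates that hold for the matrix."""
    names = [
        name
        for name, patterns in PATTERNS.items()
        if any(matches(matrix, p) for p in patterns)
    ]
    # "Intersects" is defined as the negation of "Disjoint".
    if "Disjoint" not in names:
        names.append("Intersects")
    return sorted(names)


# Matrix for a point lying inside a polygon.
print(named_relationships("0FFFFF212"))  # ['Intersects', 'Within']
# Matrix for a point lying outside a polygon.
print(named_relationships("FF0FFF212"))  # ['Disjoint']
```

If the geom_relationship_type vocabulary were aligned with these predicates, each concept could carry its pattern as an attribute rather than requiring a stored row per point/polygon pair.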
There is a set of temporal relationships which are largely analogous called Allen's Interval Algebra.
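As a concrete reference for Allen's Interval Algebra, the sketch below classifies the relation between two date intervals into one of its 13 relations. It assumes proper intervals (start strictly before end) and treats the end date as the instant the interval stops; the function name and the relation labels are just one common naming choice.

```python
from datetime import date


def allen_relation(a_start: date, a_end: date, b_start: date, b_end: date) -> str:
    """Name the Allen relation of interval A with respect to interval B."""
    if not (a_start < a_end and b_start < b_end):
        raise ValueError("both intervals must have start < end")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only partial overlaps remain at this point.
    return "overlaps" if a_start < b_start else "overlapped-by"


# A residence period compared with the validity window of a region feature.
print(allen_relation(date(2018, 8, 21), date(2023, 8, 21),
                     date(2015, 6, 10), date(2025, 6, 10)))  # during
```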
The relationships we need to represent are those between point locations (addresses, either of patients or care sites) and locations of features (typically polygons, lines or points; example being a county or state as a polygon, a highway as a line, or a point as a monitoring station, care site, or other discrete location)
Example: "point is within polygon" would be a single relationship we will need to represent. This concept would be used to represent relationships such as "person lived within a county where percentile percentage of poverty estimate is greater than 0.75" or "person lived within a county where PM2.5 was greater than 10".
A second layer to this ask is temporal relationships: "person lived within X county for at least 5 years" and "person lived within X county where Y variable was greater than Z for at least 5 years".
This may be a rabbit hole of a problem. For the time being, focus on defining spatial relationships that are pertinent to the October Symposium use cases.
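For the temporal example "person lived within X county for at least 5 years", a minimal Python sketch of one possible check is given below. It makes several assumptions that would need agreement: residence periods are (county, start, end) tuples, "at least 5 years" means one continuous stay rather than a cumulative total, and two periods count as continuous when the next one starts on or before the previous end date. The helper names are hypothetical.

```python
from datetime import date
from typing import List, Tuple

Residence = Tuple[str, date, date]  # (county, start_date, end_date)


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def lived_at_least(periods: List[Residence], county: str, years: int) -> bool:
    """True if some continuous run of residence in `county` spans `years` or more."""
    spans = sorted((start, end) for c, start, end in periods if c == county)
    run_start = run_end = None
    for start, end in spans:
        if run_end is not None and start <= run_end:
            # Period continues the current run.
            run_end = max(run_end, end)
        else:
            # Gap in residence: start a new run.
            run_start, run_end = start, end
        if run_end >= add_years(run_start, years):
            return True
    return False


history = [
    ("X", date(2018, 8, 21), date(2023, 8, 21)),
    ("Y", date(2010, 1, 1), date(2012, 1, 1)),
    ("Y", date(2013, 1, 1), date(2016, 1, 1)),
]

print(lived_at_least(history, "X", 5))  # True
print(lived_at_least(history, "Y", 5))  # False
```

County Y fails here only because of the one-year gap; under a cumulative definition it would reach exactly five years, which shows why the intended semantics need to be stated in the concept.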