pavgra commented 5 years ago

Table care_site contains location_id field while rules for location_history also say that care_site can be a target domain:

The entities (and permissible domains) with related locations are: Persons (PERSON), Providers (PROVIDER), and Care Sites (CARE_SITE).

Therefore we get doubled link which leads to inconsistency: which location is true?

pavgra commented 5 years ago

The same story is with person. But the provider table looks good. No location_id there but it can be referenced from location_history

gowthamrao commented 5 years ago

@rtmill proposed the table with discussions here https://github.com/OHDSI/CommonDataModel/issues/181

I think the idea is that current location of person, provider and care_site are in location table. The historical location of person, provider and care_site are in location_history table.

i.e. if a person changes address: current address (on ETL) is in person table. The historical addresses are in location_history table with date spans (start_date and end_date).

gklebanov commented 5 years ago

I also think that the associations between person, care_site and provider is another time bound fact and these are changing throughout person's life. These should not be direct links between person and care_site and provider tables via FKs but rather through another fact table.

gowthamrao commented 5 years ago

@gklebanov @pavgra

Would this make it less ambiguous?

Field	Required	Type	Description
location_history_id	Yes	integer	A unique identifier (primary key) for each location_history.
entity_id	Yes	integer	The unique identifier for the entity. A foreign key that references either person_id, provider_id, or care_siteid, depending on location.
location_entity_concept_id	Yes	varchar(50)	A foreign key identifier to a concept in the CONCEPT table representing the identity of the field represented by LOCATION_ENTITY_ID. Either PERSON, PROVIDER, or CARE_SITE.
location_id	Yes	integer	A foreign key to the location table.
relationship_type_concept_id	Yes	varchar(50)	The type of relationship between location and entity.
start_date	Yes	date	The date the relationship started.
end_date	No	date	The date the relationship ended.

Changed

domain_id to location_entity_concept_id
added location_history_id as pk
changed some descriptions, to explicitly say entity_id is 'foreign key'
restructured the tables field positions

@clairblacketer @cgreich i think we should make these changes. These are simple changes and may be considered corrections

pavgra commented 5 years ago

@gowthamrao , @gklebanov , location_entity_id and location_entity_concept_id should be replaced with separate many-to-many tables to simplify SQL creation and use native RDBMS mechanisms to enforce data integrity

rtmill commented 5 years ago

@pavgra @gklebanov @gowthamrao

Table care_site contains location_id field while rules for location_history also say that care_site can be a target domain:

The entities (and permissible domains) with related locations are: Persons (PERSON), Providers (PROVIDER), and Care Sites (CARE_SITE).

Therefore we get doubled link which leads to inconsistency: which location is true?

The location history table stores relationships to locations from person, care site and provider. The location_id field in each of those tables was left intentionally as a legacy structure, as it's common for implementations to use the field to represent current, or rather last known, location.

I would think the relationships of person to care site and person to provider can be inferred from visits. Are there specific use cases in mind? The only thing that occurs to me is changes to PCP over time.

gklebanov commented 5 years ago

Hi Robert,

The location_id field in each of those tables was left intentionally as a legacy structure, as it's common for implementations to use the field to represent current, or rather last known, location.

Since these are facts, why would this be any different from other facts such as drug exposure, conditions, observations etc..? It would be the same as saying let's store "last/current drug taken by a person" or "last/current observed condition". These facts can be obtained from fact tables, including the last known provider or care_site.

Don't get me wrong, I understand why people are trying to do that and that these are less volatile facts but these are facts nonetheless and I would prefer consistency vs. potential data inconsistency issues.

gowthamrao commented 5 years ago

@gklebanov are you advocating for removing the location_id from person, care_site, provider tables?

@pavgra could you please illustrate this point in a diagram. We are not educated in the theory of relational databases - so we will need some help understanding.

location_entity_id and location_entity_concept_id should be replaced with separate many-to-many tables to simplify SQL creation and use native RDBMS mechanisms to enforce data integrity

we get doubled link which leads to inconsistency:

rtmill commented 5 years ago

@gklebanov I share your concern and agree this is something that should be addressed. Specifically, the ohdsi tendency to force many-to-many relationships into one-to-one representations, which was touched on in (point 2) this post and was the motivation for proposing the location_history table.

Perhaps I'm too influenced by @cgreich but it doesn't seem reasonable to adapt the location_history structure to accommodate these relationships if there are no use cases for them, given the added complexity from an ETL perspective and that the content can typically be inferred from visits.

I would lean towards either further specifying these fields to logically force one-to-one (e.g. Person.primary_provider_id, Provider.primary_care_site_id) or in some cases remove them altogether (what is Person.care_site_id used for?).

pavgra commented 5 years ago

@gowthamrao ,

cgreich commented 5 years ago

@pavgra and @gklebanov:

Nominally you are right. There is no good reason to model the same thing twice, once as the "current" in the PERSON table, and another one through linking to the LOCATION_HISTORY table. While we are not doing that with the other event tables, there, everything is in the equivalent of the History table.

Except:

There really is a one-to-many relationship to Condition, Drug etc. And I mean MANY MANY. Most data have only one location, making it a de-facto one-to-one. Many have none.
Location is much less fluid than the clinical events. These change every day or even within a day. Locations change in years.
The consequences of information duplication is little. If somebody wants the geographic distribution today they will use PERSON, if they want to study the past or dynamics of moving around they will use the History table and union it to the location in PERSON. Works.

Bottom line: You are right, but making your change to get it right is much more costly than keeping it dirty, and no use cases suffer, which is our ultimate criterion. Cleanliness is not.

pavgra commented 5 years ago

@cgreich , I clearly get why proposed many-to-many links cost more than give, but what costs so much with dropping two columns from person and care_site? (in the way, which @gowthamrao described)

cgreich commented 5 years ago

@pavgra: Any tool/method/cohort definition would have to change.

pavgra commented 5 years ago

@cgreich , That's why it is called a major release. To bring in breaking changes. But the changes are clearly reasonable and valuable.

cgreich commented 5 years ago

Know what? Bring it on. Put out a CDM proposal, and we let the crowd decide. (Except the folks who come to that WG are ideologs, and might outweigh the common sense. I need a "directed democracy" there. :) )

cgreich commented 5 years ago

Or is this the proposal already?

pavgra commented 5 years ago

@cgreich, I believe, @gowthamrao 's post + explicit statement that location_id columns in person and care_site should be dropped = the proposal

cgreich commented 5 years ago

Right, but that's not debatable and votable at the WG. You need to make a proposal, explain it, show the use case or reason and pros/cons. Can't throw it over the fence, particularly if it is not straightforward.

pavgra commented 5 years ago

Removal of `location_id` column from `person` and `care_site`

Relevant tables:

person
care_site
location
location_history

Issue

There are bi-directional associations between person / care_site and location tables:

person.location_id → location ← location_history.location_id & location_history.entity_id → person
care_site.location_id → location ← location_history.location_id & location_history.entity_id → entity_id

Therefore there are two ways to get current location of person / care site: either get by location_id field or calculate via location_history.entity_id, and these two ways can give different results. Since location is a time bound fact and these are changing throughout person's life (also can change for care_site), there can be multiple locations per person / care_site. Therefore we have 1-to-many relationship, which should be implemented by foreign key in location_history table (it was already done via entity_id) and the second way of linking location to person / care_site should be removed to exclude issues with data inconsistency and have only a single source of truth.

Proposed solution

Drop fields:

person.location_id
care_site.location_id

Pros

Data consistency

Cons

Additional join to query current location

@cgreich , is this sufficient or anything else is required?

cgreich commented 5 years ago

No pros and cons, no use cases, but fine. Will bring it on.

gklebanov commented 5 years ago

Friends,

I would propose the following changes:

remove care_site_id from person
remove provider_id from person
remove location_id from person
merge location and location history into one table and rename it to be location
have a one-to-one...many relationship between person, care_site, provider and location(history)

Simple and practical. It would simplify things but yes - I know - many might initially struggle with this. And yes - it is a change.

Person, Care Site or Provider NEVER exist without an address and the opposite is definitely true. In modeling, it is called the "composition". Unless you are a real estate builder- the address would never exist by itself. Moreover, who cares if you have multiple address entries in this table (get rid of Rule 1) - it is just the address stamp, think White Book.

So, to determine the LATEST address, provider or care_site - someone would just literally get the last record. Not index required, just get all and sort by date and use the latest one

here is the pathetic PowerPoint based model;)

gowthamrao commented 5 years ago

What is the PK for the location_history table you are proposing? Where are the lat/long/street_name_1/city/county stored?

pavgra commented 5 years ago

Additionally to @gklebanov 's proposal I'd like to add that getting current location can be as simple as WHERE end_date = NULL. And as location & location_history are merged, there is even no need in additional join

gklebanov commented 5 years ago

@gowthamrao - here is my thinking:

In my view, the PK should be a system generated ID (location_id) and entity_id would be the "FK" to person, provider or care_site

The "location" - or, simply the physical address - is what is called a composite part of the person (or care_site, provider). It is not the first grade object in OMOP CDM model. What is important is that we track the addresses where those entities reside or resided.

We are not the US Census Bureau and thus do not need to maintain a perfectly clean address book, thus this rule should not even be there

"Each address or Location is unique and is present only once in the table."

Otherwise it will lead to a complex ETL process that has to cleanse every location, do fuzzy matching etc.. just to match a physical address to a person. Never mind, I doubt that this information is clean in raw or even always present. Then, that would also imply that we need to do the Many-Many relationships - just too complex and I do not think is needed for OMOP CDM use cases.

rtmill commented 5 years ago

The convention that you have issues with "Each address or Location is unique and present once only in the table" was meant as a conceptual convention for efficiency purposes and should not break anything if you ignore it.

remove care_site_id from person

remove provider_id from person

remove location_id from person

All for it. I could see an eventual need for a provider_history table to record changes to PCP over time, if that's a use case people have.

merge location and location history into one table and rename it to be location

Strongly against this.

Person, Care Site or Provider NEVER exist without an address and the opposite is definitely true. In modeling, it is called the "composition". Unless you are a real estate builder- the address would never exist by itself. Moreover, who cares if you have multiple address entries in this table (get rid of Rule 1) - it is just the address stamp, think White Book.

In the real world sure, but I've encountered countless sources of person, care site and provider without address data. The argument that a location does not exist without reference to one of these domains is the biggest difference in our approach.

Reasoning for leaving the two tables separate:

A location is not a relationship, it is a distinct entity that exists outside of how it is referenced or related to clinical data. In its essence, it is a coordinate pair that is a point on the globe.
Storing every relationship to a location as a location itself causes duplication and inefficiency
- Establishing equivalence between two geocoded locations is not difficult
  - IF (a.lat == b.lat & a.lon == b.lon) THEN equivalent
- Arguably the most common operation in this realm will be to translate between person and region. For bringing spatial data into the CDM, this has two main parts:
  - a) For a given region (polygon), find me all of the locations (points) within it
  - b) For this set of locations, find all the people that lived there at a given date
- The spatial join (point-in-polygon), in part a, is likely to be the most expensive operation. Forcing that operation on every relationship rather than the subset of unique locations is inefficient. Furthermore, part b is simple when indexes on the location_id within the location_history table are created
Regions and locations have attributes, independent of the CDM
- Region examples : poverty rate of county, air quality of tract, provider shortage area, etc
- Location examples: paper mill exists at [x,y], arsinic measurement for well water at [x,y] (this was the original use case for the GIS WG)
- Point being, while this isn't needed for your use case, the ability to independently reference locations, outside of the context of clinical data and how they relate to it, is neccessary. This is even more relevant with the functionality being developed by the metadata and annotations work group. @alondhe Correct me if I'm wrong but I believe we'll have the ability to annotate and create metadata for locations, which would need to be replicated for every relationship in this proposed model

cgreich commented 5 years ago

Friends:

I want to reopen the debate. Because it is so much fun (!) and because I changed my mind after some hardcore hairwashing from Greg. The problem was that we were constantly talking two simultaneous issues, and didn't really listen to each other. At least I might not have. They were:

Duplication of information
Denormalization or normalization of the locations

The advantage of @gklebanov and @pavgra's proposal is simplicity. We'd have one location table, and could phase out the location_id in the PERSON table over time. The disadvantage is what @rtmill points out. Essentially:

Assigning locations to polygons or region is compute-expensive, and you don't want to do that for the same location over and over again
Comparing locations for identity or calculating distances will be expensive for the same reason.

Here is a compromise: We take the denormalized approach @gklebanov sketched out, but for the compute-intensive stuff we can have an additional normalized reference table. The good thing about that is that all non geo-specific use cases it will be fast. And the little bit of space for repeating the same 3-letter zip - I don't care. Storage is cheap.

Thoughts?

rtmill commented 5 years ago

@cgreich I'm not sure I understand.

Assuming you're talking about adopting @gowthamrao 's new proposal, which I think makes sense, what else would change? Removal of location_id from person? If you're proposing representing locations as relations rather than distinct entities please elaborate.

cgreich commented 5 years ago

@rtmill:

No, no. @gowthamrao talks about the FACT_RELATIONSHIP table. He claims that the LOCATION_HISTORY table is kind of the same thing and we don't need it twice. But I am not talking about that. I am talking about the proposal @gklebanov threw in (above with the diagram of Person, Care Site, Provider and Location). It would be a one to one relationship of Locations to those three domains, with a person_id, care_site_id and provider_id in the LOCATION table. You had issues with that, because it would denormalize that table. The same 3-letter zip would be there a gazillion times, one for each patient, hospital and provider located in that 3-letter zip. So, I understand why you hate it, but I am proposing to solve your problem with an additional normalized LOCATION_REFERENCE or so table which would shrink the repetition down to what really is in the data.

We are really pressing the balloon: If we are trying to be efficient in one place we increase the problem in another, it seems. I am just trying to come out with the optimum.

rtmill commented 5 years ago

@cgreich

I changed my mind after some hardcore hairwashing from Greg

We are really pressing the balloon

I can't tell if you're making these up as you go or if you have an endless supply of obscure metaphors 😄

It would be a one to one relationship of Locations to those three domains, with a person_id, care_site_id and provider_id in the LOCATION table ...an additional normalized LOCATION_REFERENCE or so table which would shrink the repetition down to what really is in the data.

It's not obvious to me what advantages introducing one-to-one relationships and a LOCATION_REFERENCE table would have over LOCATION_HISTORY (and assumption of unique locations in LOCATION). You would still be replicating the static data of a location, where it exists on the globe.

Six of one, half dozen of the other. I'd be happy as a clam if you can shine some light on this 😄

pavgra commented 5 years ago

Let me also palp the balloon a bit 😄

While I've been hashing the geo polygon topic with @cgreich, following came to my mind:

A need for a hierarchy of areas (locations) is sounded both here and there. What is more, such hierarchy fits OMOP vocabularies really well. You have concepts (name of area and its type, valid start / end dates), concept types (city / county / country / etc), you have relations (city is in a country, etc), you have ancestry. So why to invent new tables while this can be easily put into existing polished vocabs mechanism? In this way there is no need for additional LOCATION_REFERENCE table, you just organize e.g. OSM data as an OMOP vocabulary and add admin_area_concept_id field to the location table. And the admin_area_concept_id could be filled e.g. during ETL process by calculating instercestion of the lat-lon from the location table and areas' polygons (e.g. taken from the same OSM). @cgreich , your thoughts?
We drop "location_id" field from "person" and "care_site" to make things non-ambiguous and apply @gklebanov 's proposal to reduce amount of tables (although it lacks normalization, it should be compensated by the geo vocabs and the linkage described above)

cgreich commented 5 years ago

@pavgra:

Looks like a thought through idea. Makes a lot of sense. But let's ask @rtmill and his geo friends: Would that work? And in particular: Do you think you can prespecify the polygons/regions? So we can put them in the vocab? Just as concepts? All the calculations and geometrical definitions would have to reside outside the CDM.

pavgra commented 5 years ago

Do you think you can prespecify the polygons/regions? So we can put them in the vocab? Just as concepts?

Why not? I'm looking at OSM database and see no issues since the administrative area name combined with hierarchy gives us uniqueness if I don't miss anything. Also, we can fill source_codes with IDs from OSM

All the calculations and geometrical definitions would have to reside outside the CDM.

Results schema should be good enough for that

rtmill commented 5 years ago

@rtmill and his geo friends:

Is that what they're calling us these days? 😄

@pavgra

This is something that has been discussed at length. See: http://forums.ohdsi.org/t/themis-topic-location-table-non-u-s-address-locations/3828/17 And : https://github.com/OHDSI/CommonDataModel/issues/91#issuecomment-370938656

In short, the way we've been thinking about storing locations that have limited data (e.g. only postal code) is to think of that location as 'a point somewhere within the region'. If the location record has coordinates, do a point-in-polygon calculation to find the region. If it only has zip, use an alternate matching approach (R has lots of great packages to go from postal code to region).

The idea of turning the different political regions into concepts is attractive and something we have also discussed. The problem is, the regions that are of interest are not always standardized and can be unique to the implementation. For instance, it wouldn't be practical to uniquely identify and store the specific hospital service areas for every health care organization. Say an investigator has geographic data on local opioid hot spots, etc. Point being, there is a need to allow the user to integrate their own custom polygons that can be displayed alongside political boundaries. Also, the definitions of regions change over time, the polygon for a given zip code can be different from one year to the next as they adjust the boundaries, not to mention numerous (for US alone~73,000 census tracts, ~42,000 zip codes, etc).

pavgra commented 5 years ago

The problem is, the regions that are of interest are not always standardized and can be unique to the implementation

Same as different medical vocabularies.

Also, the definitions of regions change over time, the polygon for a given zip code can be different from one year to the next as they adjust the boundaries, not to mention numerous

E.g. OSM provide updates on a regular basis and it's not hard to incorporate geo vocabs updates based on the OSM updates in the same way as it is done for existing medical vocabs.

Point being, there is a need to allow the user to integrate their own custom polygons that can be displayed alongside political boundaries.

I would propose to provide a user downloading geo vocabs with polygons table to allow doing all the spatial stuff he / she wants. But this polygon data, as @cgreich mentioned, should reside in results schema.

All in all, @rtmill , are you in support of my proposal?

cgreich commented 5 years ago

@rtmill:

I can't tell if you're making these up as you go or if you have an endless supply of obscure metaphors

That's when you use English as a second (or subsequent) language. You translate the metaphors you are familiar with. Which, of course, doesn't work, as I found out. But then - I realized it actually does work, because people go "What? What did you just say?" You get the full attention. The more animals in the metaphors, the better. :)

It's not obvious to me what advantages introducing one-to-one relationships and a LOCATION_REFERENCE table would have over LOCATION_HISTORY (and assumption of unique locations in LOCATION). You would still be replicating the static data of a location, where it exists on the globe. Six of one, half dozen of the other.

Exactly. Or "If you squish the balloon on one end...". Except: you need the LOCATION_REFERENCE only if you are doing these heavyweight calculations you guys are planning at analysis time. If you don't care because you have a big fat machine, or all you ever want to assign the locations to regions once at ETL times, you could drop it. Another argument is you say there are 70,000 polygons/regions, but there are tens or hundreds of millions of patients, so taking an extra one-to-many away from then and pushing them to the locations is a good idea.

The problem is, the regions that are of interest are not always standardized and can be unique to the implementation

Yep. That's what I was afraid of and what makes the conceptualization in the vocabulary a problem. But the question is how dynamic they are. This is also what @pavgra is asking. To have your own regions - no problem, everybody can have their own vocabs. But if they are like cohorts, and there is no way to standardize them, and they are popping up anew in any analytic run, that would be problem. Hospital service areas and opioid hot spots sound like the former. Are they?

It would be nice if it worked that way. The regions themselves would stop being data, but become reference data, and we already have tables for that, and we have a full machinery to apply proper lifecycle (for example Germany now one, was two). And we could publish nice standardized tools for the standard regions everybody uses.

Let us know.

pavgra commented 5 years ago

I feel that adding some diagrams and concrete examples would make it easier to understand my proposal, so here they are:

Changes summary:

Removed location_id from person and care_site
Merged location_history into location
Added geo vocabulary (only subset presented at diagram) and foreign key to the most specific admin area for a location - location.admin_area_concept_id

Notes:

City, State, Zip, County, Country stored in location could be derived from vocabularies so they could be dropped to remove redundancy, but it seems to be too big shift for now and too big headache for people who wants to grab whole address of a person, so the set of fields was left

rtmill commented 5 years ago

All in all, @rtmill , are you in support of my proposal?

I am not. The second link in my previous post contains an argument directly against this but I will reiterate here.

As @cgreich has stated, the relationship from location to region is not one-to-one. Forcing one-to-many relationships into one-to-one representations keeps biting us and, ironically, is at least partially the reason this issue thread was started. Which region type is assigned? What if some patients have zip and some have tract and some have hospital service area? It's messy
Regions do not always roll up cleanly. For a single region (e.g. hospital service area) you can overlap with a many regions in another layer (e.g. Census tract) and in varying proportions of overlap. Imagine your polygon intersects two tracts in varying levels of overlap, how would this be stored within the concept_relationship? Relationship_id = 'Overlap with region b. intersection 42% of a, 78% of b' ? Point being, a single reference to a polygon is not sufficient to be able to translate between the various layers which I'd argue defeats its purpose
Due to the limitations described in the previous point, I'd argue we should rely on geospatial joins as the source of our information on location-to-region and region-to-region relationships, including the degree of overlap
I believe a cleaner solution would be the assumption that the CDM is ignorant of regions. No region concepts, relationships, hierarchies, polygons, etc. We can keep the CDM clear of all of that since the LOCATION table provides the information (coordinates/zip) needed to derive region relationships externally
Given everything above, what would be gained from putting each unique region into concepts? It makes adding institution-specific regions more difficult, the relationships can not be properly represented in the current CDM, and, functionally, it is not necessary. That said, I could see it being useful if we eventually decide to swap out the fields in LOCATION to concepts (e.g. country_concept_id, etc)

To have your own regions - no problem, everybody can have their own vocabs. But if they are like cohorts, and there is no way to standardize them, and they are popping up anew in any analytic run, that would be problem. Hospital service areas and opioid hot spots sound like the former. Are they?

More like the former. Let's stick with the external opioid data example, in what circumstance would you a concept_id be handy? With one polygon field per location, you probably would have something more universal there. If you are producing a visualization, great, the data is already at the region level. If you are looking to do person-level analysis, import the data as person-level exposures (e.g. "Lived in high opioid usage area") which does not require an external reference to the region which the data was derived from.

pavgra commented 5 years ago

@rtmill ,

we should rely on geospatial joins

Get a look at my report on geospatial capabilities in the AEGIS issue. Not sure that it would be cool to say to half of the community that new functionality is not supported for their DBs and drop Atlas / SqlRender support for them. So we cannot really rely on spatial joins and functions (although I personally like PostGIS a lot)

what would be gained from putting each unique region into concepts? and, functionally, it is not necessary

I would not suggest smth just for fun. In the AEGIS issue, there is a requirement to implement geo criteria and region hierarchy browsing. That's why I proposed to introduce geo vocabs - not to re-invent the wheel but use established and well-tested mechanism of vocabularies, which would allow using relations, hierarchy, etc...

Relationship_id = 'Overlap with region b. intersection 42% of a, 78% of b'

If we talk about administrative areas, could you give an example of e.g. county which simultaneously appears in two states? If you talk about a hospital, it is a point (this is a record in location which characterized by lat-lon) and the point also can be assigned to only one administrative area. But if you want some custom logic - it's up to you how to organize such stuff (and I believe, vocab team faces similar 42%-78% issue in ATC vocab, which doesn't stop them from using the vocabs structure; maybe @cgreich can give an advice here)

We can keep the CDM clear of all of that Which region type is assigned?

Again, the proposal hasn't contained the sentence "let's drop all geo fields from the location table". It included just a single suggestion to add a concept_id field which would point to the lower level administrative area associated with the point. Again, taking a look at existing use-case - a single drug_exposure fact could also be associated with a lot of concepts from different medical vocabularies (you have multiple vocabs, therefore concepts, for a single drug). But what you do is you pick a standard concept and can then derive concept from any other vocab using concept_relationship.

rtmill commented 5 years ago

There are two options - deploy additional tool next to a source database to process queries like does the geometry contains point? (we cannot deploy such tool next to WebAPI because it will mean transferring of data out of the database which may break security constraints)

Ok! This makes a lot more sense now. I think if we figure this out we will be a lot closer to being on the same page. There has been an assumption in the GIS WG that we will be standing up a separate PostGIS database to handle all of this. Could you elaborate on the security constraints? For instance, how does this vary from other tools that pull from db like Achilles? Because it passes data into another db? I've been imagining another row in the SOURCE table with reference to the PostGIS db. On top of that we could preload and distribute an instance of a PostGIS db with required packages and functions (notably local geocoding).

If we talk about administrative areas, could you give an example of e.g. county which simultaneously appears in two states?

For that specific example, from the Census: "At the state level, most ZIP Codes deliver wholly within the state, but a few do deliver to out-of-state areas. At the county level, some ZIP Codes cross county boundaries, but most deliver wholly within the county. The ZIP Codes that are split by state or county, however, pose problems for coding by ZIP Code."

But the much more common case would be non-political boundaries. Yes, a hospital is a point, however, hospital service areas and provider shortage areas, for example, are polygons. Here's an example from dartmouthatlas with an HSA in multiple states:

The extent of overlap between the polygons is needed to interpolate regional values.

pavgra commented 5 years ago

@rtmill ,

most ZIP Codes deliver wholly within the state, but a few do deliver to out-of-state areas.

First of all, thanks, haven't known that. But, does it really affect the proposed stuff? - I was talking about administrative areas all the way. Do we need to care about ZIPs?

There has been an assumption in the GIS WG that we will be standing up a separate PostGIS database to handle all of this.

Let me repeat smth from the AEGIS issue. There are two options:

either use built-in DB spatial functions - but those are not supported by all DBs which are used by SqlRender (so by OHDSI)
or use separate component for geo calculations; no matter whether it is R or PostGIS or whatever, it 1) needs to be located next to CDM so that we don't transfer data over network, 2) some orchestration module would be required to organize interaction between CDM and the geo component

Neither one nor another sound very attractive. That's why we thought: why not to precalculate relations between person and admin areas in advance (e.g. during ETL)? Then you'll have foreign keys between person's location and administrative area, and won't need the complexity described above. Just run regular queries.

rtmill commented 5 years ago

The external (but local) database option is particularly attractive to us for a handful reasons:

Keep GIS structures outside of CDM
We need a local geocoder (assumption that PHI never leaves server)
We are going to build a tool to pick and choose which regional attributes you want to import to the person level. Without this, we'd have to do a straight ETL which would flood the exposure tables with unwanted data. To enable this, we need a staging DB to store the source data (ACS, EPA, NOAA, etc)

I've been thinking of the whole GIS functionality as like an optional OHDSI module or extension. Perhaps a call would help figure this out more efficiently?

pavgra commented 5 years ago

I think at this point, yes, it would be much more efficient to continue the discussion by voice

pavgra commented 5 years ago

Guys, let me try to continue the discussion, and first lay out what we (seems like) agreed on during the call (please correct me if I am wrong):

We don't want to turn OMOP CDM into GIS database so polygons and spatial functions should be outside and shouldn't be added to CDM
We do want to build cohorts using geo criteria, but we are squeezed between two limitations:
- we cannot apply geo criteria using spatial SQL since that is not supported by all community-used DBs;
- we cannot apply geo criteria as a separate step with some external component as we have cases with a very tight coupling of regular and geo criteria (e.g. we need to check whether administrative area of care site for certain visit occurrence fits given list of admin. areas; so we need to derive area from visit_occurrence, but the visit_occurrence_id wouldn't be available as result of built cohort - therefore, we wouldn't be able to use it in the second separate step)
- therefore, we go for pre-calculation of relations
point to polygon relations are many-to-many (not many to one) At least, due to possiblity to have areas represented by different coding systems and containing the same point
we are going to build geo vocabularies to use as geo reference data: concepts for areas, concept relationships and ancestor for their relations

Do we all agree to the list?

tagging @cgreich , @rtmill , @gowthamrao , @gklebanov

pavgra commented 5 years ago

Having said the things above, I want to put updated proposal from my side:

All of geo criterias for either cohort build or cohort characterization can be split into two groups:

Based on distance
Based on areas

Therefore, we need pre-calculation of following geo relations:

Location’s regions
- Associations of locations with regions (point in polygon) can be pre-calculated and stored in many-to-many relationship table - LOCATION_AREA
Distance between locations
- We need to know distance between a person and care sites. However, pre-calculation of all such relations would produce a cartesian product which, keeping in mind sizes of real life datasets, is too big amount of data to handle. Therefore, there is a suggestion to calculate distances between a person and all care sites located within certain range (e.g. 100 miles) from a person (making assumption that a person wouldn’t use care site located in 1000 miles from his / her location) and put results into LOCATION_DISTANCE table.

The tables for storing pre-calculated relations could be put into Derived tables section of OMOP.

After we have pre-calculated the relations above, we just add necessary joins using FKs into Cohort Definition SQL.

@cgreich / @rtmill , your thoughts?

gowthamrao commented 5 years ago

so polygons and spatial functions should be outside and shouldn't be added to CDM we are going to build geo vocabularies to use as geo reference data: concepts for areas, concept relationships and ancestor for their relations

osm as a concept?

but the visit_occurrence_id wouldn't be available as result of built cohort - therefore, we wouldn't be able to use it in the second separate step

@chrisknoll and I talked about this. We thought, given a use case - we might want to explore persisting qualified_events or inclusion_events (or recalculating them using a separate process)

Location_area, location_distance : I am just worried that the denormalized version will have significant data redundancy, maybe 1000+ times. In location_distance : recommend pre computing distance between every unique combination of location_id from person table and location_id from care_site table.

pavgra commented 5 years ago

@gowthamrao ,

osm as a concept?

yes, e.g. we could built OSM vocabulary where concept name would be equal to name field from planet_osm_polygon table

We thought, given a use case - we might want to explore persisting qualified_events or inclusion_events

Anyway this is not the one corner case which makes it too hard to split Cohort Definition and geo criteria calculation into separate steps. What is more, those are logically coupled, it would be not right to split them from applied perspective.

recommend pre computing distance between every unique combination of location_id from person table and location_id from care_site table.

Yes, this is the way I see it (see those fields marked in bold as PK)

gowthamrao commented 5 years ago

@pavgra I agree with this design, but want to make sure the precomputed tables don't get too big, does not have redundancy

gowthamrao commented 5 years ago

48030341-bc58ba00-e11e-11e8-8a52-76dd39700bc4__01

@pavgra @gklebanov @cgreich Could you please elaborate the rationale for denormalized location table being proposed, I know it is to combine the location and location_history - but am worried if this is the right design.

Here are some thoughts: does the table support all of this?

Location_id should only apply to person and care_site. Providers are nested within care_site, and care_sites have location_id I.e. provider_id should not need a direct link to location_id. So, why not create two tables (care_site_location and person_location) having proper referential integrity?
Location_id should be a unique street address. i.e. ideally we should follow the following steps

Take source address.
Standardize it's representation by using address standardization algorithms (e.g. how postal service does it).
Geocode the standardized address to lat and long.
Assign every unique combination of lat and long with a location_id. I.e. lat and long is natural key of a location table.

Multiple person's may have the same location_id (people living in the same house). In this teacher, it looks like each unique combination of person-domain-lat-long gets a location_id
Multiple care_sites may have the same location_id (e.g. Physician offices in the same building). Same as above
Person's and care_sites may change addresses. So a given person-location_id and care_site.location_id are valid for an interval of time.

pavgra commented 5 years ago

@gowthamrao , I agree with your concerns. The only thing, I believe, which is tricky - is two separate tables care_site_location and person_location. I personally would like those but we need to take into account the fact that as soon as a new entity with location arrives - we would need one more location linking table for it, so that schema would grow fast, plus it may be complicated for layman to get into the variety of tables.

Anyway, maybe we could proceed with the rest of proposed changes (geo vocabs, derived tables with pre-calculated geo relations and removal of location_id from person / care_site) so that we don't get stuck and move forward with AEGIS, and then we could proceed with figuring out a better approach for the location tables. Your thoughts?

gowthamrao commented 5 years ago

as soon as a new entity with location arrives - we would need one more location linking table for it,

Ok.

maybe we could proceed with the rest of proposed changes (geo vocabs, derived tables with pre-calculated geo relations and removal of location_id from person / care_site)

I am fine with proceeding with this design, after resolving this question. Is location_id a unique standardized geocoded street address? If yes, then location_id cannot be the pk of this new denormalized location table, because we are allowing duplication of addresses.

OHDSI / CommonDataModel

Bi-directional link between CARE_SITE and LOCATION_HISTORY #220

Removal of `location_id` column from `person` and `care_site`

Relevant tables:

Issue

Proposed solution

Pros

Cons

OHDSI / CommonDataModel

Bi-directional link between CARE_SITE and LOCATION_HISTORY #220

Removal of location_id column from person and care_site

Relevant tables:

Issue

Proposed solution

Pros

Cons

Removal of `location_id` column from `person` and `care_site`