OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel

Bi-directional link between CARE_SITE and LOCATION_HISTORY #220

Closed: pavgra closed this issue 1 year ago

pavgra commented 5 years ago

The care_site table contains a location_id field, while the rules for location_history also say that care_site can be a target domain:

The entities (and permissible domains) with related locations are: Persons (PERSON), Providers (PROVIDER), and Care Sites (CARE_SITE).

Therefore we get a duplicated link, which leads to inconsistency: which location is the true one?

gowthamrao commented 5 years ago

My recommendation is

  1. Keep the location table as is.
  2. Change location_history as follows:

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| location_history_id | Yes | integer | A unique identifier (primary key) for each location_history record. |
| person_id | No | integer | The unique identifier for the person (FK). |
| care_site_id | No | integer | The unique identifier for the care_site (FK). |
| location_id | Yes | integer | A foreign key to the location table. |
| relationship_type_concept_id | No | integer | The type of relationship between location and entity, e.g. home address, work address, vacation address. |
| start_date | Yes | date | The date the relationship started. |
| end_date | No | date | The date the relationship ended. |

I don't expect there will be other entities needing locations (e.g. providers), i.e. I don't anticipate this table expanding with new FKs. This structure allows for referential integrity in an RDBMS.
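For concreteness, a minimal DDL sketch of this shape (PostgreSQL-style syntax; the types, FK targets, and the one-entity check are illustrative assumptions, not part of the proposal):

```sql
-- Sketch of the proposed LOCATION_HISTORY structure (illustrative only)
CREATE TABLE location_history (
    location_history_id          INTEGER NOT NULL PRIMARY KEY,
    person_id                    INTEGER NULL REFERENCES person (person_id),
    care_site_id                 INTEGER NULL REFERENCES care_site (care_site_id),
    location_id                  INTEGER NOT NULL REFERENCES location (location_id),
    relationship_type_concept_id INTEGER NULL REFERENCES concept (concept_id),
    start_date                   DATE    NOT NULL,
    end_date                     DATE    NULL,
    -- exactly one of person_id / care_site_id should be set per row
    CHECK ((person_id IS NULL) <> (care_site_id IS NULL))
);
```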

gowthamrao commented 5 years ago

@pavgra

For your precalculated location_distance table, I recommend:

  • person_location_id
  • care_site_location_id (not entity_location_id)
  • distance
  • unit_concept_id (a concept_id, not a varchar)
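A matching DDL sketch (again illustrative; the numeric type is an assumption):

```sql
-- Sketch of the recommended LOCATION_DISTANCE shape (illustrative only)
CREATE TABLE location_distance (
    person_location_id    INTEGER NOT NULL REFERENCES location (location_id),
    care_site_location_id INTEGER NOT NULL REFERENCES location (location_id),
    distance              NUMERIC NOT NULL,
    unit_concept_id       INTEGER NOT NULL REFERENCES concept (concept_id) -- a concept, not a varchar
);
```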

gowthamrao commented 5 years ago

@pavgra why do we need vocabulary_id if area_concept_id is a concept_id in CDM vocabulary? We can find the vocabulary_id from the concept table.
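For reference, that lookup is a one-liner (the concept_id value below is hypothetical):

```sql
-- Derive the vocabulary from a given area_concept_id
SELECT c.vocabulary_id
FROM concept c
WHERE c.concept_id = 12345; -- some area_concept_id
```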


gowthamrao commented 5 years ago

On top of that we could preload and distribute an instance of a PostGIS db with required packages and functions (notably local geocoding).

@rtmill @pavgra is that possible? This sounds similar to EE. We should explore this, @pavgra, especially if it solves the most common shapes.

But the much more common case would be non-political boundaries. Yes, a hospital is a point; however, hospital service areas and provider shortage areas, for example, are polygons. Here's an example from the Dartmouth Atlas with an HSA spanning multiple states.

pavgra commented 5 years ago

@gowthamrao ,

why do we need vocabulary_id if area_concept_id is a concept_id in CDM vocabulary?

I want to state explicitly that for each vocab we can have only one area assigned to a location. That's why I made the PK vocabulary_id + location_id. Otherwise, nothing stops you from putting several area ids for a single location per vocab. And you don't need more than one: given the lowest-level area id (which is, e.g., a town), you can go up the concept hierarchy using concept_relationship / concept_ancestor and retrieve all the required areas.
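For illustration, a sketch of that traversal using the standard vocabulary tables (the concept_id below is hypothetical):

```sql
-- Walk up from a location's lowest-level area to all enclosing areas
SELECT anc.concept_id, anc.concept_name
FROM concept_ancestor ca
JOIN concept anc ON anc.concept_id = ca.ancestor_concept_id
WHERE ca.descendant_concept_id = 12345; -- hypothetical town-level area_concept_id
```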

pavgra commented 5 years ago

Ok, having summarized all the discussions with @gowthamrao , here is the updated proposal:

[diagram: updated proposal]

I agree with @gowthamrao that allowing duplicate locations would produce much more data in the pre-calculated tables; deduplication is also better from a data-purity perspective. And I want to be sure that @rtmill is happy.

@gowthamrao , please confirm that you are in support of the proposal (diagram).

On top of that we could preload and distribute an instance of a PostGIS db with required packages and functions (notably local geocoding).

After thinking about foreign data wrappers, the idea sounds interesting to me. We could define a separate source daimon for a spatial DB: if the CDM DB supports spatial functions, we specify its connection details; otherwise we bundle PostGIS, configure an external connection, and specify the PostGIS connection details as the daimon.
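As a rough illustration of the wrapper idea, assuming the CDM itself runs on PostgreSQL (other engines would need a different wrapper; all hosts, names, and credentials below are made up):

```sql
-- Inside the bundled PostGIS instance: expose the CDM's location table
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER cdm_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'cdm-db-host', dbname 'cdm', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER cdm_server
    OPTIONS (user 'cdm_reader', password 'secret');

-- Make cdm.location queryable next to the spatial tables
IMPORT FOREIGN SCHEMA cdm LIMIT TO (location)
    FROM SERVER cdm_server INTO public;
```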

gowthamrao commented 5 years ago

@gowthamrao , please confirm that you are in support of the proposal (diagram).

Yes, in support. The premise is that we can build area_concept_id with a hierarchy in the vocabulary tables and identify the lowest area level.

I do want us to support two kinds of area vocabulary_id: OSM and Census TIGER files.

https://www.census.gov/geo/maps-data/data/tiger-line.html

https://www.census.gov/geo/reference/hierarchy.html

gowthamrao commented 5 years ago

@rtmill

I would propose to provide a user downloading geo vocabs with a polygons table, to allow doing all the spatial stuff he/she wants. But this polygon data, as @cgreich mentioned, should reside in the results schema.

This seems important, we should try to allow custom areas in results schema.

Forcing one-to-many relationships into one-to-one representations keeps biting us and, ironically, is at least partially the reason this issue thread was started. Which region type is assigned? What if some patients have a zip, some a tract, and some a hospital service area? It's messy.

In @pavgra's solution, multiple vocabulary_ids may be used to identify the lowest_area_concept_id.

But take the hierarchy of the census vocabulary: which level is the lowest (when there is a tree structure with multiple terminal branches)?

[screenshot: census geography hierarchy]

We need a local geocoder (the assumption being that PHI never leaves the server).

Does bundling PostGIS help? Who has experience doing this?

We are going to build a tool to pick and choose which regional attributes you want to import at the person level. Without this, we'd have to do a straight ETL, which would flood the exposure tables with unwanted data. To enable this, we need a staging DB to store the source data (ACS, EPA, NOAA, etc.).

Makes sense. It's like the metadata work. We are storing area-level metadata for person-level analysis, e.g. percent college education in a county.

AEW0330 commented 5 years ago

We could define a separate source daimon for a spatial DB: if the CDM DB supports spatial functions, we specify its connection details; otherwise we bundle PostGIS, configure an external connection, and specify the PostGIS connection details as the daimon.

I find this idea appealing though I have no idea what bundling really involves here. Apart from that, I am unclear on a couple of points about the implementation. Do we know what spatial functions should be classified as requirements and used as the criteria for this bundle/no-bundle switch? We might need to enumerate those functions that get used in this build and add others we think will be needed to support common GIS use cases in the future. If we define spatial functions much beyond "can do optimized spatial or spatio-temporal joins", we might not expect there to be many CDM DBs that would get the no-bundle option.

cgreich commented 5 years ago

Friends:

Oh boy. This conversation is spinning.

  1. Why do we need a LOCATION_HISTORY table? Didn't we say long ago that we would have a non-normalized LOCATION table, which would take care of the history? And in it, we'd have a person_id (if it is a person we are locating) or a care_site_id (if that) and a provider_id.
  2. The latter you want to abolish. We can talk about it.
  3. I still don't understand why we need the vocabulary_id. The concept_id has a unique vocabulary_id. Having it here will create the contradictions @pavgra usually gets upset about.
  4. I thought we wanted to test the assumption that each location is linked to one low-level region, and the rest of the regions have a hierarchical relationship to that one in the vocabulary system.
  5. Do we really need to pre-calculate distances? Can't an analytic take, like, 10 minutes?

Look: I think all we need is the LOCATION table, which includes the history, as @gklebanov proposed, and a hierarchical system of regions in the vocabulary, as @pavgra thinks we can build. Let's test how that works. In parallel, @rtmill et al. are going to create a complex representation and test that. And then we compare.

Deal?

AEW0330 commented 5 years ago

point-to-polygon relations are many-to-many (not many-to-one), at least due to the possibility of areas represented by different coding systems containing the same point

Only if polygons of different types live in the same table. My sense from the Tufts GIS folks is that's not the typical way GIS services are supported when they accommodate multiple sources with different polygons. It might be good to make sure that is the way this will and should go. Their initial take was that we should have separate tables for each class of polygons, with associated precalculated variables. I think that is to ensure timely visualization, not analysis, @cgreich: a 10-minute wait to add a visualization layer is probably not tenable.

I assume that (separate tables per polygon class) is only needed when polygons don't roll up. There are plenty of health-relevant regional attributes that don't come in administrative or census units. We're still in the process of confirming whether that's their recommendation given the GIS WG use cases. I still don't have a clear enough sense of how those use cases differ from the AEGIS-to-ATLAS use cases to feel confident about their relevance here. But, as I wrote previously, I think this project might benefit from a clear sense of how metadata are used to drive the references and querying from disparate GIS data sources. If that metadata-driven strategy is "industry standard", it is bound to affect our thinking about what data gets standardized, where it lives, and what GIS functions are essential to support natively.

pavgra commented 5 years ago

@cgreich ,

  1. The reasons were described above:
    • performance: both the geo-relations pre-calculation and clustering would be much faster with deduped locations
    • normalization
  2. As soon as you have 2 or more geo vocabs, a single FK stops working: you'd have 2 low-level areas for one location, just because of the 2 vocabs. For the same reason the location_area.vocabulary_id field was proposed, to differentiate between lower-level area concepts for a single location (although you could derive the vocab from the concept, the field enforces that only one area concept per location per vocab can be present in the table; see the sketch after this list).
  3. Analytics cannot use geospatial functions. They are not supported by PDW, Redshift, or Impala. It's not just a matter of time; it's a matter of being able to build cohorts with geo criteria.
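A minimal sketch of that constraint (illustrative DDL; the varchar length just follows the usual vocabulary-table convention):

```sql
-- The composite PK allows at most one lowest-level area
-- per location per vocabulary
CREATE TABLE location_area (
    location_id     INTEGER     NOT NULL REFERENCES location (location_id),
    vocabulary_id   VARCHAR(20) NOT NULL REFERENCES vocabulary (vocabulary_id),
    area_concept_id INTEGER     NOT NULL REFERENCES concept (concept_id),
    PRIMARY KEY (location_id, vocabulary_id)
);
```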
cgreich commented 5 years ago

I assume that (separate tables per polygon class) is only needed when polygons don't roll up. There are plenty of health-relevant regional attributes that don't come in administrative or census units.

@AEW0330. Look. Why don't we do this: create a use case. And then we talk. We can always throw additional one-to-many linkers into the mix, but right now I am working off of a hierarchical region system where things roll up, as the default. Because today we do not have the ability to say "Stratify the cohort by US state".

performance (of both geo relations pre-calc and clustering) with deduped locations would be much better

I don't care about performance. We are not building an application. We are modeling the space. If you need to do some calculation, and you need to normalize the data to save time - go for it. We don't need that in the OMOP CDM. In the OMOP CDM we need the important relevant data as simple and compact as possible without the risk of creating contradictions.

As soon as you have 2 or more geo vocabs, a single FK stops working

I know. So far I have heard the use case "hospital area". Fine. That's not patient-level data we need to harmonize. That is external data.

Analytic cannot use geospatial functions.

That is a good point. But it is not limited to geospatial. The pregnancy model also cannot be run in an RDBMS; it is an R script today. Here, we do what we can do in an RDBMS. That's the point of this.

So, let me repeat. Let's do:

  1. The minimal model we (all of you, really) proposed, plus testing it out. We can use any data, even if it has only 3-digit zip. Use case: stratify by state.
  2. A proper geospatial model the way @rtmill and @AEW0330 are thinking of, plus testing it out. Do we have an OMOP CDM with potential geo information anywhere to test it? Use case: stratify by hospital region or by distance to care site.

Yes? :)

gowthamrao commented 5 years ago

@cgreich

Here is a suggestion: create a new schema (separate from the CDM, results, and vocabulary schemas) called DERIVED, designed for precalculated/precomputed/application-specific tables that are derived from the OMOP CDM. The advantage being that these would be managed by the application developers, not the CDM WG.

cgreich commented 5 years ago

Friends:

This is a war of attrition, it seems. :) Had another discussion with @pavgra and @gowthamrao, and I think we now are ready for a compromise. Here it is:

  1. The collapsing of LOCATION and LOCATION_HISTORY and the drop of location_id from PERSON and CARE_SITE will happen, but not now. Not urgent. We can postpone.
  2. We add region_concept_id to LOCATION and introduce the geo vocabs. This is for standard regions.
  3. We add the CUSTOM_AREA for @gowthamrao's dynamic areas.
  4. We add the LOCATION_DISTANCE table for precalculated distances. It will contain only Person-Care Site pairs that exist in the data, which will make the size palatable. (A population sketch follows below.)
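A rough sketch of points 2 and 4 (illustrative only; it assumes latitude/longitude columns on LOCATION as in CDM v6, approximates Person-Care Site pairs via visit records, and uses a naive planar distance as a stand-in for a real calculation):

```sql
-- Point 2: standard regions on LOCATION
ALTER TABLE location ADD COLUMN region_concept_id INTEGER REFERENCES concept (concept_id);

-- Point 4: LOCATION_DISTANCE restricted to pairs that co-occur in the data
INSERT INTO location_distance (person_location_id, care_site_location_id, distance, unit_concept_id)
SELECT DISTINCT
       pl.location_id,
       cl.location_id,
       SQRT(POWER(pl.latitude - cl.latitude, 2) + POWER(pl.longitude - cl.longitude, 2)),
       0 -- placeholder; a real build would reference a distance-unit concept
FROM visit_occurrence vo
JOIN person p     ON p.person_id = vo.person_id
JOIN care_site cs ON cs.care_site_id = vo.care_site_id
JOIN location pl  ON pl.location_id = p.location_id
JOIN location cl  ON cl.location_id = cs.location_id;
```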
pavgra commented 5 years ago

Two edits:

@cgreich , please confirm

gklebanov commented 5 years ago

@cgreich - how do we get this proposal out on the table and have it ratified?

pavgra commented 5 years ago

@cgreich , could you tell what's the ETA for:

  1. We add region_concept_id to LOCATION
  2. We add the LOCATION_DISTANCE table for precalculated distances. It will contain only Person-Care Site pairs that exist in the data, which will make the size palatable.

?

clairblacketer commented 5 years ago

@gklebanov @pavgra we can put this on the agenda for the April meeting

cgreich commented 5 years ago

@pavgra:

  1. As @clairblacketer said: on the agenda for next month. However, if you want to risk it, go ahead and implement it already. I don't see any issues.
  2. That will induce significant discussion. Even though you don't do the entire Cartesian product, there are still numerous Care Sites for each patient, multiplying the size of the PERSON table. And then we have the Location History. Is that in there as well? I know you want to precalculate this so it needn't be noodled at analysis time, but again, folks will have opinions.
pavgra commented 5 years ago

@cgreich :

there are still numerous Care Sites for each patient, multiplying the size of the PERSON table

I doubt that every patient present in a database will have had visits to all care sites.

And then we have the Location History. Is that in there as well?

It is relevant only to the process of deriving the relations between a person and care sites.

cgreich commented 5 years ago

Well, one way to help the discussion is to count the number of Care Sites per patient in a database or so, so we know what the damage would be.
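A sketch of that count, approximating "care sites per patient" via visit records:

```sql
-- Average number of distinct care sites per patient
SELECT AVG(n_care_sites) AS avg_care_sites_per_person
FROM (
    SELECT person_id, COUNT(DISTINCT care_site_id) AS n_care_sites
    FROM visit_occurrence
    WHERE care_site_id IS NOT NULL
    GROUP BY person_id
) per_person;
```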

It is relevant only to the process of deriving the relations between a person and care sites

You said that. And the person has a Location History. So, if your use case is to use the distance from the Person's home to the Care Site for some analytic, it would be a different place before and after the Person moved. So the total size is (# Persons) × (average # Care Sites) × (average # Locations per person).
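(For scale, with made-up numbers: 1 million Persons × an average of 5 Care Sites × 2 Locations each would already be 10 million rows.)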

What is your use case anyway?

pavgra commented 5 years ago

From https://github.com/OHDSI/WebAPI/issues/649#issuecomment-430848222:

An example in Atlas/circe-be that should be supported at both cohort entry events (qualified events) and inclusion criteria (inclusion events): where the care_site.location_id is within x units of distance from person.location_id (Euclidean distance between two points defined by lat/long).

I am not aware of other proposed solutions that solve the problem in a way that works for all OHDSI-supported DBs. If you have one, it would be nice to hear it.

what the damage would be

Even though the numbers are going to be quite big, searching by distance intervals is quite efficient both in relational DBs (covered by a B+ tree, close to logarithmic complexity) and in column-oriented ones (due to compression). So this should not differ much from a domain-table lookup by some concept id.
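For instance, a geo criterion against the precalculated table reduces to an ordinary range predicate (the ids below are hypothetical; an index on (care_site_location_id, distance) keeps it a range scan):

```sql
-- Persons whose home location is within 10 (units) of a given care site's location
SELECT DISTINCT p.person_id
FROM person p
JOIN location_distance ld ON ld.person_location_id = p.location_id
WHERE ld.care_site_location_id = 42 -- hypothetical care site location
  AND ld.distance <= 10;
```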

rtmill commented 5 years ago

@gowthamrao @pavgra @cgreich @AEW0330

I am very sorry for dropping the ball on this conversation. No idea how I missed the comments back in November.

want to be sure that @rtmill is happy.

Finally somebody with their priorities straight! 😄

Guys, this is a complicated topic. We are trying to integrate a whole new realm of content, both in terms of data (non-person-centric) and functionality (requirements not universally supported by all DBMS flavors). These conversations are incredibly productive and I hope they continue, and I apologize for the opacity and snail's pace of the GIS WG (we're getting there), but forcing the above implementation into the CDM would, in my opinion, be a step backwards.

I understand there are deadlines to consider here, but this seems to be a shortsighted fix for a small handful of use cases. These sprints are invaluable for figuring out the details of this project, but do we need to rip apart the CDM to do them? If you agree that the above proposal is not a long-term extensible strategy, then we are adding to the CDM now just to tear it down later.

Food for thought: what if we consider it to be a GIS 'extension', at least while we work on development? The CDM doesn't get modified, and we segregate all GIS tables into a different schema or database. We could then add the connection information and type to source and source_daimon (or a similar approach) and build the functionality into ATLAS just the same. When we need specific calculations, at least those that I see in your use cases, we can always leverage R (distance 1, distance 2, location to region).

Sincerely yours, Stick-in-the-mud

pavgra commented 5 years ago

@rtmill , a couple of questions:

I believe these questions were raised a while ago and there were no answers from you. So either we can build and evaluate what is simple and works, or we can wait until who knows when for a theoretically ideal solution.

rtmill commented 5 years ago

@pavgra We have been focused on other portions of the project (the schema and staging data, specifically), but I can take a dive into alternate approaches if that's what you're asking for.

I'll need to do my homework and get more familiar with circe and atlas. Could you provide any more information about why R isn't an option beyond:

Limitations of geo criteria executed in R: if we imagine execution of geo criteria in R, the following issues come up:

  • Geo criteria can be placed only into the "Inclusion criteria" section, since they must be processed separately (by R).
  • Geo criteria cannot be nested or nest other criteria, since they must be processed separately (by R).
  • Since geo criteria require processing by a separate component, we cannot really embed them into a circe expression: any change in the circe expression (cohort JSON) would require reflection in the exported SQL.

These limitations are not acceptable.

Is it a matter of functionality or design principles? As in, if we brewed up a solution that did use R, would the atlas folks be amenable?

cgreich commented 5 years ago

I need two proposals, one for each. The region_concept_id one I get (even though it's flimsy); we can bring it up and ratify it at the next meeting, and I am sure it will go through.

The LOCATION_DISTANCE one I still need. Don't get me wrong, I am not undermining it. All I am saying is that there will be more discussion.

pavgra commented 5 years ago

@cgreich , what questions still remain? I thought I'd explained why there shouldn't be performance concerns and why it doesn't differ from other domain tables. In other words, what's still required?

cgreich commented 5 years ago

Nothing. I am not the one to argue with. We will bring it to the CDM WG, and that's where you have to defend it. But you need to submit the proposals to @clairblacketer as issues on the Common Data Model repo.

cgreich commented 5 years ago

Also: we need an instruction sheet with the proposal to explain how that information is used, and how an ETLer determines, for each LOCATION record, the correct region_concept_id by calculating which polygon a point (address, zip code, 3-digit zip) falls in. Makes sense?
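As a sketch of what such an instruction sheet could describe, assuming PostGIS and a hypothetical region_polygon table (region_concept_id, geom) holding the geo-vocabulary shapes, plus latitude/longitude columns on LOCATION:

```sql
-- Assign each LOCATION the region whose polygon contains its point
-- (region_polygon is a hypothetical staging table, not a CDM table)
UPDATE location l
SET region_concept_id = rp.region_concept_id
FROM region_polygon rp
WHERE ST_Contains(rp.geom,
                  ST_SetSRID(ST_MakePoint(l.longitude, l.latitude), 4326));
```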

pavgra commented 5 years ago

The first part - https://github.com/OHDSI/CommonDataModel/issues/252

pavgra commented 5 years ago

The second part - https://github.com/OHDSI/CommonDataModel/issues/253

clairblacketer commented 3 years ago

What is the outcome of this discussion? It looks like it split into the region_concept_id proposal (#252), which has been accepted and will be added to v6.1, and the location_distance proposal (#253). Can this one be closed?