covidcaremap / covid19-healthsystemcapacity

Open geospatial work to support health systems' capacity (providers, supplies, ventilators, beds, meds) to effectively care for rapidly growing COVID19 patient needs
https://www.covidcaremap.org
MIT License

Implement better matching for facility information #70

Open lossyrob opened 4 years ago

lossyrob commented 4 years ago

The goal of this issue is to implement a better matching algorithm for the facility-level processed data.

We use this data to produce the CovidCareMap US Healthcare System Capacity data on a facility level, which is then rolled up to the regional levels.

Currently (at commit b006476), the matching is implemented in spatial_join_facilities and run in the notebook Merge_Facility_Information.

The way this is implemented is as follows:

This methodology is not ideal in that:

This issue is to generate a new matching method that improves what we currently have.

There are other libraries that solve the scoring problem; one that I've seen used successfully is [dedupe](https://github.com/dedupeio/dedupe).

CovidCareMap.org is currently matching DH and HCRIS data. However, there are other datasets we want to bring in (HIFLD being the first). Ideally, this matching enhancement would be able to join any number (N) of facility datasets.
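One way to get from pairwise matches to an N-dataset join is to treat each match as an edge and group connected records into clusters. The sketch below is only an illustration of that idea using a small union-find; the `cluster_matches` function and the record identifiers are hypothetical and are not part of the repository.

```python
from collections import defaultdict


def cluster_matches(pairs):
    """Group pairwise matches into clusters of records that refer to one facility.

    Each record is a hypothetical (dataset_name, record_id) tuple, and each
    pair is a match produced by some pairwise matcher.
    """
    parent = {}

    def find(x):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(set)
    for record in list(parent):
        clusters[find(record)].add(record)
    return list(clusters.values())


# Made-up identifiers, for illustration only.
pairs = [
    (("DH", 1), ("HCRIS", "a")),
    (("HCRIS", "a"), ("HIFLD", 100)),
    (("DH", 2), ("HIFLD", 200)),
]
clusters = cluster_matches(pairs)
for cluster in clusters:
    print(sorted(cluster, key=str))
```

A caveat with this design is that one bad pairwise match can chain two distinct facilities into the same cluster, so clusters containing more than one record from the same dataset are worth reviewing by hand.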

simonkassel commented 4 years ago

I'll take this on today

daveluo commented 4 years ago

Thanks @simonkassel!

> CovidCareMap.org is currently matching DH and HCRIS data. However, there are other datasets we want to bring in (HIFLD being the first). Ideally, this matching enhancement would be able to join any number (N) of facility datasets.

Trying to match facilities from HIFLD to DH and HCRIS would be a great stretch goal. We're probably going to need to add in HIFLD data (per https://github.com/covidcaremap/covid19-healthsystemcapacity/issues/49#issuecomment-603893399) soon enough so this would help enable that.

simonkassel commented 4 years ago

Looks like part of the problem is that the coordinates in the DH data are not great. I'm re-geocoding the DH data using the same process as HCRIS and comparing the results. Here's one example:

[Images: m1, m2, m3]

The original location is shown in orange and the re-geocoded location in purple.
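To compare original and re-geocoded coordinates across a whole dataset rather than one map at a time, one option is to compute the great-circle displacement for each facility and flag the large ones. This is a minimal sketch under assumptions: the record layout (`id`, `orig`, `regeo`), the 1 km threshold, and the coordinates are all made up for illustration and do not come from the DH data.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def flag_moved(records, threshold_km=1.0):
    """Return (id, displacement_km) for records whose re-geocoded point moved
    more than threshold_km away from the original point."""
    moved = []
    for rec in records:
        dist = haversine_km(*rec["orig"], *rec["regeo"])
        if dist > threshold_km:
            moved.append((rec["id"], dist))
    return moved


# Made-up coordinates, for illustration only.
records = [
    {"id": "A", "orig": (37.0, -120.0), "regeo": (37.0, -120.0)},
    {"id": "B", "orig": (37.0, -120.0), "regeo": (37.1, -120.0)},
]
moved = flag_moved(records)
for facility_id, dist in moved:
    print(f"{facility_id} moved {dist:.1f} km")
```

Sorting the flagged facilities by displacement gives a short list of candidates to inspect on the map first.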

simonkassel commented 4 years ago

The notebook that generates those maps is here: https://github.com/simonkassel/covid19-healthsystemcapacity/blob/sk/facility-matching/notebooks/processing/01B_Mapping_Facilities.ipynb

simonkassel commented 4 years ago

@lossyrob @daveluo I've been working away at this and have made some progress. The notebook where I do it is here, and most of the underlying logic is here. I'm using a combination of string similarity and distance matching to find plausible pairs.
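For readers unfamiliar with the general approach, combining string similarity with a distance cutoff can be sketched roughly as follows. This is not the code from the linked notebook; it is an illustrative stand-in using only the Python standard library, and the function names, field names, thresholds (`max_km`, `min_sim`), and sample facilities are all assumptions.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_matches(left, right, max_km=1.0, min_sim=0.6):
    """For each left record, pick the most similarly named right record
    within max_km, provided its similarity is at least min_sim."""
    matches = []
    for lrec in left:
        best = None
        for rrec in right:
            dist = haversine_km(lrec["lat"], lrec["lon"], rrec["lat"], rrec["lon"])
            if dist > max_km:
                continue  # too far apart to be a plausible pair
            sim = name_similarity(lrec["name"], rrec["name"])
            if sim >= min_sim and (best is None or sim > best[1]):
                best = (rrec["id"], sim, dist)
        if best is not None:
            matches.append((lrec["id"],) + best)
    return matches


# Made-up facilities, for illustration only.
left = [
    {"id": "dh-1", "name": "General Hospital", "lat": 37.0, "lon": -120.0},
    {"id": "dh-2", "name": "VA Medical Center", "lat": 37.5, "lon": -120.5},
]
right = [
    {"id": "hcris-1", "name": "GENERAL HOSPITAL INC", "lat": 37.001, "lon": -120.001},
    {"id": "hcris-2", "name": "Riverside Clinic", "lat": 37.5, "lon": -120.5},
]
matches = find_matches(left, right)
for left_id, right_id, sim, dist in matches:
    print(left_id, right_id, round(sim, 2), round(dist, 2))
```

The nested loop is quadratic, which is fine for a sketch; on full datasets a spatial index or a per-state partition would cut down the number of comparisons.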

I'm finding matches for about 85% of the facilities within each dataset. I have been looking over them, and they're not all perfect, but they're pretty close. I'm not sure the remaining 15% are necessarily the fault of the matching process; there just seem to be a number of discrepancies between the two datasets. For example, one of them (I think DH) seems to have all the VA facilities but the other doesn't, and there are lots of cases in which there doesn't seem to be any logical DH pair for a correctly geocoded facility in the HCRIS dataset (or vice versa). Sometimes it is even difficult to tell whether two records are a match, because of the complexity of the medical facilities and questions such as whether one complex is multiple hospitals.

I will post some examples below, but I have been examining them using folium maps that I created for each state (it didn't seem to be able to render the full datasets). They are here. I'm not sure if there is an easy way to host them, but let me know if you think there is a good way to do it and whether it would be useful.

I can continue to tinker with this but would be curious to know if you see a best way to proceed from here.

simonkassel commented 4 years ago

Here's an example from central California: the purple marker is in the HCRIS dataset and the orange one is in DH, and they correctly did not match. If you zoom in, you can see they are at different hospitals, but there is no credible match for either in the other dataset.

[Images: three map screenshots]

In this case, the lower point is a match (two points on top of each other), but the orange marker is a VA facility that doesn't seem to be included in HCRIS.

[Images: two map screenshots]

One more: see the two distantly connected points. They are both in the same network (CPMC), and there are two other centers that match with each other elsewhere in the city. So are these different facilities, or is one an administrative address or something? Kind of hard to say.

[Images: two map screenshots]