data-liberation-project / epa-rmp-coordination

A repository to track efforts to publish the EPA RMP database and associated documentation.

Identify a way to deduplicate accidents reliably #5

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

The RMP submission guidelines appear to ensure that the same accident will sometimes appear in multiple filings. It would be great if we could find a way to reliably identify/deduplicate them. The simplest approach would be to look for literally-duplicate accident-history rows ... but what happens if a facility slightly tweaks the description or details of a given accident between filings?

alexisraykhel commented 1 year ago

For the description question, we can use a few different techniques to assign similarity scores, flag entries as essentially the same, or fully deduplicate them (depending on how prescriptive we want to be). Ideas, from simplest to most complex:

  1. Only find exact matches as you said and don't worry about anything else.
  2. Use fuzzy matching (https://github.com/seatgeek/thefuzz) to find entries where most fields match exactly and the text fields almost match at some confidence threshold (we can experiment to see which threshold makes sense). I've used this method before for other kinds of matching/finding synonyms, and it's pretty good.
  3. Use entity resolution via Zingg (https://www.zingg.ai/documentation-article/the-what-and-why-of-entity-resolution). I haven't used this library before but have heard good things. It's a lot like option 2, but with more sophistication and some machine learning involved.
  4. Use Universal Sentence Encoder (USE) embeddings, which are based on large pretrained language models and measure a general distance between texts. This is more accurate than fuzzy matching, but it also takes a lot more compute and is more complicated to set up initially.

For any of these, we would still use the non-free-text metadata to determine whether a pair is a close match, and then use one of these techniques to determine how closely the free text matches.
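As a rough sketch of option 2 combined with that metadata check, here's what a fuzzy-plus-metadata comparison might look like, using the standard library's difflib as a stand-in for thefuzz's ratio scoring (the field names and threshold below are hypothetical, not from the RMP schema):

```python
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    """Rough 0-100 similarity score, analogous to thefuzz's ratio()."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same_accident(row_a: dict, row_b: dict, threshold: float = 85) -> bool:
    """Candidate-duplicate check: exact match on the structured fields,
    fuzzy match on the free-text field. Field names are hypothetical."""
    if row_a["facility_id"] != row_b["facility_id"]:
        return False
    if row_a["accident_date"] != row_b["accident_date"]:
        return False
    return text_similarity(row_a["description"], row_b["description"]) >= threshold
```

In practice the threshold would be tuned against hand-labeled pairs, as suggested above.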

jsvine commented 1 year ago

Thanks, @alexisraykhel! This is a very handy overview. In the case of accidents, there won't be a ton of freeform text to work with, unfortunately; mostly just numeric and boolean values (see example entry below), plus a list of chemical IDs and quantities. But developing some "similarity" metric to compare potentially-duplicate accidents may be useful. First step, perhaps, is to gather some examples of entries that really do seem like the same event but have slightly different data.

Here's that example entry:

| column | value |
| --- | --- |
| AccidentHistoryID | 1000076897 |
| FacilityID | 1000080757 |
| AccidentDate | 2022-02-21 00:00:00 |
| AccidentTime | 0552 |
| NAICSCode | 325211 |
| AccidentReleaseDuration | 00002 |
| RE_Gas | 0 |
| RE_Spill | 1 |
| RE_Fire | 0 |
| RE_Explosion | 0 |
| RE_ReactiveIncident | 0 |
| RS_StorageVessel | 0 |
| RS_Piping | 0 |
| RS_ProcessVessel | 1 |
| RS_TransferHose | 0 |
| RS_Valve | 0 |
| RS_Pump | 0 |
| RS_Joint | 0 |
| OtherReleaseSource | (None) |
| WindSpeed | 1.0 |
| WindSpeedUnitCode | a |
| WindDirection | W |
| Temperature | 40.0 |
| StabilityClass | (None) |
| Precipitation | 0 |
| WeatherUnknown | 0 |
| DeathsWorkers | 0 |
| DeathsPublicResponders | 0 |
| DeathsPublic | 0 |
| InjuriesWorkers | 0 |
| InjuriesPublicResponders | 0 |
| InjuriesPublic | 0 |
| OnsitePropertyDamage | 0.0 |
| OffsiteDeaths | 0 |
| Hospitalization | 0.0 |
| MedicalTreatment | 0.0 |
| Evacuated | 0.0 |
| ShelteredInPlace | 0.0 |
| OffsitePropertyDamage | 0.0 |
| ED_Kills | 0 |
| ED_MinorDefoliation | 0 |
| ED_WaterContamination | 0 |
| ED_SoilContamination | 0 |
| ED_Other | (None) |
| InitiatingEvent | b |
| CF_EquipmentFailure | 0 |
| CF_HumanError | 1 |
| CF_ImproperProcedure | 0 |
| CF_Overpressurization | 0 |
| CF_UpsetCondition | 0 |
| CF_BypassCondition | 0 |
| CF_Maintenance | 1 |
| CF_ProcessDesignFailure | 0 |
| CF_UnsuitableEquipment | 0 |
| CF_UnusualWeather | 0 |
| CF_ManagementError | 0 |
| CF_Other | (None) |
| OffsiteRespondersNotify | No, not notified |
| CI_ImprovedEquipment | 0 |
| CI_RevisedMaintenance | 0 |
| CI_RevisedTraining | 1 |
| CI_RevisedOpProcedures | 0 |
| CI_NewProcessControls | 1 |
| CI_NewMitigationSystems | 0 |
| CI_RevisedERPlan | 0 |
| CI_ChangedProcess | 0 |
| CI_ReducedInventory | 0 |
| CI_None | 0 |
| CI_OtherType | (None) |
| CBI_Flag | 0 |

alexisraykhel commented 1 year ago

Ooh, I see what you mean. When I hear the word "description," my mind goes to text first. I think we can still work out some kind of similarity score, as you said, though.

alexisraykhel commented 1 year ago

@jsvine In the above example, were you getting that table from RMPData --> 'AccidentChemicals', 'AccidentChemicals_ChangeHistory', 'AccidentHistory', 'AccidentHistory_ChangeHistory'? I'm looking into deduping, or at least assigning a score, but am wondering which table(s) to focus on. I think all four of these could contain dupes, based on the discussion above.

jsvine commented 1 year ago

Ah, I should have clarified that above. In that example, the data is coming directly (and only) from tblS6AccidentHistory. Each row in that table represents an accident in a given submission's five-year accident history.

To join it to the corresponding facilities, you'll also need the EPAFacilityID from tblS1Facilities, joined on FacilityID (a misnomer of a column that actually represents a submission, not a facility).

jsvine commented 1 year ago

A potential lead: The resources on this EPA page on a proposed regulation update include a technical background document (EPA-HQ-OLEM-2022-0174-0066_content.pdf) and Excel workbook (EPA-HQ-OLEM-2022-0174-0065_content.xlsx) that appear to use and enumerate a list of unique accidents in the RMP database from 2004 to 2020.

Some observations:

> After filtering by what EPA considers a reportable accident, EPA reports 2,493 accidents. EPA removed four duplicates as well as 57 accidents which reported “Environmental Impacts Other” but listed “None” in the open text box field. These 57 accidents did not have any other impacts reported, resulting in a final accident database of 2,436 accidents between 2004 and 2020.
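A sketch of what that exclusion rule might look like in code, assuming accident rows as dicts keyed by the tblS6AccidentHistory column names shown in the example entry above (the impact-flag list is illustrative, not EPA's exact definition of a reportable accident):

```python
def is_substantive(acc: dict) -> bool:
    """Drop rows whose only reported 'impact' is ED_Other == 'None'
    (the 57 accidents EPA excluded). Column names follow the
    tblS6AccidentHistory example; the flag list is partial/illustrative."""
    impact_flags = [
        "DeathsWorkers", "DeathsPublicResponders", "DeathsPublic",
        "InjuriesWorkers", "InjuriesPublicResponders", "InjuriesPublic",
        "ED_Kills", "ED_MinorDefoliation",
        "ED_WaterContamination", "ED_SoilContamination",
    ]
    if any(acc.get(flag) for flag in impact_flags):
        return True
    # "Environmental Impacts Other" counts only if the open text box
    # holds something other than "None" or an empty value.
    return acc.get("ED_Other") not in (None, "(None)", "None", "")
```

I've emailed EPA to confirm the actual technique (see the comment below about that inquiry), so treat this as a guess at their rule.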

jsvine commented 1 year ago

I've now also sent the EPA an email regarding the observation directly above, asking whether they can clarify or provide the technique they used.

kylebutts commented 1 year ago

To chime in: I've used splink (https://moj-analytical-services.github.io/splink/index.html) in personal work and it's really great. It lets you define a set of categorical variables describing the quality of a link between two items (e.g., FacilityID matches exactly, AccidentDate values fall within 3 days of one another) and then train a prediction model on features of the data.

You could train the model on exact matches, which would teach it the parameters it needs to make predictions in cases with non-exact matches. Would be happy to help with this!
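Without reproducing splink's exact API, the comparison-level idea can be sketched in plain Python. The levels and field names below are hypothetical, chosen to mirror the FacilityID / AccidentDate example:

```python
from datetime import date

def comparison_level(a: dict, b: dict) -> str:
    """Assign a categorical 'comparison level' to a candidate pair,
    in the spirit of splink's comparisons. Field names are hypothetical."""
    if a["EPAFacilityID"] != b["EPAFacilityID"]:
        return "different_facility"
    gap = abs((a["AccidentDate"] - b["AccidentDate"]).days)
    if gap == 0:
        return "same_facility_same_date"
    if gap <= 3:
        return "same_facility_within_3_days"
    return "same_facility_far_apart"
```

A probabilistic linkage model would then estimate, per level, how much more likely that level is among true duplicates than among non-duplicates.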

jsvine commented 1 year ago

Thanks, @kylebutts! I haven't seen that package before, but it seems helpful indeed. If you'd like to take a crack at this, that'd be wonderful.

One thing that might help with training: Within an individual submission, that submission's accident list should already not contain any duplicates. (The duplicates only arise because of the overlap in the accident histories submitted by the same facility over time.)
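That observation suggests one way to generate labeled training pairs: within-submission pairs are guaranteed non-duplicates, and exact-match pairs across submissions from the same facility are near-certain duplicates. A sketch, with hypothetical field names:

```python
from itertools import combinations

def labeled_pairs(accidents):
    """Yield (a, b, label) training pairs from accident rows.

    Rows are dicts with (hypothetical) keys: submission_id,
    epa_facility_id, and a 'fields' tuple of comparable values.
    Within one submission, no duplicates exist -> label 0.
    Across submissions of the same facility, identical fields -> label 1.
    """
    for a, b in combinations(accidents, 2):
        if a["epa_facility_id"] != b["epa_facility_id"]:
            continue
        if a["submission_id"] == b["submission_id"]:
            yield a, b, 0  # guaranteed distinct accidents
        elif a["fields"] == b["fields"]:
            yield a, b, 1  # exact match across submissions
```

Cross-submission pairs that match only partially would be left unlabeled, for the trained model to score.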

jsvine commented 1 year ago

Good news! I think I sorted this out, and it ended up being simpler than I had anticipated. Here's the relevant commit: https://github.com/data-liberation-project/epa-rmp-spreadsheets/commit/051bd25e53bf07e29e19fdd19f80acc421b121bd

And the methodology is described here: https://github.com/data-liberation-project/epa-rmp-spreadsheets#output-files

Closing this issue, but please do feel free to chime in with thoughts, or if you observe any problems with the methodology and/or implementation.