data-liberation-project / epa-rmp-coordination

A repository to track efforts to publish the EPA RMP database and associated documentation.
1 stars 0 forks source link

Create first set of novice-friendly spreadsheets #4

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

From the documentation draft, here are some spreadsheet typologies to aim for:

Ideally, these would be defined as SQL queries, so that they could work both for spreadsheet generation and for the Datasette instance. But if anyone would like to start cranking on this in Python/R/etc., that seems totally fine/reasonable for generating useful prototypes.

jsvine commented 1 year ago

As a partial start on the first one, here's some proof-of-concept SQL:

SELECT
    FacilityID AS SubmissionID,
    EPAFacilityID,
    FacilityName AS FacName,
    FacilityCity AS FacCity,
    FacilityState AS FacState,
    TRIM(
        COALESCE(FacilityStr1, '') || ' • ' || COALESCE(FacilityStr2, '')
    ) AS FacAddr,
    SubmissionType,
    SUBSTR(MAX(ReceiptDate), 1, 11) AS ReceiptDate
FROM TblS1Facilities
GROUP BY EPAFacilityID
ORDER BY ReceiptDate DESC, EPAFacilityID DESC;

Running that against the RMPData.sqlite file produces this (showing just first five rows):

SubmissionID EPAFacilityID FacName FacCity FacState FacAddr SubmissionType ReceiptDate
1000099890 100000232401 Danville Terminal Danville IL 3400 East Jones Road • R 2022-02-25
1000100087 100000224544 Unified Grocers, Inc. Stockton CA 1990 Piccoli Road • R 2022-02-25
1000100106 100000218016 Allied Agronomy Barlow Barlow ND 6648 11th St NE • R 2022-02-25
1000100107 100000165028 Webster Central Wastewater Treatment Facility Webster TX 855 Magnolia Avenue • R 2022-02-25
1000099363 100000133358 VanDeMark Chemical Inc. Lockport NY 1 North Transit Road • R 2022-02-25
jsvine commented 1 year ago

A bit more progress on the following:

They still need some work and fleshing out (e.g., flagging fields redacted as confidential business information), but something concrete to start working with and exploring. Eager for feedback / suggestions for improvement!

Submissions.csv

Column Description Sample Row
SubmissionID Renamed from FacilityID 1000100063
ReceiptDate The date EPA received the submission 2022-02-23
ValidationDate The date the EPA finished validating the submission. 2022-02-23
SubType The submission type (F: First-time Submission, R: Resubmission, C: Correction) R
SubReason The reason given for submitting, chosen from a list of options. Voluntary update (not described by any of the above reasons)
EPAFacilityID The core facility identifier, linkable to other EPA datasets 100000043099
FacName Facility name listed on this submission H.B. Fuller - Roseville Plant
FacFTE The number of full-time equivalent staff at the facility. 25
FacCompany1 Main facility owner H.B. Fuller
FacCompany2 Secondary facility owner
FacOperator Facility operator H.B. Fuller
Chemicals Chemicals disclosed in the submission. Quantity, supposed to be reported in pounds, indicated as {number}. Sometimes can include duplicate entries when two separate processes use the same kind of chemicals. Vinyl acetate monomer [Acetic acid ethenyl ester] {869925} • Public OCA Chemical {0}
AccidentChemicals Chemicals released in the accidents in this submission's five-year accident history, with quantities released as {number} Vinyl acetate monomer [Acetic acid ethenyl ester] {1}
NumAccidents Number of accidents in the submission's five-year accident history 1
LatestAccidentDate The most recent accident disclosed in the five-year accident history 2021-04-30

LatestSubmissions.csv

Same structure as above, but includes only the most recently-validated RMP for each facility.

Facilities.csv

These are facility-descriptive data extracted from each facility's most recent submission. It has some overlap with the file above, but focuses more on facility characteristics (such as city, coordinates, total submissions, etc.)

Column Description Sample Row
EPAFacilityID Core facility identifier 100000119917
FacName Facility name Beaver Creek Gas Plant
FacState Facility state WY
FacCity Facility city Riverton
FacAddr Facility street address 45 Beaver Creek Road
FacLat Facility latitude 42.84708
FacLng Facility longitude -108.31316
FacFTE Number of full-time equivalent staff 9
FacCompany1 Primary owner Devon Energy Production Co. LP
FacCompany2 Secondary owner Devon Energy Corporation
FacOperator Facility operator Devon Energy Production Co. LP
LatestSubmissionDate Date of most recent RMP submission 2018-04-09
LatestValidationDate Date of most recent RMP validation by EPA 2018-04-09
NumSubmissions Number of RMP submissions 6
jsvine commented 1 year ago

And here's a first pass at the accident-history data: Accidents.csv / 03-accidents.sql.txt

Important note: This includes all accident histories reported in all RMP submissions, which means that it includes duplicate entries for many accidents. Judging by the EPAFacilityID and AccidentDate, roughly half appear to be duplicates.

The spreadsheet is ordered first by EPAFacilityID (so that a facility's accidents are all grouped together), and then in reverse-chronological by accident date and time.

Data dictionary:

column description value
EPAFacilityID Core facility identifier 100000000081
FacName Facility name at time of RMP submission Burr Oak Branch
SubmissionID Renamed from FacilityID 1000010076
ValidationDate The date the EPA finished validating the submission. 2009-08-17
AccidentHistoryID An ID unique to the submission-accident entry, not to the accident itself 1000007699
AccidentDate Date of the accident 2004-04-12
AccidentTime Time of the accident as HHMM 1345
AccidentChemicals Chemicals released in this accident, with quantities released as {number} Ammonia (anhydrous) {5}
...many additional details The 66 remaining columns come directly from the tblS6AccidentHistory table. See the raw data dictionary and the relevant section of the EPA's RMP submission guide for context.
jsvine commented 1 year ago

This is (and has been for a while) done here!: https://github.com/data-liberation-project/epa-rmp-spreadsheets