edgi-govdata-archiving / Environmental_Enforcement_Watch

General planning for EEW events
GNU General Public License v3.0
2 stars 3 forks source link

Summarize difference between proposed Chelsea Creek permits and existing ones #399

Closed Frijol closed 3 years ago

Frijol commented 3 years ago

Where can the old & new permits be found?

@saraannwylie is anyone working on this part specifically? Would it be helpful to see if I can do some automated change detection?

lizk143 commented 3 years ago

@Frijol the old and new permits are linked on this link https://www.epa.gov/npdes-permits/chelsea-river-bulk-petroleum-storage-facilities-npdes-permits

Frijol commented 3 years ago

thanks!

Frijol commented 3 years ago

So, attempting to generate automated difference-finding between the old and new permits here: https://github.com/edgi-govdata-archiving/permit-diffs

But running into issues because they are formatted differently— the diffing engine is not able to match like chemicals to like—here's an example: image

So it seems we will at the least need to do a reordering of the important information, and at the most just extract the relevant data, which might be easier in e.g. a spreadsheet anyway

Frijol commented 3 years ago

This is something that would absolutely be worth bringing up during the first public comment: lack of facilitation of comparison between the two permit types. Here's part of the 2014 permit for Sunoco: image

And the same part of the 2021 draft for Sunoco: image

You can see that there are comparable lines (e.g. Benzene), but because these are provided only in PDF table format, it's impossible to compare the two without arduous hand-comparison. A more useful version of a request for public comment would include a summary of differences between the proposed and existing permits, ideally with explanations for each change, and even better if the data tables on both could be provided in a portable format e.g. CSV so that differences could be quickly understood, calculated, and tabulated.

Frijol commented 3 years ago

A copy/paste of the data table on the left yields the unintelligible single-column data shown on the right: image

Frijol commented 3 years ago

ugh so you can make a bit of headway with the tables via excel formula, but they didn't even keep the number of cells per line consistent so it's still a full manual headache. Here's what that looks like, can be continued if useful: https://docs.google.com/spreadsheets/d/1mm5ic9UoD0hL2HYzBtXid-aIyv5N45KbrNW_e8q9KsE/edit#gid=605363090

Frijol commented 3 years ago

...you can't even use this trick (bad as it is) on the 2014 PDF though, it throws a line of data into a single cell instead of a series of vertical cells.

Basically, I'd like to ask the EPA to (..in general) release portable formats in addition to these pretty-but-machine-useless PDFs

lizk143 commented 3 years ago

Over the past two weeks, I have been looking at the differences between the drafts and 2014 permits completely manually. So far, I have transferred all of the data from Sunoco and Chelsea Sandwich into excel (for both 2014 and the drafts). I completely agree, it has been a nightmare! I have been putting the PDFs into a converter to get them as an excel sheet, which leads to several empty cells. From there I deleted all the empty cells to make it more cohesive. For now I have been working on color coding the different rows to spot the differences between the two. Chelsea Sandwich is completely done but Sunoco still needs some color-coding. Sonja is working on Gulf, Annie on Irving and Hyun was trying to work through some problems with R to see about the Global permits. I believe there was three different ones condensed to a single permit for Global, which might also make finding differences little more complicated.

lizk143 commented 3 years ago

The google sheets I am currently working on is here https://docs.google.com/spreadsheets/d/1XxAgo3RWdmXvSbsDpBKLOGn2SXL8DpWbvXz_MjQAEYk/edit#gid=655207875

Frijol commented 3 years ago

Nice! Also tagging @shansen5 who is exploring another technical approach to this (even if we end up doing this all manually, it would be really great to know how to do this more efficiently for future projects if at all possible)

shansen5 commented 3 years ago

I extracted the tables from Chelsea2014 and Chelsea2021 for a few pages, converting them to CSV. Let me know if this would be worth doing more of. The resulting CSVs are in this folder: https://drive.google.com/drive/folders/13jTzimhqb3_mzusUsFhi9NjYEQ9rr3Vm?usp=sharing. I used this tool: https://github.com/tabulapdf. The commands used with tabula are in the Chelsea-commands.txt file in the google folder. If the table isn't at the top of the page we must calculate an offset to the top of the table on the page. But it does a pretty good job.

lee-hyun1 commented 3 years ago

I wasn't too successful at comparing the permits myself but here's the link to the doc that describes my work in trying to compare the three 2014 Global permits to the new combined draft permit for the three Global facilities (under the date 3/2/21): https://docs.google.com/document/d/1xC6yaxpHUSZt32yF6r0XzRYoziLsVPu92sBC2KLRHEs/edit

Frijol commented 3 years ago

Whoa this is a huge improvement @shansen5! Here's your Gulf CSVs opened in OpenCalc, pasted into a google sheet so we can ref it, more or less matched up 2014 to 2021, with conditional formatting applied to highlight differences:

image

"Gulf from csv" tab on https://docs.google.com/spreadsheets/d/1mm5ic9UoD0hL2HYzBtXid-aIyv5N45KbrNW_e8q9KsE/edit#gid=23047417

@lizk143 does this help your existing workflow?

As you can see, the tables don't totally map to each other, some things got added, some removed, text changed, so the format highlighting is iffy at best. But it seems like a better starting point than trying to copy out values by hand

Frijol commented 3 years ago

Also very impressed with your success on this @lee-hyun1. Do you have that R code in a repo we can access? Would be interested to play around with it a bit

lee-hyun1 commented 3 years ago

@Frijol I don't currently have my code in a repo. Should I create a new one and put it in the EDGI github?

Frijol commented 3 years ago

@lee-hyun1 as an attempt to reduce permissions complexity, I made this blank repo (everything in it is boilerplate). Would it work for you to push to that? https://github.com/edgi-govdata-archiving/eew-permit-pdf-analysis