Closed Frijol closed 3 years ago
@Frijol the old and new permits are linked on this link https://www.epa.gov/npdes-permits/chelsea-river-bulk-petroleum-storage-facilities-npdes-permits
thanks!
So, attempting to generate automated difference-finding between the old and new permits here: https://github.com/edgi-govdata-archiving/permit-diffs
But running into issues because they are formatted differently— the diffing engine is not able to match like chemicals to like—here's an example:
So it seems we will at the least need to do a reordering of the important information, and at the most just extract the relevant data, which might be easier in e.g. a spreadsheet anyway
This is something that would absolutely be worth bringing up during the first public comment: lack of facilitation of comparison between the two permit types. Here's part of the 2014 permit for Sunoco:
And the same part of the 2021 draft for Sunoco:
You can see that there are comparable lines (e.g. Benzene), but because these are provided only in PDF table format, it's impossible to compare the two without arduous hand-comparison. A more useful version of a request for public comment would include a summary of differences between the proposed and existing permits, ideally with explanations for each change, and even better if the data tables on both could be provided in a portable format e.g. CSV so that differences could be quickly understood, calculated, and tabulated.
A copy/paste of the data table on the left yields the unintelligible single-column data shown on the right:
ugh so you can make a bit of headway with the tables via excel formula, but they didn't even keep the number of cells per line consistent so it's still a full manual headache. Here's what that looks like, can be continued if useful: https://docs.google.com/spreadsheets/d/1mm5ic9UoD0hL2HYzBtXid-aIyv5N45KbrNW_e8q9KsE/edit#gid=605363090
...you can't even use this trick (bad as it is) on the 2014 PDF though, it throws a line of data into a single cell instead of a series of vertical cells.
Basically, I'd like to ask the EPA to (..in general) release portable formats in addition to these pretty-but-machine-useless PDFs
Over the past two weeks, I have been looking at the differences between the drafts and 2014 permits completely manually. So far, I have transferred all of the data from Sunoco and Chelsea Sandwich into excel (for both 2014 and the drafts). I completely agree, it has been a nightmare! I have been putting the PDFs into a converter to get them as an excel sheet, which leads to several empty cells. From there I deleted all the empty cells to make it more cohesive. For now I have been working on color coding the different rows to spot the differences between the two. Chelsea Sandwich is completely done but Sunoco still needs some color-coding. Sonja is working on Gulf, Annie on Irving and Hyun was trying to work through some problems with R to see about the Global permits. I believe there was three different ones condensed to a single permit for Global, which might also make finding differences little more complicated.
The google sheets I am currently working on is here https://docs.google.com/spreadsheets/d/1XxAgo3RWdmXvSbsDpBKLOGn2SXL8DpWbvXz_MjQAEYk/edit#gid=655207875
Nice! Also tagging @shansen5 who is exploring another technical approach to this (even if we end up doing this all manually, it would be really great to know how to do this more efficiently for future projects if at all possible)
I extracted the tables from Chelsea2014 and Chelsea2021 for a few pages, converting them to CSV. Let me know if this would be worth doing more of. The resulting CSVs are in this folder: https://drive.google.com/drive/folders/13jTzimhqb3_mzusUsFhi9NjYEQ9rr3Vm?usp=sharing. I used this tool: https://github.com/tabulapdf. The commands used with tabula are in the Chelsea-commands.txt file in the google folder. If the table isn't at the top of the page we must calculate an offset to the top of the table on the page. But it does a pretty good job.
I wasn't too successful at comparing the permits myself but here's the link to the doc that describes my work in trying to compare the three 2014 Global permits to the new combined draft permit for the three Global facilities (under the date 3/2/21): https://docs.google.com/document/d/1xC6yaxpHUSZt32yF6r0XzRYoziLsVPu92sBC2KLRHEs/edit
Whoa this is a huge improvement @shansen5! Here's your Gulf CSVs opened in OpenCalc, pasted into a google sheet so we can ref it, more or less matched up 2014 to 2021, with conditional formatting applied to highlight differences:
"Gulf from csv" tab on https://docs.google.com/spreadsheets/d/1mm5ic9UoD0hL2HYzBtXid-aIyv5N45KbrNW_e8q9KsE/edit#gid=23047417
@lizk143 does this help your existing workflow?
As you can see, the tables don't totally map to each other, some things got added, some removed, text changed, so the format highlighting is iffy at best. But it seems like a better starting point than trying to copy out values by hand
Also very impressed with your success on this @lee-hyun1. Do you have that R code in a repo we can access? Would be interested to play around with it a bit
@Frijol I don't currently have my code in a repo. Should I create a new one and put it in the EDGI github?
@lee-hyun1 as an attempt to reduce permissions complexity, I made this blank repo (everything in it is boilerplate). Would it work for you to push to that? https://github.com/edgi-govdata-archiving/eew-permit-pdf-analysis
Where can the old & new permits be found?
@saraannwylie is anyone working on this part specifically? Would it be helpful to see if I can do some automated change detection?