Carceral-Ecologies / Carceral-ECHO-data

In this repo we are building tools to assess environmental compliance and enforcement in US prisons, jails, and detention centers.
GNU General Public License v3.0

Joining ECHO and HIFLD data to filter for carceral facilities #1

Open shapironick opened 4 years ago

shapironick commented 4 years ago

The first step in analyzing the EPA data as they relate to carceral facilities is to filter the immense ECHO data set down to just carceral facilities. We have identified HIFLD as the best data set for carceral facilities, as the DOJ data is over a decade old. Performing this join is easier said than done, as the ECHO data do not have consistent industry labeling (SIC or NAICS codes) that would make for an easy join.

Executing a spatial join between the lat and long of ECHO data and the shapefiles of HIFLD appears difficult because the vast majority of lat/long data in ECHO are zipcode centroids.

This issue is to discuss how to do a "fuzzy join" and the workflow will be linked to in a new file in the repo.
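
One possible starting point for the fuzzy join, sketched under assumptions: normalize facility names and score candidate pairs with the string-similarity ratio from Python's standard-library `difflib`. The facility names, the `normalize` and `best_match` helpers, and the 0.85 cutoff are all hypothetical; nothing here reflects the actual ECHO or HIFLD column layout.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(name.split())

def best_match(name, candidates, cutoff=0.85):
    """Return (candidate, score) for the closest name, or None if below cutoff."""
    target = normalize(name)
    scored = [(c, SequenceMatcher(None, target, normalize(c)).ratio()) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= cutoff else None

# Hypothetical names, for illustration only
hifld_names = ["Example State Prison", "Sample County Jail"]
echo_names = ["EXAMPLE STATE PRISON", "Sample Co. Jail", "Acme Chemical Plant"]

for echo_name in echo_names:
    print(echo_name, "->", best_match(echo_name, hifld_names))
```

Name similarity alone will produce false positives, so the scores are probably best treated as candidates for review, ideally combined with address or location checks.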

shapironick commented 4 years ago

@ericnost pointed out that the data on EPA's web portal has seemingly good location data that is sourced from the FRS database. So perhaps we could join the ECHO data with the FRS location data and then perform the spatial joins with the HIFLD data from there? (thanks, Eric!)
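
A rough sketch of that first step, assuming both data sets share a registry ID column: attach FRS coordinates to ECHO rows before any spatial join. The field names (`REGISTRY_ID`, `LATITUDE`, `LONGITUDE`, `FAC_NAME`), the helper name, and the sample records are assumptions for illustration, not confirmed column names.

```python
# Hypothetical records; the field names here are assumptions
echo_rows = [
    {"REGISTRY_ID": "110000000001", "FAC_NAME": "Example State Prison"},
    {"REGISTRY_ID": "110000000002", "FAC_NAME": "Sample County Jail"},
]
frs_rows = [
    {"REGISTRY_ID": "110000000001", "LATITUDE": 38.5, "LONGITUDE": -121.7},
]

def attach_frs_coordinates(echo_rows, frs_rows):
    """Left-join ECHO rows to FRS coordinates on the shared registry ID."""
    coords = {row["REGISTRY_ID"]: (row["LATITUDE"], row["LONGITUDE"]) for row in frs_rows}
    joined = []
    for row in echo_rows:
        # Rows without an FRS location keep None so they can be reviewed later
        lat, lon = coords.get(row["REGISTRY_ID"], (None, None))
        joined.append({**row, "LATITUDE": lat, "LONGITUDE": lon})
    return joined

joined = attach_frs_coordinates(echo_rows, frs_rows)
```

Keeping unmatched rows (a left join) rather than dropping them makes it easier to see how many ECHO records lack an FRS location.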

shapironick commented 4 years ago

Hi @lindsaypoirier and @klfranco, I'm wondering if part of the joining issue is the assumption that ECHO does indeed have all of the carceral facilities listed. @nathanqtran922 is doing some preliminary work looking at MS and is finding that the vast majority of facilities in the prison sheet are not in the ECHO database. It looks like the much more numerous but smaller facilities may not be registered, but we're investigating now. This could also be a Deep South thing, whereas the regulatory might of CA might have them all registered.

lindsaypoirier commented 4 years ago

Do you think that the FRS registry would definitely have all of the carceral facilities listed? It would definitely be a reason why the join is producing only 25% results if only 25% of the facilities are in ECHO.

shapironick commented 4 years ago

I don't think FRS has all of 'em! I've only spot-checked a couple in the FRS query, as it's very clunky. I'll have a full update on Wed, but I would say in MS the rate of listed prisons is even lower than 25%.

EPA had this to say about who is and isn't included in ECHO:

There are several reasons why a facility may not be in ECHO. For example, the facility could be below regulatory thresholds, thus not regulated. Small Clean Air Act (CAA) facilities are not required to be entered into EPA's database, but may be regulated by the state. All Clean Water Act (CWA) direct dischargers, Resource Conservation and Recovery Act (RCRA) handlers, and public water systems as defined by the Safe Drinking Water Act (SDWA) should be listed in ECHO.

My guess is the smaller facilities don't meet the regulatory requirements here. I didn't see any facilities listed in MS with fewer than 1k inmates, but I have seen one with 998 inmates that wasn't listed.

benmillam commented 4 years ago

No breakthroughs but a few updates:

I used 'geodatabase' files for FRS (inspired by Erica last week, thanks!) and shapefiles for HIFLD, along with the 'sf' package in R, to find all facilities within 1 mile of each prison; I'll work on cleaning up and pushing my code in case it's useful at some point.
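
For readers without the R setup, the "within 1 mile" idea can be sketched in plain Python. This is not the `sf` workflow described above: it swaps in a haversine great-circle distance between points, and it assumes each prison is reduced to a single point (for example a polygon centroid). The function names, dictionary keys, and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def facilities_near(prison, facilities, radius_miles=1.0):
    """Return the facilities within radius_miles of the prison's point location."""
    return [
        f for f in facilities
        if haversine_miles(prison["lat"], prison["lon"], f["lat"], f["lon"]) <= radius_miles
    ]

# Hypothetical coordinates
prison = {"name": "Example State Prison", "lat": 38.5000, "lon": -121.7000}
frs_facilities = [
    {"id": "A", "lat": 38.5050, "lon": -121.7000},  # about 0.35 miles north
    {"id": "B", "lat": 38.6000, "lon": -121.7000},  # about 6.9 miles north
]
nearby = facilities_near(prison, frs_facilities)
```

Measuring from a centroid will miss facilities near the edge of a large prison footprint, so buffering the actual polygon (as a GIS package can) is likely the better choice for real data.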

shapironick commented 4 years ago

Amazing work! Thank you, Ben! Would it be helpful to have help cleaning the data? If so, I'm happy to help, or Nathan could if he's free. Pushing your code and creating the documentation for it would be great as well.

I just received this list of FRS ID numbers from a 2016 FOIA request of ECHO data about prisons.

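ECHO Facility Data - Prisons - Correctional Facilities 11-28-16.xlsx https://github.com/Carceral-Ecologies/Carceral-ECHO-data/files/3918199/ECHO.Facility.Data.-.Prisons.-.Correctional.Facilities.11-28-16.xlsx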

It's from Candice Bernd at Truthout. They tried to do an analysis based on this but couldn't really make heads or tails of the data.

Thank you, Ben!

nathanqtran922 commented 4 years ago

Hey Ben,

I’d be happy to help, as Nick said. Thank you for doing this work!

Best, Nathan

benmillam commented 4 years ago

Thanks! @shapironick @nathanqtran922

I think if we were to start with the HIFLD prison list and use manual coding, we could review our 'best guess' matches from fuzzy 'probabilistic' matching and spatial joining. Maybe it would be useful to draft some coding criteria, a standardized process (doesn't have to be complex) used to check matched records -- we could then divvy up the work.

Another approach, in discussion with @klfranco, Savannah, Aishwarya, is to start with the FRS facility database and ID as many carceral facilities as we can by some search criteria (e.g. use NAICS/SIC codes, and some name keyword matching, similar to that spreadsheet from Truthout), without regard to matching them with the HIFLD list.

For this second approach, this notebook summary of my attempts to ID carceral facilities in FRS data may be useful.
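
A minimal sketch of that second approach, assuming each FRS record carries a name and a comma-separated string of NAICS codes. The keyword list, the record layout, and the `looks_carceral` helper are hypothetical; only the NAICS code 922140 comes from this thread.

```python
import re

CARCERAL_NAICS = {"922140"}  # NAICS code cited in this thread
# Hypothetical keyword list; a real one would need review for false positives
KEYWORDS = re.compile(r"\b(prison|jail|correctional|detention|penitentiary)\b", re.IGNORECASE)

def looks_carceral(record):
    """Flag a record by NAICS code or by a keyword in its name."""
    codes = {c.strip() for c in record.get("naics_codes", "").split(",") if c.strip()}
    if codes & CARCERAL_NAICS:
        return True
    return bool(KEYWORDS.search(record.get("name", "")))

# Hypothetical records
records = [
    {"name": "Example State Prison", "naics_codes": ""},
    {"name": "Sample Facility", "naics_codes": "221320, 922140"},
    {"name": "Acme Chemical Plant", "naics_codes": "325199"},
]
flagged = [r["name"] for r in records if looks_carceral(r)]
```

Because industry codes are inconsistently filled in, the name keywords do most of the work here, which is also why a manual review pass on the flagged records would still be needed.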

shapironick commented 4 years ago

Developing a standard method sounds good. Will y'all meet on Tuesday? Maybe I can call in if you do? It will be 8pm MS time.

benmillam commented 4 years ago

The Davis group won't meet again until the next quarter starts in January, but I'm available throughout December. I'll reach out on Slack.

benmillam commented 4 years ago

From convo w @shapironick yesterday:

shapironick commented 4 years ago

Great! We'll start work on this the week of Jan 6th. Many thanks!

shapironick commented 4 years ago

Now that we've finished manually coding all the carceral facilities in CA and validated that list, I'm circling back to this thread to see how our list compares to what the EPA data has.

When I do the FRS search for EPA data in CA with a NAICS code of 922140, I get 147 facilities (link).

Our list is 408 long, with 345 matches, 57 no matches, and 6 possible matches. So just on the surface level the EPA data has 43% of what we found (147 against our 345 matches). But 217 have multiple FRS IDs, and some have as many as 10 or 11.

Next I'll be attempting to see the total number of FRS IDs we found and whether there are any from the EPA list that are not on ours. I'll also be trying to split the strings we have of multiple FRS numbers in a single cell into discrete cells. May the gods of Jupyter shine upon me.
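
A sketch of the cell-splitting step, under the assumption that multiple FRS IDs in one cell are separated by commas, semicolons, or whitespace (the actual delimiter in our sheet may differ). The column name `frs_ids`, the helper names, and the ID values are hypothetical.

```python
import re

def split_frs_ids(cell):
    """Split a cell holding several FRS IDs into a list of individual IDs."""
    if not cell:
        return []
    return [part for part in re.split(r"[,;\s]+", str(cell).strip()) if part]

def explode_rows(rows, column="frs_ids"):
    """Yield one (facility name, FRS ID) pair per ID: a 'long' layout."""
    for row in rows:
        for frs_id in split_frs_ids(row.get(column)):
            yield row["name"], frs_id

# Hypothetical rows
rows = [
    {"name": "Example State Prison", "frs_ids": "110000000001, 110000000002;110000000003"},
    {"name": "Sample County Jail", "frs_ids": "110000000004"},
    {"name": "No Match Facility", "frs_ids": ""},
]
pairs = list(explode_rows(rows))
unique_ids = {frs_id for _, frs_id in pairs}
```

A long layout with one row per FRS ID makes the total count a simple `len(unique_ids)` and keeps each ID tied to its facility for later joins.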

shapironick commented 4 years ago

More CA updates.

Here is the distribution of matches/no matches across the types of facilities:

[Figure: Screen Shot 2020-05-17 at 10 33 42 PM]

Interestingly, we have many more no matches among open facilities:

[Figure: CA Closed v open by match status]

Here is the distribution of capacity across CA facilities (124 are unknown/negative):

[Figure: CA capacity histogram]

And we have more of the smaller and unknown capacity facilities with no match:

[Figure: CA capacity by match status]

shapironick commented 4 years ago

Disproportionate no matches at juvenile facilities:

[Figure: CA Security level by match status]

shapironick commented 4 years ago

In CA we have 803 FRS ID numbers for carceral facilities, in contrast to EPA's 147 (via FRS, link above). ECHO downloader only has 23.

shapironick commented 4 years ago

I looked up the NAICS code 922140 in the EPA FRS search tool. Of the 147 FRS IDs listed there, 49 were not in our database. So we should a) figure out which ones we're not finding, but b) it's not a big deal because we can just add them to our list. The question is how: do we go through and figure out which carceral facility each one matches to, or do we just tack 'em on at the bottom of the list?
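
The comparison itself can be sketched with set operations, assuming both lists are reduced to plain FRS ID strings. The ID values and the `compare_id_lists` helper are hypothetical.

```python
def compare_id_lists(ours, epa):
    """Return IDs only in EPA's list, only in ours, and in both."""
    ours, epa = set(ours), set(epa)
    return {
        "epa_only": sorted(epa - ours),
        "ours_only": sorted(ours - epa),
        "both": sorted(ours & epa),
    }

# Hypothetical IDs
our_ids = ["110000000001", "110000000002", "110000000003"]
epa_ids = ["110000000002", "110000000003", "110000000009"]
result = compare_id_lists(our_ids, epa_ids)
```

The `epa_only` list is the set of IDs to investigate. One option would be to append them with an "unmatched" flag first and assign them to facilities afterwards, so that nothing is lost while the matching question is settled.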