Carceral-Ecologies / Carceral-OSHA-Data

GNU General Public License v3.0
2 stars 3 forks source link

Build OSHA violation database #1

Closed shapironick closed 3 years ago

shapironick commented 4 years ago

Here is the search UI for OSHA's violation database: https://www.osha.gov/pls/imis/industry.html

Luckily it seems that NAICS codes are used in this data so we can search by correctional institutions code: 922140

The NAICS code has been collected for inspections since 2003. So we should make the DB for jan 1 2003 until present day.

shapironick commented 4 years ago

I think the search function can only handle a decade at a time. when i searched for 2003- present it gave me 01/01/2009 to 10/23/2019

shapironick commented 4 years ago

Before going through and copying and pasting away your life lets see if we can get the data via a freedom of information (FOIA) request.

shapironick commented 4 years ago

We have received official acknowledgement of the FOIA. It has been assigned #884433. When they begin processing it, we will be able to track its progress at www.dol.gov/foia

savannahmhunter commented 4 years ago

Hi Nick. I was doing some looking around and I found these datasets on OSHA enforcement: https://enforcedata.dol.gov/views/data_summary.php. I downloaded the data dictionary and it looks like NAICS code is available and address, zipcode, and a over a 100 other fields. Do you think this might be useful?

savannahmhunter commented 4 years ago

And an api: https://developer.dol.gov/health-and-safety/dol-osha-enforcement/

shapironick commented 4 years ago

WHAT! This is an amazing lead. Thank you! I wish I knew how to use API's better (ie at all).

When i DL the violation CSV i only get the following headers

activity_nr | citation_id | delete_flag | standard | viol_type | issuance_date | abate_date | abate_complete | current_penalty | initial_penalty | contest_date | final_order_date | nr_instances | nr_exposed | rec | gravity

I can't tell what facility the violation is in or the NAICS code. Looking at the data dictionary it looks like NAICS codes are only in the inspections data and not violation. I'll DL inspections now. Thank you!

shapironick commented 4 years ago

There is no violation info in the inspection doc, but I think it might be possible to search 922140 in inspection and then note the "activity_nr" and then bring that over to the violation data. excellent.

savannahmhunter commented 4 years ago

You are right. It looks like detailed information about city, state, zipcode, NAICS code etc. are in the inspection dataset and the actual violation info is a separate dataset. But with the "activity_nr" code it seems like we could merge these two datasets together.

Alternatively on Tuesday we could look into setting up the API to only call the 922140 NAICS code from the inspection data and then isolate the "activity_nr" numbers. And then set up a second API to pull down all the of those specific "activity_nr" numbers from the violations dataset. If it works we would have specifically what we want and the merge might be a little easier.

I haven't set up an API but we have people in the group who I am sure can help us get it set up.

shapironick commented 4 years ago

Agreed! I think the API route might be easiest especially because the downloaded data is spread across multiple spreadsheets for each download (nice to make is small and openable not nice for analysis). we should be sure to limit our search to 2003 onwards as that's when they started to use NAICS in their data.

So for Tuesday we could have 3 groups:

1: working on the joining after eliminating smaller facilities 2: scraping the violation and enforcement data via API related to the registered MS facilities. Its a good way to test that work, and test it on data that is immediately valuable. Data w/ faclities numbers is available in this issue. 3: downloading the OSHA data as you describe via API

savannahmhunter commented 4 years ago

An update: The API call is not working. The department of labor has two types of API calls - version 1 and version 2. Version 2 is what we have to use to get the inspection data but it is not working. I found an open issue on this here: https://github.com/USDepartmentofLabor/Developer/issues/141. I tried their recommended solution but it is still not working. I will respond to them that the solution doesn't work but in the meantime we might want to just work with the csv files on the site.

Instead of the API call we can download the csv files and then filter so only prisons remain. This will take a bit because the osha data is split across five csvs and those will need to be merged. Then in terms of filtering prisons I realized when I search the word "prison" in the data some prisons are not classified under the NAICS code (likely because they are older - the data goes back to 1970) or they are misclassifed under NAICS or SIC codes. So it is possible using the API might have left us with a dataset with fewer prisons than actually exist.

shapironick commented 4 years ago

So well sleuthed! I've been checking on that DOL issue, it looks like (as of an hour ago) that the DOL webmaster is asking for details of your process. Maybe worth emailing her/him/them? I don't think I can get the csv work done before meeting and have to meet with the formerly incarcerated student group during our regular meeting time tonight. do you think it would be best to contact DOL for the API fix or muscle through the CSVs? Thank you, Savannah!

savannahmhunter commented 4 years ago

Hi Nick. I will check in with the DOL. It might be best to work with the CSVs in the meantime. I started on that last week and will keep plugging away.

shapironick commented 4 years ago

You are golden! thank you!

shapironick commented 4 years ago

Happy new year! It looks like on Jan 6th the error may have been fixed? does it work for you now?

savannahmhunter commented 4 years ago

Hi Nick! Happy New Year. I haven't tried the API again. What I did in December was download the OSHA inspections and violations datasets and I started writing some code to try to isolate the prison facilities in the inspection dataset by using the SIC/NAICS codes AND searching for key terms our group identified (see the google doc "echo_prison_merge" in the google drive to see the list of terms). In looking over the dataset I learned that many prison facilities were not properly classified by the SIC or NAICS codes so using those codes for the API would result in a return of fewer prison facilities than actually exist in the data.

My thinking is once we can isolate all of the prisons in the inspections dataset we can then use the activity_nr numbers to pull relevant information from the violations dataset (or any others we want). We could definitely try the API for this step. But I haven't moved beyond the first step of isolating prisons in the inspections dataset which has to be done in order to get the activity_nr number to match with the other datasets. Our team over the last couple weeks has been focusing on understanding the data in ECHO better and are writing up a report on what is in there and what variables we can use for an analysis.

In the meantime I will upload the datasets I downloaded to the google drive and work on getting the code I started up on github so we can all see it.

shapironick commented 4 years ago

the eternal SIC/NAICS code drama!! This sounds excellent. Thank you so much! I'm still pushing forward w/ my FOIA but it will only have the SIC/NAICS code data so yours will be more robust. THANK YOU!

shapironick commented 4 years ago

Hi Savannah,

I just heard back from OSHA in response. There was a little bit of push back last month but pointing to your issue posting on DOL GitHub was helpful in getting past that snag.

Shapiro response.pdf

The rest of the attachments won't upload so I'll forward it to you.

savannahmhunter commented 4 years ago

Hi Nick! We looked through the response and attachments this evening. In the response they recommended we use this data explorer. We looked at it and it is useful for isolating the data you want by industry and it gives you all the datasets (violations, inspections, etc) at once which is great because using the data catalog you have to download each dataset individually and then match facilities across.

We did use the explorer to isolate prisons by the sic code and the explorer found over 7000 inspections for prisons BUT it would not let us download the data because it said the file size was too big and directed us to use the data catalog....which is what I used originally to download all the OSHA inspection data. Again the data catalog makes you download each dataset (inspection, violation etc) individually and then merge them yourself. So their suggestion is somewhat helpful but the data being too big we can't actually use the explorer unless we further subset by state or something to make the file size smaller, in which case we would have to merge them anyway.

Additionally using the explorer means that any prisons that were not properly classified with the 9223 sic code do not appear in the dataset. Part of the code I am working on from the full inspection data I downloaded in December from the data catalog is to search for facility names that are likely prisons but are not coded as 9223. So far we found over 1200 inspections that may be prisons. We need to spot check more thoroughly next week to make sure. We know for sure that some of them are not prisons. But if that number is close and the data explorer is missing that many prison facilities it might just be better to stick with the data I downloaded from the data catalog to be more precise. We plan to look into this more next week. If it does turn out that over 1000 prisons are not properly classified we may eventually want to let the DOL know. Perhaps they are already aware of coding issues or have noted this somewhere in the documentation but pointing people to the data explorer as a solution may not be ideal if not all of the data is coded properly.

shapironick commented 4 years ago

Thank you Savannah! I'm sorry this has been such an epic pain! Your path forward sounds great and I totally agree about letting DOL know about the errors. We're planning on doing the same for EPA.

Excellent work y'all!

On Tue, Feb 25, 2020 at 8:16 PM savannahmhunter notifications@github.com wrote:

Hi Nick! We looked through the response and attachments this evening. In the response they recommended we use this data explorer https://enforcedata.dol.gov/views/searchExplorer.php. We looked at it and it is useful for isolating the data you want by industry and it gives you all the datasets (violations, inspections, etc) at once which is great because using the data catalog you have to download each dataset individually and then match facilities across.

We did use the explorer to isolate prisons by the sic code and the explorer found over 7000 inspections for prisons BUT it would not let us download the data because it said the file size was too big and directed us to use the data catalog....which is what I used originally to download all the OSHA inspection data. Again the data catalog makes you download each dataset (inspection, violation etc) individually and then merge them yourself. So their suggestion is somewhat helpful but the data being too big we can't actually use the explorer unless we further subset by state or something to make the file size smaller, in which case we would have to merge them anyway.

Additionally using the explorer means that any prisons that were not properly classified with the 9223 sic code do not appear in the dataset. Part of the code I am working on from the full inspection data I downloaded in December from the data catalog is to search for facility names that are likely prisons but are not coded as 9223. So far we found over 1200 inspections that may be prisons. We need to spot check more thoroughly next week to make sure. We know for sure that some of them are not prisons. But if that number is close and the data explorer is missing that many prison facilities it might just be better to stick with the data I downloaded from the data catalog to be more precise. We plan to look into this more next week. If it does turn out that over 1000 prisons are not properly classified we may eventually want to let the DOL know. Perhaps they are already aware of coding issues or have noted this somewhere in the documentation but pointing people to the data explorer as a solution may not be ideal if not all of the data is coded properly.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Carceral-Ecologies/Caceral-OSHA-Data/issues/1?email_source=notifications&email_token=ABZH333GGSVM4W3U5KCFNLDREXUKBA5CNFSM4JEM4EK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEM6XJEQ#issuecomment-591230098, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZH33Z7NTXPIFDWB5NQLG3REXUKBANCNFSM4JEM4EKQ .

-- Nicholas Shapiro Assistant Professor UCLA Institute for Society and Genetics Office: (310) 206-2366

savannahmhunter commented 4 years ago

Hi Nick. I think we made some great progress tonight. Rhea and I have got the OSHA inspection and violation datasets ready! Both datasets are in the HACK4CA google drive. The datasets include prisons with inspection closure dates from 2010 and later. We decided to isolate the data for recent dates because we discovered some of the prisons from 1970 did not exist anymore and we decided more recent data is probably a better indicator of potential issues. We isolated prisons using the SIC code and we searched the facility names using a list of keywords we created to identify prisons that did not have a sic code.

The inspection dataset identified around 1500 inspections. The violation dataset identified a little over 2000 violations. There are 809 inspection numbers which do not have a corresponding violation.

There are other OSHA datasets that we may want to pull more information from. We should look into this at the next meeting. But this should be fairly easy to do now that we have the activity_nr numbers for prison facilities isolated.

Additionally, at the next meeting we should start doing some descriptive analysis on the datasets. We have not cleaned the datasets at all and looking through them there is likely a lot of missing data and other issues.

shapironick commented 4 years ago

Fantastic! That's so exciting! would you maybe want to zoom with my lab after you have cleaned up the data and started the descriptive analysis? just let me know? congrats on pushing through and yielding a workable dataset!

On Tue, Mar 10, 2020 at 8:23 PM savannahmhunter notifications@github.com wrote:

Hi Nick. I think we made some great progress tonight. Rhea and I have got the OSHA inspection and violation datasets ready! Both datasets are in the HACK4CA google drive. The datasets include prisons with inspection closure dates from 2010 and later. We decided to isolate the data for recent dates because we discovered some of the prisons from 1970 did not exist anymore and we decided more recent data is probably a better indicator of potential issues. We isolated prisons using the SIC code and we searched the facility names using a list of keywords we created to identify prisons that did not have a sic code.

The inspection dataset identified around 1500 inspections. The violation dataset identified a little over 2000 violations. There are 809 inspection numbers which do not have a corresponding violation.

There are other OSHA datasets https://enforcedata.dol.gov/views/data_summary.php that we may want to pull more information from. We should look into this at the next meeting. But this should be fairly easy to do now that we have the activity_nr numbers for prison facilities isolated.

Additionally, at the next meeting we should start doing some descriptive analysis on the datasets. We have not cleaned the datasets at all and looking through them there is likely a lot of missing data and other issues.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Carceral-Ecologies/Caceral-OSHA-Data/issues/1?email_source=notifications&email_token=ABZH337H55PCIKHHXOJFCW3RG37YRA5CNFSM4JEM4EK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEON7SUQ#issuecomment-597424466, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZH3344VSX4LGACNKHZUNTRG37YRANCNFSM4JEM4EKQ .

-- Nicholas Shapiro Assistant Professor UCLA Institute for Society and Genetics Office: (310) 206-2366

savannahmhunter commented 4 years ago

Of course! We are not scheduled to meet next week as it is finals week. And the following week is spring break. So Rhea and I will be able to start in on this March 31. I imagine it might take us a couple weeks. So probably around mid-April we should have some analysis ready.

shapironick commented 4 years ago

Zoinks! https://www.ire.org/wp-content/uploads/2019/03/osha.pdf osha

savannahmhunter commented 3 years ago

This project is complete. The dataset is available in the google drive folder in both long and wide versions. CA_prison_insp_viol_2010_op_fac_long.csv and CA_prison_insp_viol_2010_op_fac_wide.csv.

savannahmhunter commented 3 years ago

I am going to close this issue.