These are such great questions! Some quick guiding thoughts:
Money: The way they set the fines is usually pretty arbitrary; it's a similar process to how the EPA administers fines. So I wouldn't use that as a proxy for toxicity/harm. But looking at how fines are distributed on their own is going to be fascinating!
Inspections: Inspections are less an index of toxicity and more an index of attention by the state. We might want to also see if inspection # is correlated with violation #.
Number of workers exposed: My guess is that this is a very conservative (non-representative) number, but more research would be needed. It might be that they cite based on how many workers were seen to be exposed at the time of inspection rather than all the people potentially exposed.
Number of violations: Yes!
Standards cited: This will take some time and is going to be a long, hard process. I would only unpack this after the preliminary violation analysis (just to make sure the data is worthwhile).
Here's the Google Doc we're working off of: https://docs.google.com/document/d/1jAeEZE8QrR86DcsiqfHyJC7dJKJuhn24/edit
Code to import the data:

```r
viol <- read.csv(file = "Cleaned_Data/prison_insp_viol_2010.csv", header = TRUE, stringsAsFactors = FALSE)
```
So everything but violations, as Savannah will be doing the violation analysis!
I just wanted to post some of the descriptive analysis I've been doing here. Specifically, I started with the number-exposed variable because it seemed like an easy way to get my feet wet.
```r
ggplot(data = viol) +
  geom_histogram(mapping = aes(x = nr_exposed), binwidth = 0.5, na.rm = TRUE)

max(viol$nr_exposed, na.rm = TRUE)
min(viol$nr_exposed, na.rm = TRUE)
mean(viol$nr_exposed, na.rm = TRUE)
median(viol$nr_exposed, na.rm = TRUE)
```
The max is 7500, the min is 0, the mean is 123.87, and the median is 8. I was thinking about finding the standard deviation, skew, etc., but I don't know how helpful that is, seeing as I'm just looking at the count. Is the goal to try to find a correlation between two variables, e.g., number exposed and gravity?
My thinking is that if gravity represents how serious a violation is, we could see how much exposure plays into that rating.
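In case it's useful, here's a rough sketch of how that comparison might look, assuming the merged data has a numeric `gravity` column (I haven't checked the column name or type against the data dictionary, so treat this as a guess):

```r
# Sketch only: `gravity` is assumed to exist and to be numeric in the merged data
sd(viol$nr_exposed, na.rm = TRUE)   # spread of the exposure counts

# Rank-based correlation is probably safer given how right-skewed nr_exposed is
cor(viol$gravity, viol$nr_exposed, use = "complete.obs", method = "spearman")

# Quick scatter plot of gravity against number exposed
viol %>%
  filter(!is.na(gravity), !is.na(nr_exposed)) %>%
  ggplot(aes(x = gravity, y = nr_exposed)) +
  geom_point(alpha = 0.3) +
  labs(title = "Gravity vs. Number Exposed", x = "Gravity", y = "Number Exposed")
```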
This is great! I think zooming in on some of those higher numbers of exposed people will be interesting later on. For example, we could zoom in on anything over 100 people, as that would likely include incarcerated workers (one way to build a rough proxy for which issues relate to incarcerated people).
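Something like this could pull those records out (purely a sketch; the 100-person cutoff is just the rough heuristic above, not anything validated):

```r
# Violation records where more than 100 people were recorded as exposed --
# a rough proxy for issues likely involving incarcerated workers
high_exposure <- viol %>%
  filter(!is.na(nr_exposed), nr_exposed > 100) %>%
  arrange(desc(nr_exposed))

nrow(high_exposure)                  # how many such records there are
head(high_exposure$estab_name, 10)   # which facilities they come from
```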
Maybe you can experiment with some histogram plots? You can play around with the bin size to see how that affects the plot.
Here's a histogram w/ the mean plotted on top of it
```r
library(ggpubr)  # gghistogram() comes from the ggpubr package

gghistogram(my_data, x = "col_name", bins = 9, add = "mean")
```
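If you'd rather stick with plain ggplot2, one quick way to experiment is to loop over a few binwidths and compare the shapes (the binwidth values here are arbitrary, just to see how the plot changes):

```r
library(ggplot2)

# Try a few binwidths on the same variable and compare the resulting shapes
for (bw in c(1, 10, 100)) {
  print(
    ggplot(viol, aes(x = nr_exposed)) +
      geom_histogram(binwidth = bw, na.rm = TRUE) +
      ggtitle(paste("nr_exposed, binwidth =", bw))
  )
}
```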
Right now we are on a very basic level of exploration! Gotta keep it basic for now; otherwise we won't know what we're doing when we get deeper into the data and it will be meaningless. We can get to the analysis part a bit more once we learn more about the dataset. Now that Savannah has amazingly added the facility ID #s and names, it could be very important to see just how many unique facilities there are. We can compare that number to the number of non-state facilities in the HIFLD database, and that should give us an understanding of the coverage of this dataset, which is super important. https://stackoverflow.com/questions/41906878/r-number-of-unique-values-in-a-column-of-data-frame
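A minimal sketch of that unique-facility count and coverage comparison (the HIFLD total is a placeholder; I don't have the real number in front of me):

```r
library(dplyr)

# How many unique facility names appear in the OSHA data
n_facilities <- n_distinct(viol$estab_name)
n_facilities

# Rough coverage estimate: replace the placeholder with the actual count of
# non-state facilities pulled from the HIFLD database
hifld_total <- NA_integer_  # placeholder
n_facilities / hifld_total
```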
What other basic things would we want to know? (Try to come up with some of these and perform them on your own.)
Also a histogram of how many unique inspections there are per facility would be helpful.
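A sketch of that per-facility inspection histogram, assuming the merged table still carries OSHA's inspection identifier as `activity_nr` (swap in the real column name if it differs):

```r
library(dplyr)
library(ggplot2)

# Count unique inspections per facility, then look at the distribution
insp_per_facility <- viol %>%
  group_by(estab_name) %>%
  summarize(n_inspections = n_distinct(activity_nr)) %>%
  ungroup()

ggplot(insp_per_facility, aes(x = n_inspections)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Unique Inspections per Facility",
       x = "Number of unique inspections", y = "Number of facilities")
```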
You could start a Google Doc in the Drive and keep a log of your explorations, then share them with us on Tuesday (and/or post a link here). Once you feel like you've got a good understanding of the basics of the data, you can move on to things like gravity vs. number exposed, and then you can tell us what you think that plot tells us. This is about you learning how to do this and us learning about the database at the same time, so don't worry if you come back with more questions than answers :)
This is really encouraging Nick! Thanks so much for your help as always :)
```r
# Import the cleaned inspection/violation data
viol <- read.csv("prison_insp_viol_2010.csv", header = TRUE, stringsAsFactors = FALSE)
View(viol)

library(tidyverse)
library(lubridate)
library(shiny)
library(shinydashboard)

# How many unique establishments are in the data?
unique_establ <- viol %>% select(estab_name) %>% n_distinct()
unique_establ

# Record counts per establishment
table(viol$estab_name)
barplot(table(viol$estab_name))

# Total current and initial penalty amounts by state
viol %>%
  group_by(site_state) %>%
  summarize(current_penalty_cost = sum(current_penalty, na.rm = TRUE),
            initial_penalty_cost = sum(initial_penalty, na.rm = TRUE)) %>%
  ungroup()

# Total current and initial penalty amounts by establishment
viol %>%
  group_by(estab_name) %>%
  summarize(current_penalty_cost = sum(current_penalty, na.rm = TRUE),
            initial_penalty_cost = sum(initial_penalty, na.rm = TRUE)) %>%
  ungroup()

# Records per facility (note: geom_bar() counts rows, not unique inspections)
viol %>%
  ggplot(aes(x = estab_name)) +
  geom_bar() +
  labs(title = "Inspections Per Facility", x = "Facility Name", y = "Unique Inspection Count") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Same plot, California only
viol %>%
  filter(site_state == "CA") %>%
  ggplot(aes(x = estab_name)) +
  geom_bar() +
  labs(title = "Unique Inspections Per Facility in California", x = "Facility Name", y = "Unique Inspection Count") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Summary statistics for the number-exposed variable
viol %>% select(nr_exposed) %>% summary()

# Total number of individuals exposed per establishment
viol %>%
  ggplot(aes(x = estab_name, y = nr_exposed)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Individuals Per Establishment", x = "Facility Name", y = "Number of Individuals") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Same plot, California only
viol %>%
  filter(site_state == "CA") %>%
  ggplot(aes(x = estab_name, y = nr_exposed)) +
  geom_bar(stat = "identity") +
  labs(title = "History of Number of Individuals Exposed Per Establishment", x = "Facility Name", y = "Number of Individuals") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Records per year and missing values per column
table(viol$year)
colSums(is.na(viol))
```
Note: I have also pushed my code to my forked version of the repo, but I thought I would copy and paste it here too.
This is rad! Would you want to paste some of the plots you made in this thread? And then maybe you can share some of your work at the meeting on Monday?
I'd love to! Will figure out how to do that soon!
Hi @nathanqtran922 @shapironick! We had our first Hack4CA meeting yesterday since spring quarter. I wanted to check in with you two about any summer progress on the OSHA stuff and how to coordinate our work going forward. From this thread it looks like Nathan accomplished some cool stuff in May. I'm trying to re-orient myself to this project after working on other things this summer. I am populating the wiki with the data quality issues we uncovered last spring, adding some issues/to-dos, and getting my descriptive analysis file from last spring up on GitHub as well. I apologize; I thought I had put it up there before, but apparently not. I also started a separate branch called savannahhunter where I will work before pushing stuff to the master branch. I also see Nathan said he forked the repository, but I didn't see his code, so if you know where I can check that out, let me know. I want to make sure we are coordinating our work and not duplicating it. Thanks!
Hi Savannah! Exciting! I wasn't sure if this project would continue and I'm happy to hear that it will. I won't be attending H4CA meetings this quarter but look forward to keeping up-to-date asynchronously. I think Nathan copied and pasted all of his code into this thread so it's all here!
Just a couple of questions to make sure all this work is worth your time: What kind of publication do you see yourself working towards with this data? What kind of arguments can be made with this compromised data? These are meant to be helpful/reflective/strategizing questions, not snubby ones. No need to think quickly on these; I just wanted to check in and re-assess the trajectory of this arm of the work. Sending my best, Nick
We need to think about how we want to operationalize toxicity in prisons using the OSHA data. This involves looking through the data dictionary for the variables available in the datasets and identifying variables that may help us understand which facilities might be exposing prisoners to toxic or hazardous conditions.
We could think about this in a variety of ways. Do we want to count the number of inspections? The number of violations? Do we want to define toxicity by how much money the prison facility was charged for the violation? Do we want to look at the standards cited? Do we want to look at the number of workers exposed? Potentially we may want to do all these things. Let's start with a list of potential variables that we might want to look at. And then we can make some decisions.