These are such great questions! Some quick guiding thoughts:
Money: The way they set the fines is usually pretty arbitrary; it's a similar process to how the EPA administers fines. So I wouldn't use that as a proxy for toxicity/harm. But looking at how fines are distributed on their own is going to be fascinating!
Inspections: Inspections are less an index of toxicity and more an index of attention by the state. We might want to also see if inspection # is correlated with violation #.
Number of workers exposed: My guess is that this is a very conservative (non-representative) number, but more research would be needed. It might be that they cite based on how many workers were seen to be exposed at the time of inspection rather than all the people potentially exposed.
Number of violations: Yes!
Standards cited: This will take some time and is going to be a long, hard process. I would only unpack this after the preliminary violation analysis (just to make sure the data is worthwhile).
Here's the Google Doc we're working off of: https://docs.google.com/document/d/1jAeEZE8QrR86DcsiqfHyJC7dJKJuhn24/edit
Code to import the data:

```r
viol <- read.csv(file = "Cleaned_Data/prison_insp_viol_2010.csv", header = TRUE, stringsAsFactors = FALSE)
```
So everything but violations, as Savannah will be doing the violation analysis!
I just wanted to post some of the descriptive analysis I've been doing here. Specifically, I started with the number-exposed variable because it seemed like an easy way to get my feet wet.
```r
ggplot(data = viol) +
  geom_histogram(mapping = aes(x = nr_exposed), binwidth = 0.5, na.rm = TRUE)

max(viol$nr_exposed, na.rm = TRUE)
min(viol$nr_exposed, na.rm = TRUE)
mean(viol$nr_exposed, na.rm = TRUE)
median(viol$nr_exposed, na.rm = TRUE)
```
The max is 7500, the min is 0, the mean is 123.87, and the median is 8. I was thinking about finding the standard deviation, skew, etc., but I don't know how helpful that is, seeing as I'm just looking at the count. Is the goal to try to find a correlation between two variables, e.g., number exposed and gravity?
My thinking is that if gravity represents how serious a violation is, we could see how much exposure plays into that rating.
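In case it's useful, here's a rough sketch of how that comparison might look, assuming the merged data has a numeric `gravity` column (I haven't checked the column name or type against the data dictionary, so treat this as a guess):

```r
# Sketch only: `gravity` is assumed to exist and to be numeric in the merged data
sd(viol$nr_exposed, na.rm = TRUE)   # spread of the exposure counts

# Rank-based correlation is probably safer given how right-skewed nr_exposed is
cor(viol$gravity, viol$nr_exposed, use = "complete.obs", method = "spearman")

# Quick scatter plot of gravity against number exposed
viol %>%
  filter(!is.na(gravity), !is.na(nr_exposed)) %>%
  ggplot(aes(x = gravity, y = nr_exposed)) +
  geom_point(alpha = 0.3) +
  labs(title = "Gravity vs. Number Exposed", x = "Gravity", y = "Number Exposed")
```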
This is great! I think zooming in on some of those higher numbers of exposed people will be interesting later on. For example, we could zoom in on anything over 100 people, as that would likely include incarcerated workers (one way to build a rough proxy for which issues relate to incarcerated people).
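Something like this could pull those records out (purely a sketch; the 100-person cutoff is just the rough heuristic above, not anything validated):

```r
# Violation records where more than 100 people were recorded as exposed --
# a rough proxy for issues likely involving incarcerated workers
high_exposure <- viol %>%
  filter(!is.na(nr_exposed), nr_exposed > 100) %>%
  arrange(desc(nr_exposed))

nrow(high_exposure)                  # how many such records there are
head(high_exposure$estab_name, 10)   # which facilities they come from
```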
Maybe you can experiment with some histogram plots? You can play around with the bin size to see how that affects the plot.
Here's a histogram w/ the mean plotted on top of it
```r
library(ggpubr)  # gghistogram() comes from the ggpubr package

gghistogram(my_data, x = "col_name", bins = 9, add = "mean")
```
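If you'd rather stick with plain ggplot2, one quick way to experiment is to loop over a few binwidths and compare the shapes (the binwidth values here are arbitrary, just to see how the plot changes):

```r
library(ggplot2)

# Try a few binwidths on the same variable and compare the resulting shapes
for (bw in c(1, 10, 100)) {
  print(
    ggplot(viol, aes(x = nr_exposed)) +
      geom_histogram(binwidth = bw, na.rm = TRUE) +
      ggtitle(paste("nr_exposed, binwidth =", bw))
  )
}
```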
Right now we are on a very basic level of exploration! Gotta keep it basic for now; otherwise we won't know what we're doing when we get deeper into the data and it will be meaningless. We can get to the analysis part a bit more once we learn more about the dataset. Now that Savannah has amazingly added the facility ID #s and names, it could be very important to see just how many unique facilities there are. We can compare that number to the number of non-state facilities in the HIFLD database, and that should give us an understanding of the coverage of this dataset, which is super important. https://stackoverflow.com/questions/41906878/r-number-of-unique-values-in-a-column-of-data-frame
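A minimal sketch of that unique-facility count and coverage comparison (the HIFLD total is a placeholder; I don't have the real number in front of me):

```r
library(dplyr)

# How many unique facility names appear in the OSHA data
n_facilities <- n_distinct(viol$estab_name)
n_facilities

# Rough coverage estimate: replace the placeholder with the actual count of
# non-state facilities pulled from the HIFLD database
hifld_total <- NA_integer_  # placeholder
n_facilities / hifld_total
```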
What other basic things would we want to know? (Try to come up with some of these and perform them on your own.)
Also a histogram of how many unique inspections there are per facility would be helpful.
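A sketch of that per-facility inspection histogram, assuming the merged table still carries OSHA's inspection identifier as `activity_nr` (swap in the real column name if it differs):

```r
library(dplyr)
library(ggplot2)

# Count unique inspections per facility, then look at the distribution
insp_per_facility <- viol %>%
  group_by(estab_name) %>%
  summarize(n_inspections = n_distinct(activity_nr)) %>%
  ungroup()

ggplot(insp_per_facility, aes(x = n_inspections)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Unique Inspections per Facility",
       x = "Number of unique inspections", y = "Number of facilities")
```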
You could start a Google Doc in the Drive and keep a log of your explorations, then share them with us on Tuesday (and/or post a link here). Once you feel like you've got a good understanding of the basics of the data, you can move on to things like gravity vs. number exposed, and then you can tell us what you think that plot tells us. This is about you learning how to do this and us learning about the database at the same time, so don't worry if you come back with more questions than answers :)
This is really encouraging Nick! Thanks so much for your help as always :)
```r
# Import the cleaned inspection/violation data
viol <- read.csv("prison_insp_viol_2010.csv", header = TRUE, stringsAsFactors = FALSE)
View(viol)

library(tidyverse)
library(lubridate)
library(shiny)
library(shinydashboard)

# How many unique establishments are in the data?
unique_establ <- viol %>% select(estab_name) %>% n_distinct()
unique_establ

# Record counts per establishment
table(viol$estab_name)
barplot(table(viol$estab_name))

# Total current and initial penalty amounts by state
viol %>%
  group_by(site_state) %>%
  summarize(current_penalty_cost = sum(current_penalty, na.rm = TRUE),
            initial_penalty_cost = sum(initial_penalty, na.rm = TRUE)) %>%
  ungroup()

# Total current and initial penalty amounts by establishment
viol %>%
  group_by(estab_name) %>%
  summarize(current_penalty_cost = sum(current_penalty, na.rm = TRUE),
            initial_penalty_cost = sum(initial_penalty, na.rm = TRUE)) %>%
  ungroup()

# Records per facility (note: geom_bar() counts rows, not unique inspections)
viol %>%
  ggplot(aes(x = estab_name)) +
  geom_bar() +
  labs(title = "Inspections Per Facility", x = "Facility Name", y = "Unique Inspection Count") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Same plot, California only
viol %>%
  filter(site_state == "CA") %>%
  ggplot(aes(x = estab_name)) +
  geom_bar() +
  labs(title = "Unique Inspections Per Facility in California", x = "Facility Name", y = "Unique Inspection Count") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Summary statistics for the number-exposed variable
viol %>% select(nr_exposed) %>% summary()

# Total number of individuals exposed per establishment
viol %>%
  ggplot(aes(x = estab_name, y = nr_exposed)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Individuals Per Establishment", x = "Facility Name", y = "Number of Individuals") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Same plot, California only
viol %>%
  filter(site_state == "CA") %>%
  ggplot(aes(x = estab_name, y = nr_exposed)) +
  geom_bar(stat = "identity") +
  labs(title = "History of Number of Individuals Exposed Per Establishment", x = "Facility Name", y = "Number of Individuals") +
  theme_bw() +
  theme(text = element_text(size = 4), axis.text.x = element_text(angle = 90))

# Records per year and missing values per column
table(viol$year)
colSums(is.na(viol))
```
Note: I have also pushed my code to my forked version of the repo, but I thought I would copy and paste it here too.
This is rad! Would you want to paste some of the plots you made in this thread? And then maybe you can share some of your work at the meeting on Monday?
I'd love to! Will figure out how to do that soon!
Hi @nathanqtran922 @shapironick! We had our first Hack4CA meeting yesterday since spring quarter. I wanted to check in with you two about any summer progress on the OSHA stuff and how to coordinate our work going forward. From this thread it looks like Nathan accomplished some cool stuff in May. I'm trying to re-orient myself to this project after working on other things this summer. I am populating the wiki with the data quality issues we uncovered last spring, adding some issues/to-dos, and getting my descriptive analysis file from last spring up on GitHub as well. I apologize; I thought I had put it up there before, but apparently not. I also started a separate branch called savannahhunter where I will work before pushing stuff to the master branch. I also see Nathan said he forked the repository, but I didn't see his code, so if you know where I can check that out, let me know. I want to make sure we are coordinating our work and not duplicating it. Thanks!
Hi Savannah! Exciting! I wasn't sure if this project would continue and I'm happy to hear that it will. I won't be attending H4CA meetings this quarter but look forward to keeping up-to-date asynchronously. I think Nathan copied and pasted all of his code into this thread so it's all here!
Just a couple of questions to make sure all this work is worth your time: What kind of publication do you see yourself working towards with this data? What kind of arguments can be made with this compromised data? These are meant to be helpful/reflective/strategizing questions, not snubby ones. No need to think quickly on these; I just wanted to check in and re-assess the trajectory of this arm of the work. Sending my best, Nick
We need to think about how we want to operationalize toxicity in prisons using the OSHA data. This involves looking through the data dictionary for the variables available in the datasets and identifying variables that may help us understand which facilities might be exposing prisoners to toxic or hazardous conditions.
We could think about this in a variety of ways. Do we want to count the number of inspections? The number of violations? Do we want to define toxicity by how much money the prison facility was charged for the violation? Do we want to look at the standards cited? Do we want to look at the number of workers exposed? Potentially we may want to do all these things. Let's start with a list of potential variables that we might want to look at. And then we can make some decisions.