annarupert commented 11 months ago

Here is the issue for data! I am writing the first draft now but need a few things from the group:

Guidance: "Describe your data. Where you got it from, how it was generated, what variables you’ll use, what data cleaning steps you had to take, where your processed data, code and documentation is stored. In a published paper, a lot of this detail will be in a data appendix. For the purposes of this report, include it all here (this may be the longest section of your report)"

Since we made a substantial edit to the data, we need to summarize what changes we actually made. @wmwaghor could you please send a 3-4 sentence overview of what you did with the data and why it was beneficial for our project?

Other than that, our data section is super straightforward. Thanks, and feel free to add any ideas in this repo @ecn310/accidentsteam

annarupert commented 11 months ago

Data Draft 1 (without Will's explanation of his data fix)

The data used for this research project come from the Occupational Safety and Health Administration (OSHA). OSHA collects data from firms each year to measure workplace safety. The data are cross-sectional, with more than 20 variables across approximately 350,000 establishments. We obtained the data through Professor Singleton’s work on workplace safety and violations.

We are testing the relationship between firm size and number of injuries. For this research project, we use several of the variables provided and also generate our own variables to make the analysis more robust. Our first goal was to make sure the data were clean and would support usable conclusions. Using Stata, we checked that the data made logical sense and replaced values that were not possible to achieve and therefore must have been recording errors. These replacements made little to no difference in the final analysis.

After completing the data fix, we generated variables suited to our hypothesis. For example, we want to focus on the total_injuries variable, but bigger firms are bound to have more total injuries simply because they have more employees. To address this, we generated a new variable, Inj_Rate, which gives each establishment’s injury rate relative to its number of employees. We calculated it as the quotient of total_injuries over annual_average_employees. With this variable we can compare injury rates across all firm sizes rather than raw injury counts. We replicated this process for each of the injury types: skin disorders, respiratory issues, and poisonings. Professor Singleton used a similar method in his paper “The Effect of Workplace Inspections on Worker Safety”, and given the origin of our data and the guidance we received, we followed a similar approach.

The final change we made to the data was creating the variable decile. This variable takes annual_average_employees and separates establishments into ten groups based on percentile. It is the most efficient way to test our hypothesis because finding the relationship between firm size and workplace injuries first requires defining what “firm size” means. For our research, firm size is based on the number of employees and is separated into 10 groups. This gives a fuller picture of establishment size and the trends associated with it.

annarupert commented 11 months ago

Here are some things I plan on adding today:

Processed Data and Storage: Mention where the processed data, code, and documentation are stored. Provide details on the format of the processed data for transparency.
Data Cleaning: Discuss the data cleaning process, emphasizing any specific challenges encountered during the cleaning. Address any recording issues and elaborate on how they were resolved, emphasizing the impact on the final analysis.
Data Source: Explicitly mention that the data is sourced from the Occupational Safety and Health Administration (OSHA). Specify the nature of the OSHA data, such as its scope, coverage, and any unique characteristics.
Data Collection and Generation: Elaborate on how OSHA collects data across different firms each year to measure workplace safety. Provide insights into the nature of cross-sectional data and why it is suitable for the research question.

wmwaghor commented 11 months ago

Data Draft 2 (Will Edition)

The data used for this research project come from the Occupational Safety and Health Administration (OSHA). OSHA collects data from firms each year to measure workplace safety. The data are cross-sectional, with more than 20 variables across approximately 350,000 establishments. We obtained the data through Professor Singleton’s work on workplace safety and violations.

We are testing the relationship between firm size and number of injuries. For this research project, we use several of the variables provided and also generate our own variables to make the analysis more robust. Our first goal was to make sure the data were clean and would support usable conclusions. Using Stata, we checked that the data were logically consistent and removed values based on rules that identified combinations of variables that were impossible to achieve, which likely resulted from a procedural issue with reporting. For the variable annual_average_employees, we replaced a value with a null value if it was below zero, greater than or equal to total_hours_worked in the same entry, or equal to 123456. For the variable total_djtr_days, we replaced a value with a null value if it was below zero or greater than or equal to total_hours_worked in the same entry. These replacements made little to no difference in the final analysis.
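
The replacement rules above can be sketched in a few lines. This is an illustrative Python version, not the Stata code we actually ran; the helper name clean_row and the sample row are hypothetical, and the choice to skip the hours comparison when total_hours_worked is itself missing is an assumption.

```python
SENTINEL = 123456  # placeholder value treated as a reporting artifact


def clean_row(row):
    """Return a copy of row with impossible values replaced by None (null)."""
    cleaned = dict(row)
    hours = row["total_hours_worked"]
    employees = row["annual_average_employees"]
    djtr_days = row["total_djtr_days"]

    # Assumption: if hours is missing, only the other checks apply.
    def at_or_above_hours(value):
        return hours is not None and value >= hours

    if employees is not None and (
        employees < 0 or employees == SENTINEL or at_or_above_hours(employees)
    ):
        cleaned["annual_average_employees"] = None

    if djtr_days is not None and (djtr_days < 0 or at_or_above_hours(djtr_days)):
        cleaned["total_djtr_days"] = None

    return cleaned


# Hypothetical entry: more employees than hours worked is not possible.
example = {
    "annual_average_employees": 2000,
    "total_hours_worked": 1500,
    "total_djtr_days": 12,
}
print(clean_row(example))
```

Only the offending value is nulled, so the rest of the entry stays available for the analysis.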

After completing the data fix, we generated variables suited to our hypothesis. For example, we want to focus on the total_injuries variable, but bigger firms are bound to have more total injuries simply because they have more employees. To address this, we generated a new variable, Inj_Rate, which gives each establishment’s injury rate relative to its number of employees. We calculated it as the quotient of total_injuries over annual_average_employees. With this variable we can compare injury rates across all firm sizes rather than raw injury counts. We replicated this process for each of the injury types: skin disorders, respiratory issues, and poisonings. Professor Singleton used a similar method in his paper “The Effect of Workplace Inspections on Worker Safety”, and given the origin of our data and the guidance we received, we followed a similar approach.
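
As a small worked example of the rate calculation, here is a Python sketch rather than our Stata code. The function name injury_rate and the two establishments are made up for illustration, and returning a null when the employee count is missing or zero is an assumption.

```python
def injury_rate(total_injuries, annual_average_employees):
    """Injuries per employee, or None when the rate cannot be computed."""
    # Assumption: a missing or zero employee count yields a null rate.
    if total_injuries is None or not annual_average_employees:
        return None
    return total_injuries / annual_average_employees


# Two hypothetical establishments with very different raw injury counts.
small = injury_rate(total_injuries=6, annual_average_employees=120)
large = injury_rate(total_injuries=30, annual_average_employees=600)
print(small, large)
```

The larger establishment reports five times as many injuries, but both work out to the same per-employee rate, which is the comparison the raw counts hide.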

The final change we made to the data was creating the variable decile. This variable takes annual_average_employees and separates establishments into ten groups based on percentile. It is the most efficient way to test our hypothesis because finding the relationship between firm size and workplace injuries first requires defining what “firm size” means. For our research, firm size is based on the number of employees and is separated into 10 groups. This gives a fuller picture of establishment size and the trends associated with it.
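
One way to picture the decile grouping is the Python sketch below. It is an illustration under stated assumptions, not our Stata code: the function name assign_deciles is hypothetical, the cut points come from Python's statistics.quantiles (default method), and they may differ slightly from the cut points Stata produces.

```python
from bisect import bisect_left
from statistics import quantiles


def assign_deciles(employee_counts):
    """Map each employee count to a group from 1 to 10; None stays None."""
    observed = [x for x in employee_counts if x is not None]
    cuts = quantiles(observed, n=10)  # nine cut points between ten groups
    # A value at or below the first cut point lands in group 1, and so on.
    return [
        None if x is None else bisect_left(cuts, x) + 1
        for x in employee_counts
    ]


# Hypothetical establishments with 1 to 100 employees, plus one missing value.
counts = list(range(1, 101)) + [None]
deciles = assign_deciles(counts)
print(deciles[0], deciles[99], deciles[-1])
```

Entries whose employee count was nulled during cleaning stay out of every group instead of being forced into one.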

ecn310 / course-project-accidentsteam

First Draft Data #19