UBC-MDS / DSCI_522_Alberta-Oil-Spills


PROJECT UPDATE: Question Change, Proposal Update #9

Closed alyciakb closed 5 years ago

alyciakb commented 5 years ago

Plan change explanation:

Our original proposal involved five hypothesis tests to see whether oil spill incidents caused by equipment failure differed significantly from those caused by operator error on five factors: time of year, location, source of spill, substance released, and volume released.

This week, on further review of the data, we realized that because our data are categorical with more than two categories per factor, we would need to run multiple hypothesis tests per factor (over 20 hypothesis tests in total). Multiple testing on that scale would greatly increase our chance of a false positive result.
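To illustrate the multiple-testing concern, here is a minimal sketch of the arithmetic. It assumes a per-test significance level of 0.05 and independent tests, neither of which is stated in the proposal, and takes 20 tests as a stand-in for "over 20". Under those assumptions the chance of at least one false positive is roughly 64%.

```python
# Family-wise error rate for m independent tests, each run at level alpha.
alpha = 0.05  # assumed per-test significance level (not stated in the proposal)
m = 20        # stand-in for "over 20 hypothesis tests"

# Probability that at least one of the m tests gives a false positive.
fwer = 1 - (1 - alpha) ** m

# A Bonferroni correction would shrink the per-test level to compensate.
bonferroni_alpha = alpha / m

print(f"P(at least one false positive) = {fwer:.3f}")
print(f"Bonferroni-adjusted per-test level = {bonferroni_alpha:.4f}")
```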

As multi-variate ANOVA testing is still new to us and we do not yet feel comfortable with it, we decided to change our question and analysis plan. We spoke with Tiffany and she approved it.

New analysis question:

What are the three strongest predictors of the cause type of an oil spill incident in Alberta?

This is a predictive question.

Analysis Plan

The data set we will be using requires cleaning prior to the analysis. We will first remove rows that have NULL, empty, or "Unknown" values. We will then group categories within each feature to reduce the number of categories per feature. For example, several different types of water can be released in a spill; we will group those into one general "water" category.
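The cleaning steps could look roughly like the sketch below, using pandas. The column names, category labels, and sample rows are all hypothetical, since the proposal does not show the data's schema, and matching water types by the substring "water" is only one possible grouping rule.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names and category labels are assumptions.
raw = pd.DataFrame({
    "substance": ["Crude Oil", "Salt Water", "Unknown",
                  "Fresh Water", "", "Gas Production"],
    "cause": ["Equipment Failure", "Operator Error", "Equipment Failure",
              "Operator Error", "Equipment Failure", None],
})


def clean_spills(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with NULL, empty, or "Unknown" values, then group water types."""
    # Treat empty strings and "Unknown" the same way as NULL, then drop them.
    cleaned = df.replace({"": np.nan, "Unknown": np.nan}).dropna()
    # Collapse every water-type label into one general "Water" category.
    is_water = cleaned["substance"].str.contains("water", case=False)
    cleaned = cleaned.assign(
        substance=cleaned["substance"].where(~is_water, "Water")
    )
    return cleaned.reset_index(drop=True)


spills = clean_spills(raw)
print(spills)
```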

Next, we will divide our data into a training set and a test set and build our decision tree using sklearn in Python.

The features are:

  1. Spill location (by field office)
  2. Time of year (by quarter)
  3. Source (well, pipeline, battery)
  4. Type of substance released (oil, gas, water)
  5. Volume released (small spill - less than 10 cubic metres spilled, or large spill - over 10 cubic metres spilled)

The targets are the oil spill causes:

  1. Equipment failure
  2. Operator error
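A minimal sketch of the split-and-fit step with the five features and the binary target listed above. The data here are randomly generated stand-ins, and the column names, the 80/20 split, the one-hot encoding, and `max_depth=3` are all assumptions rather than choices made in the proposal.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; every column name and category label is hypothetical.
rng = np.random.default_rng(522)
n = 400
spills = pd.DataFrame({
    "field_office": rng.choice(["Office A", "Office B", "Office C"], n),
    "quarter": rng.choice(["Q1", "Q2", "Q3", "Q4"], n),
    "source": rng.choice(["Well", "Pipeline", "Battery"], n),
    "substance": rng.choice(["Oil", "Gas", "Water"], n),
    "volume": rng.choice(["Small", "Large"], n),
    "cause": rng.choice(["Equipment Failure", "Operator Error"], n),
})

# sklearn trees need numeric input, so one-hot encode the categorical features.
X = pd.get_dummies(spills.drop(columns="cause"))
y = spills["cause"]

# Hold out 20% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=522, stratify=y
)

# Fit a shallow tree; max_depth=3 is an illustrative choice.
tree = DecisionTreeClassifier(max_depth=3, random_state=522)
tree.fit(X_train, y_train)

train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}, test accuracy: {test_accuracy:.2f}")
```

Because the stand-in labels are random, the accuracies printed here carry no meaning; the sketch only shows the shape of the workflow.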

We will then take the first three features that the decision tree splits on as the top predictors of the cause of an oil spill, and report them along with the accuracy of our decision tree.
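One way to make the ranking of predictors concrete is sklearn's impurity-based `feature_importances_`, summed over the one-hot columns that belong to each original feature. This is a swapped-in technique offered as a sketch, not necessarily the ranking rule the plan intends. The data are synthetic, the column names are hypothetical, and the target is deliberately constructed to depend on `source` so that the ranking has something to find.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features; names and labels are hypothetical.
rng = np.random.default_rng(0)
n = 500
features = pd.DataFrame({
    "location": rng.choice(["Office A", "Office B", "Office C"], n),
    "quarter": rng.choice(["Q1", "Q2", "Q3", "Q4"], n),
    "source": rng.choice(["Well", "Pipeline", "Battery"], n),
    "substance": rng.choice(["Oil", "Gas", "Water"], n),
    "volume": rng.choice(["Small", "Large"], n),
})

# Synthetic target driven by "source", with about 10% of labels flipped as noise.
is_operator_error = (features["source"] == "Well").to_numpy() ^ (rng.random(n) < 0.1)
cause = np.where(is_operator_error, "Operator Error", "Equipment Failure")

# One-hot encode with "=" as separator so column names look like "source=Well".
X = pd.get_dummies(features, prefix_sep="=")
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, cause)

# Sum the importances of the one-hot columns back to their original feature.
importances = pd.Series(tree.feature_importances_, index=X.columns)
ranking = (
    importances.groupby(lambda column: column.split("=")[0])
    .sum()
    .sort_values(ascending=False)
)
top_three = ranking.head(3)
print(ranking)
```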

Analysis Presentation

Along with a written summary of our findings from above, we will include:

  1. Data visualization graphs for each of the features
  2. A drawing of the decision tree using graphviz in Python
  3. A table ranking the features from strongest to weakest predictor
  4. A graph visualizing the accuracy of our decision tree on both the training data and the test data
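The tree drawing and the accuracy comparison could be prepared along the lines of the sketch below. It uses sklearn's `export_graphviz` to produce DOT source, which graphviz can render, and collects training and test accuracy in a table ready for plotting. The data are synthetic, the column names are hypothetical, and comparing accuracy across several `max_depth` values is an assumption about what the accuracy graph would show.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data; names and labels are hypothetical.
rng = np.random.default_rng(1)
n = 300
features = pd.DataFrame({
    "source": rng.choice(["Well", "Pipeline", "Battery"], n),
    "volume": rng.choice(["Small", "Large"], n),
})
is_equipment = (features["source"] == "Pipeline").to_numpy() ^ (rng.random(n) < 0.15)
cause = np.where(is_equipment, "Equipment Failure", "Operator Error")

X = pd.get_dummies(features)
X_train, X_test, y_train, y_test = train_test_split(
    X, cause, test_size=0.2, random_state=1
)

# Training and test accuracy for a range of tree depths, one row per depth.
rows = []
for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    rows.append({
        "max_depth": depth,
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
    })
accuracy_table = pd.DataFrame(rows)
print(accuracy_table)

# DOT source for one fitted tree; out_file=None returns it as a string.
tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_train, y_train)
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=list(X.columns),
    class_names=[str(label) for label in tree.classes_],
    filled=True,
)
print(dot_source[:200])
```

The `dot_source` string can be handed to the graphviz tooling to draw the tree, and `accuracy_table` can be passed to whichever plotting library the project uses.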