Alycia Butterworth: alyciakb

Huijue (Juno) Chen: huijuechen

The goal of this project is to find the top predictors of the cause of a spill incident in the Alberta oil industry. Using data from the Energy Resources Conservation Board (ERCB), provided by the City of Edmonton, we reviewed spills that occurred between 2002 and 2013 and fit a machine learning model (a decision tree) to predict the spill cause as either (1) Equipment Failure or (2) Operator Error based on five factors: (1) the field house location of the spill, (2) the time of year it occurred, (3) the source of the spill (well, pipeline, or battery), (4) the substance spilled (oil, gas production, or water), and (5) the volume spilled. We obtained a Gini score for each factor and ranked the factors accordingly.
We use the following columns from the data:

Column name | Data type | Description |
---|---|---|
cause | String | Identifier for a particular cause of spill |
source | String | Equipment source of the spill incident |
location | String | Location of the spill incident |
substance | String | Substance spilled in the incident |
volume | Numeric | Volume of the substance spilled (unit: cubic metre) |
year_quarter | Numeric | Quarter of the year in which the spill occurred |
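
A minimal sketch of how a decision tree and its Gini-based feature importances could be obtained with scikit-learn, using the column names from the table above. The rows are synthetic stand-ins rather than ERCB data, and the category values, the integer encoding of the string columns, and `max_depth=3` are all assumptions made for illustration. This is not the project's actual fitting script.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in rows (NOT the ERCB data); category values are illustrative.
rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    "source": rng.choice(["well", "pipeline", "battery"], size=n),
    "location": rng.choice(["field_house_a", "field_house_b"], size=n),
    "substance": rng.choice(["oil", "gas production", "water"], size=n),
    "volume": rng.uniform(0.1, 50.0, size=n),
    "year_quarter": rng.randint(1, 5, size=n),  # quarters 1 to 4
})
# Toy label rule so the example has a learnable signal (assumption).
df["cause"] = np.where(df["volume"] > 25, "Equipment Failure", "Operator Error")

features = ["source", "location", "substance", "volume", "year_quarter"]

# Encode string columns as integer codes (assumed encoding; one-hot is another option).
X = df[features].copy()
for col in ["source", "location", "substance"]:
    X[col] = X[col].astype("category").cat.codes
y = df["cause"]

# Fit a shallow tree that splits on Gini impurity.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(tree.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the toy label depends only on `volume`, that feature ranks first here; on real data the ranking depends on the spills themselves.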
Our findings showed that the top three predictors for the cause of an oil spill in Alberta are:
The full final report, which includes our analysis and interpretation, limitations, and future directions, can be found in the doc folder. The report also summarizes and visualizes our original data, visualizes the decision tree model created for this analysis, and discusses its accuracy and the results. This analysis can be fully reproduced from a terminal. Please see the Procedure section below for instructions.
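
The outputs listed further down mention a 10-fold cross-validation over tree depths and a saved final model. The following is a hedged sketch of how such a comparison and a save/load round trip might look with scikit-learn and `pickle`. The data is synthetic, the depth range of 1 to 10 is an assumption, and the use of `pickle` for the saved model is an assumption, not something confirmed by this README.

```python
import pickle

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the cleaned spill features (assumption).
rng = np.random.RandomState(0)
X = rng.uniform(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Mean 10-fold cross-validation accuracy for each candidate depth.
depths = range(1, 11)
mean_scores = {}
for depth in depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(model, X, y, cv=10)
    mean_scores[depth] = scores.mean()

# Pick the depth with the highest mean accuracy and refit on all rows.
best_depth = max(mean_scores, key=mean_scores.get)
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X, y)

# Serialize and restore in memory; writing the bytes to a file works the same way.
payload = pickle.dumps(final_model)
restored = pickle.loads(payload)
print(best_depth, mean_scores[best_depth])
```

Plotting `mean_scores` against depth would give a figure comparable in spirit to a depth comparison plot.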
To run this analysis yourself, the scripts need to be run in the following order:
- `sklearn` to fit a model in `3_model_fitting.py`

The output files created when running this report are:

- `clean_data.csv` file that is used for the rest of the analysis.
- `img` folder that help the user visualize the data we are using for the analysis.
- `results` folder.
- `depth_compare.png`, that shows the results of the 10-fold cross-validation, saved in the `results` folder.
- `results` folder as `final_model.sav`.
- `spills_tree_model.png` a png file in the `results` folder.
- `doc` folder.

This report can be reproduced using Docker. To run the analysis using Docker:
docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/alberta_oil_spills alyciakb/dsci_522_alberta-oil-spills make -C '/home/alberta_oil_spills' all
docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/alberta_oil_spills alyciakb/dsci_522_alberta-oil-spills make -C '/home/alberta_oil_spills' clean
This report can be run in the Docker environment or without Docker. To run this report directly from your terminal, we have provided a `Makefile` that will run the scripts in the correct order and produce all output. To run the analysis using the `Makefile`, type the following command in your terminal: `make all`.

To remove the output generated by the `Makefile`, type the following command in your terminal: `make clean`.

Dependency diagram of Makefile:

R & R Libraries:
R (version 3.5.1)
rmarkdown (version 1.10)
tidyverse (version 1.2.1)
lubridate (version 1.7.4)
gridExtra (version 2.3)
Python & Python Libraries:
Python (version 3.7.1)
numpy (version 1.15.1)
pandas (version 0.23.2)
seaborn (version 0.9.0)
matplotlib (version 3.0.2)
scikit-learn/sklearn (version 0.20.1)
graphviz (version 0.10.1)
Version | Description |
---|---|
V1.0 | Initial Project Proposal |
V2.0 | Milestone 1 - Project Analysis First Draft Complete |
V3.0 | Milestone 2 - Makefile Automation Added & Project Analysis Edits |