huijuechen closed this issue 5 years ago
Script 1, data cleaning complete (by Juno).

- `substance` and `source` types, grouping dates by quarter of year, and grouping volume into two size categories

Next up on the to-do list (for lab today):

- `sklearn` (script 3)
- `graphviz` to visualize the decision tree (script 4)

Script 3, model building.

- `sklearn` to build model in Python
- `sklearn` only takes in numbers
- TBD: what kind of summary to output to CSV
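Since `sklearn` only takes in numbers, the grouped columns from script 1 have to be turned into numeric features first. Here is a rough, standard-library-only sketch of one way to do that. It is not the code in script 3: the threshold `SIZE_THRESHOLD`, the helper names, and the sample values are all made up for illustration. In practice `pandas.get_dummies` or `sklearn.preprocessing.OneHotEncoder` would probably do the one-hot step.

```python
from datetime import date

# Hypothetical cut-off between the two size categories; the thread does not
# say where the split actually is.
SIZE_THRESHOLD = 100.0


def quarter_of_year(d):
    """Return the quarter (1 to 4) that a date falls in."""
    return (d.month - 1) // 3 + 1


def size_category(volume, threshold=SIZE_THRESHOLD):
    """Encode a volume as 0 (below the threshold) or 1 (at or above it)."""
    return 1 if volume >= threshold else 0


def one_hot(values):
    """One-hot encode a list of strings.

    Returns the sorted category names and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    index = {name: i for i, name in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


if __name__ == "__main__":
    substances = ["diesel", "crude oil", "diesel"]  # made-up sample values
    names, encoded = one_hot(substances)
    print(names)
    print(encoded)
    print(quarter_of_year(date(2017, 8, 15)), size_category(250.0))
```

One-hot encoding avoids implying an order between substances or sources, which a plain integer code per category would do.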
Script 2 is complete and working from the command line. I still need to clean up the file, which I will finish later tonight and then send a pull request.
Script 3, model building, is done.
Output:
Script 4, model visualization, is done.
Output:
TO DO:
For the final report:
To convert the decision tree PDF to PNG, I tested the command below from the command line and it worked:
sips -s format png results/oil_spills_model.pdf --out results/oil_spills_model.png
I added the command above to the end of the command-line instructions in README.md.
The PNG `results/oil_spills_model.png` is now ready to use in the final report.
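As far as I know, `sips` ships only with macOS, so the conversion step will fail on other systems. If that becomes a problem, a small wrapper along these lines could skip the step when the tool is missing. This is only a sketch: the function names `sips_png_command` and `convert_pdf_to_png` are hypothetical and nothing like this is in the repo yet. The arguments are exactly the ones tested above.

```python
import os
import shutil
import subprocess


def sips_png_command(pdf_path):
    """Build the sips argument list that writes a PNG next to the PDF."""
    png_path = os.path.splitext(pdf_path)[0] + ".png"
    return ["sips", "-s", "format", "png", pdf_path, "--out", png_path]


def convert_pdf_to_png(pdf_path):
    """Run the conversion; return False when sips is not on the PATH."""
    if shutil.which("sips") is None:
        return False
    # check=True raises CalledProcessError if sips exits with an error.
    subprocess.run(sips_png_command(pdf_path), check=True)
    return True


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe anywhere.
    print(" ".join(sips_png_command("results/oil_spills_model.pdf")))
```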
Done.
Added the `run_all.sh` file, and added the command for running it to the README.
For the first script (data cleaning), I can currently think of three things to do:

- Remove all the rows with null/empty values (mostly in the `substance`, `volume`, and `volume_units` columns)
- Convert volume and volume units into a consistent number format
- Put all the dates in ISO format

TBD: whether "unknown" should be removed from the `source` column.
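To make the three steps concrete, here is a rough sketch of what cleaning a single row could look like using only the standard library. Several things are assumptions, not facts about our data: the `date` column name, the raw date layout in `RAW_DATE_FORMAT`, and the unit names in `TO_LITRES`. It also drops rows with an unrecognized unit or a non-numeric volume, which is a choice we would need to agree on. The `source` column is left untouched because the "unknown" question is still open.

```python
from datetime import datetime

REQUIRED = ("substance", "volume", "volume_units")

# Assumed unit spellings, with factors to litres (US gallon, 42-gallon barrel).
TO_LITRES = {"litres": 1.0, "gallons": 3.785411784, "barrels": 158.987294928}

# Assumed layout of the raw dates; the real data may use a different one.
RAW_DATE_FORMAT = "%m/%d/%Y"


def clean_row(row):
    """Return a cleaned copy of a row (a dict of strings), or None to drop it."""
    # 1. Drop rows with null/empty values in the required columns.
    if any(not (row.get(col) or "").strip() for col in REQUIRED):
        return None

    # 2. Express every volume as a float in a single unit.
    units = row["volume_units"].strip().lower()
    if units not in TO_LITRES:
        return None
    try:
        volume = float(row["volume"].replace(",", ""))
    except ValueError:
        return None

    cleaned = dict(row)
    cleaned["volume"] = volume * TO_LITRES[units]
    cleaned["volume_units"] = "litres"

    # 3. Rewrite the date as ISO 8601 (YYYY-MM-DD).
    parsed = datetime.strptime(row["date"], RAW_DATE_FORMAT)
    cleaned["date"] = parsed.date().isoformat()
    return cleaned


if __name__ == "__main__":
    sample = {"substance": "diesel", "volume": "1,000",
              "volume_units": "Gallons", "date": "08/15/2017",
              "source": "unknown"}  # made-up row
    print(clean_row(sample))
```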