To-do for the week of Nov 24th

huijuechen commented 5 years ago

For the first script of cleaning data, right now I can think of three things to do:

Remove all the rows with null/empty values (mostly in the columns of substance, volume, volume_units)
Convert volume and volume units to a consistent numeric format
Put all the dates in the ISO format

TBD: whether rows with "unknown" in the source column should be removed.
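A minimal sketch of the three cleaning steps listed above, assuming Python with pandas. The sample rows, the `clean` helper, the single `MM/DD/YYYY` input date format, and the choice of litres as the common unit are all assumptions made for illustration, not details of the real dataset:

```python
import pandas as pd

# Hypothetical sample rows; only the column names substance, volume and
# volume_units come from the discussion above.
raw = pd.DataFrame({
    "substance": ["Crude Oil", None, "Salt Water", ""],
    "volume": ["1,200", "3", "0.5", "7"],
    "volume_units": ["L", "m3", "m3", "L"],
    "date": ["11/24/2019", "01/05/2018", "07/30/2017", "03/14/2016"],
})


def clean(df):
    required = ["substance", "volume", "volume_units"]
    out = df.copy()
    # Treat empty or whitespace-only strings as missing, then drop those rows.
    out[required] = out[required].replace(r"^\s*$", float("nan"), regex=True)
    out = out.dropna(subset=required).copy()
    # Strip thousands separators and convert the volume to a float.
    out["volume"] = out["volume"].str.replace(",", "", regex=False).astype(float)
    # Express every volume in litres (assumes only "L" and "m3" occur).
    to_litres = {"L": 1.0, "m3": 1000.0}
    out["volume"] = out["volume"] * out["volume_units"].map(to_litres)
    out["volume_units"] = "L"
    # Rewrite dates as ISO 8601 (assumes MM/DD/YYYY input).
    parsed = pd.to_datetime(out["date"], format="%m/%d/%Y")
    out["date"] = parsed.dt.strftime("%Y-%m-%d")
    return out.reset_index(drop=True)


cleaned = clean(raw)
```

If the source data mixes several date formats or units, the parsing and conversion steps would need to be extended accordingly.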

alyciakb commented 5 years ago

Script 1, data cleaning complete (by Juno).

all NULL and empty rows removed
all rows with "Unknown" removed
converted volumes and dates to more usable formats
reduced the number of categories per variable by grouping and generalizing substance and source types, grouping dates by quarter of the year, and grouping volume into two size categories
script committed and pushed to git
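The grouping steps above could be sketched roughly as follows. This is only an illustration: the threshold, the category labels, and the substance groups are hypothetical placeholders, not the values used in the actual script.

```python
from datetime import date

# Hypothetical cut-off between the two size categories (same unit as volume).
SIZE_THRESHOLD = 1000.0

# Hypothetical mapping from detailed substance names to broader groups.
SUBSTANCE_GROUPS = {
    "Crude Oil": "oil",
    "Condensate": "oil",
    "Salt Water": "water",
}


def quarter_of_year(iso_date):
    """Map an ISO date string such as '2019-11-24' to 'Q1'..'Q4'."""
    month = date.fromisoformat(iso_date).month
    return f"Q{(month - 1) // 3 + 1}"


def size_category(volume, threshold=SIZE_THRESHOLD):
    """Split volumes into two categories around the threshold."""
    return "large" if volume >= threshold else "small"


def substance_group(name):
    """Collapse a detailed substance name into a broader group."""
    return SUBSTANCE_GROUPS.get(name, "other")
```

For example, `quarter_of_year("2019-11-24")` gives `"Q4"` and `size_category(500.0)` gives `"small"` under the placeholder threshold.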

alyciakb commented 5 years ago

Next up on the to-do list (for lab today):

Update README to reflect the new project proposal and add the current dependencies list
Create graphs to visualize our data (script 2)
Create and fit decision tree in Python using sklearn (script 3)
Rank the features from most to least predictive (script 3)
Test the tree accuracy (script 3)
Use Python and graphviz to visualize the decision tree (script 4)
Create a table(s) that shows the rankings and report the tree accuracy (script 4)

huijuechen commented 5 years ago

Script 3, model building.

using sklearn to build model in Python
reassigned numerical values to all the categorical values, as sklearn only takes in numbers
will use KNN

TBD: what kind of summary to output to CSV
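As a rough sketch of the numeric re-encoding mentioned above: one simple approach is to give each distinct category an integer code and keep the mapping so it can be written out later. The helper names and the sample `sources` values are hypothetical.

```python
import csv
import io


def build_mapping(values):
    """Assign 0, 1, 2, ... to the distinct values in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def mapping_to_csv(mapping):
    """Serialize a label -> code mapping as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["label", "code"])
    writer.writerows(mapping.items())
    return buf.getvalue()


# Hypothetical values for one categorical column.
sources = ["pipeline", "well", "pipeline", "tank"]
source_map = build_mapping(sources)
encoded = [source_map[s] for s in sources]
mapping_csv = mapping_to_csv(source_map)
```

sklearn's `OrdinalEncoder` provides similar behaviour; the hand-rolled version is shown here only because it makes the stored mapping explicit.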

alyciakb commented 5 years ago

Script 2 is complete and working from the command line. I still need to clean up the file, which I will finish later tonight and then send a pull request.

huijuechen commented 5 years ago

Script 3, model building is done

using sklearn to build model in Python
reassigned numerical values to all the categorical values, as sklearn only takes in numbers
tested 10 max depth settings to find the best hyperparameter value

Output:

CSVs of mapping information on 4 categorical variables to numerical values;
graph of train vs. test accuracy for different hyperparameter values;
CSV of feature importance comparison;
SAV file of the exported model (using Pickle)
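The workflow described above might look something like the sketch below. It runs on synthetic data from `make_classification`, and the feature names, the depth range 1 to 10, and the split settings are assumptions for illustration rather than the project's actual configuration.

```python
import pickle

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded oil spill data.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
feature_names = ["substance", "source", "quarter", "size"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit one tree per max_depth and record train vs. test accuracy.
rows = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    rows.append({
        "max_depth": depth,
        "train_accuracy": tree.score(X_train, y_train),
        "test_accuracy": tree.score(X_test, y_test),
    })
scores = pd.DataFrame(rows)

# Refit with the depth that scored best on the held-out split.
best_depth = int(scores.loc[scores["test_accuracy"].idxmax(), "max_depth"])
best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
best_tree.fit(X_train, y_train)

# Rank features from most to least important.
importances = pd.DataFrame({
    "feature": feature_names,
    "importance": best_tree.feature_importances_,
}).sort_values("importance", ascending=False)

# Serialize the fitted model; writing these bytes to a file gives the saved model.
model_bytes = pickle.dumps(best_tree)
```

Choosing the depth on the same split that is later reported as test accuracy is a simplification; cross-validation on the training data is usually the safer choice. The `scores` frame has the columns needed for a train vs. test accuracy plot.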

huijuechen commented 5 years ago

Script 4, model visualization is done

using Pickle to load saved model from Script 3 in Python;
using Graphviz to draw the decision tree from the model;

Output:

PDF format of the graph
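A hedged sketch of this step: load a pickled tree and turn it into Graphviz DOT source with sklearn's `export_graphviz`. The model here is a small stand-in trained on synthetic data, and the feature names are hypothetical.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in for the model saved by the previous script (synthetic data).
X, y = make_classification(
    n_samples=100, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
saved = pickle.dumps(DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y))

feature_names = ["substance", "source", "quarter", "size"]  # hypothetical names

# Load the model and export it as DOT text (out_file=None returns a string).
model = pickle.loads(saved)
dot_source = export_graphviz(
    model, out_file=None, feature_names=feature_names, filled=True
)

# Rendering to PDF needs the graphviz Python package and the Graphviz binaries:
# import graphviz
# graphviz.Source(dot_source, format="pdf").render("results/oil_spills_model", cleanup=True)
```

The rendering call is left commented out because it depends on Graphviz being installed on the machine.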

alyciakb commented 5 years ago

Completed first draft of the final report
Updated draft of README, still missing some info

TO DO:

delete the unused scripts and placeholder files from the various folders

huijuechen commented 5 years ago

For the final report:

If the model accuracy (i.e. the CV score) ever needs to be shown, I built a new data frame and wrote it to the CSV file "results/model_score.csv".
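One way such a score table could be produced, as a sketch only: the data is synthetic, and the 5-fold setting, tree depth, and column names are assumptions, not the values behind the real file.

```python
import io

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# 5-fold cross-validated accuracy for a small decision tree.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5
)

score_table = pd.DataFrame({
    "metric": ["mean_cv_accuracy", "std_cv_accuracy"],
    "value": [cv_scores.mean(), cv_scores.std()],
})

# Written to an in-memory buffer here; in the project this would be
# score_table.to_csv("results/model_score.csv", index=False)
buf = io.StringIO()
score_table.to_csv(buf, index=False)
csv_text = buf.getvalue()
```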

huijuechen commented 5 years ago

To convert the PDF of the decision tree to PNG, I tested the command below and it worked:

sips -s format png results/oil_spills_model.pdf --out results/oil_spills_model.png

I added the above to the end of the command line arguments in README.md.

Now the png "results/oil_spills_model.png" is ready to use in the final report.

huijuechen commented 5 years ago

delete the unused scripts and placeholder files from the various folders

Done.

huijuechen commented 5 years ago

Added the run_all.sh file and the command line for running it to README.

UBC-MDS / DSCI_522_Alberta-Oil-Spills