UBC-MDS / DSCI_522_Group-308_Used-Cars

This project attempts to build a regression model to predict price of used cars based on numerous features of the car
MIT License

Milestone 2 quick run notes #63

Status: Closed (pokrovskyy closed 4 years ago)

pokrovskyy commented 4 years ago

Kate,

Following our discussion in class, to quickly reproduce the Milestone 2 train / test pipeline, please replace the following two lines in scripts/train_model.py:

# Line 54:
train_data = train_data[train_data.odometer != 0]
# Replace with:
train_data = train_data[train_data.odometer != 0].sample(1000)
# Reason - only train on 1,000 examples instead of 400,000

# Line 110:
cat_transform = Pipeline(steps=[('ohe', OneHotEncoder())])
# Replace with:
cat_transform = Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore'))])
# Reason - needed because we only train on a tiny subset of full dataset
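To illustrate what the second change does, here is a minimal sketch using scikit-learn's OneHotEncoder on toy data. The two small lists are hypothetical and are not taken from the project's dataset; only the handle_unknown parameter comes from the change above.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column: 'hennessey' never appears in the training rows.
train = [["ford"], ["toyota"], ["ford"]]
valid = [["toyota"], ["hennessey"]]

# Default behaviour (handle_unknown='error'): an unseen category raises.
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(valid)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print("strict encoder failed:", err)

# With handle_unknown='ignore' the unseen category becomes an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train)
encoded = lenient.transform(valid).toarray()
print(encoded)
```

With the lenient encoder, the row for the unseen value carries no information about the manufacturer, so the model falls back on the other features for that row.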

Run these commands from the root directory:

python scripts/download.py --DATA_FILE_PATH=data/vehicles.csv --DATA_FILE_URL=http://mds.dev.synnergia.com/uploads/vehicles.csv --DATA_FILE_HASH=06e7bd341eebef8e77b088d2d3c54585
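As a rough sketch of what a hash check on a downloaded file can look like: the value passed as DATA_FILE_HASH is 32 hex digits, which is consistent with MD5, but that is an assumption and the thread does not say how scripts/download.py verifies it. The helper names md5_of_file and matches_hash are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_hash(path, expected_hex):
    """Compare a file's digest with an expected hex string (assumed to be MD5)."""
    return md5_of_file(path) == expected_hex.lower()


# Usage on a throwaway file, so the example does not touch the network.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name

expected = hashlib.md5(b"hello").hexdigest()
good = matches_hash(tmp_path, expected)
bad = matches_hash(tmp_path, "0" * 32)
os.remove(tmp_path)
print(good, bad)
```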

Rscript scripts/wrangling.R --DATA_FILE_PATH=data/vehicles.csv --TRAIN_FILE_PATH=data/vehicles_train.csv --TEST_FILE_PATH=data/vehicles_test.csv --TARGET=price --REMOVE_OUTLIERS=YES --TRAIN_SIZE=0.9

python scripts/eda.py --DATA_FILE_PATH=data/vehicles_train.csv --EDA_FILE_PATH=results/figures/

Run these commands from the scripts directory:

python train_model.py
python test_model.py 

Apart from the change to the training script above (which restricts training to a tiny portion of the dataset so the pipeline can be reproduced quickly), all the scripts work as expected given the correct parameters.

Let us know if you have any questions. Thanks!

ksedivyhaley commented 4 years ago

Hi Serg,

The pipeline works with the two script changes you described, when using the corrected paths.

I will also note that I appreciate the summaries of the train/test outputs that you print to the command line!

However, when I try to run the long-form train_model.py (without the two changes to the script you mention) I get the error ValueError: Found unknown categories ['hennessey'] in column 0 during transform. My understanding is that this is fixed by the second change you indicated: using OneHotEncoder(handle_unknown='ignore'). Unfortunately, the fact that this error occurs even when running train_model.py on the full dataset indicates that it's a stable error - not one that you only needed to correct for as part of making the "quick run" version.

My conclusion is that, for Milestone 2, I cannot consider train_model.py to be correct. Happily, because I was able to confirm that test_model.py worked (given the fix to train_model.py) and that the problems with the other scripts are related to incorrect paths, I can still revise your mark overall.

Thank you for providing such clear instructions to help me revisit this Milestone!

pokrovskyy commented 4 years ago

Hello Kate,

Thanks for getting back to us on this!

To your point on handle_unknown='ignore' - in fact we learned about and added this specific parameter to address the quick pipeline issue in Milestone 3. However, this only relates to the quick pipeline and does NOT affect the all pipeline.

The reason is that during quick pipeline training (when testing a hyperparameter set) you may sometimes run into this issue, because with such a tiny portion of the data some category values can end up only in the validation set and not in the train set. However, if you ran the all pipeline this would never happen, because the train and validation (as well as test) sets would all include every value of every categorical feature.
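The split mechanism described here can be sketched in plain Python. The manufacturer list and the simple every-third-row fold scheme are hypothetical and are not the project's data or its cross-validation setup. The sketch also shows that a value occurring only once is missing from training whenever it is held out, whatever the size of the dataset.

```python
def unseen_categories(train_values, valid_values):
    """Categories that occur in validation but never in training."""
    return set(valid_values) - set(train_values)


# Hypothetical manufacturer column; 'hennessey' occurs exactly once.
manufacturers = ["ford"] * 6 + ["toyota"] * 5 + ["hennessey"]

# A simple 3-fold split: rows whose index is congruent to `fold` mod 3 are held out.
results = []
for fold in range(3):
    valid = manufacturers[fold::3]
    train = [m for i, m in enumerate(manufacturers) if i % 3 != fold]
    results.append(unseen_categories(train, valid))

print(results)
```

An encoder fitted on the training rows of the last fold has never seen 'hennessey', which is the situation handle_unknown='ignore' is meant to tolerate.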

That said, just to summarize: it does affect the quick pipeline, but it does not affect the all pipeline, contrary to what you indicated above. Not sure if this helps with the grade, but I would appreciate it if it does :)

Thank you!

ksedivyhaley commented 4 years ago

Hi Serg - this is Milestone 2, not the Milestone 3 pipeline with make quick. The error affects the script pipeline as directly cloned from the repo (without adding the .sample(1000) selection) and run using the paths you described.

pokrovskyy commented 4 years ago

Wow, indeed Kate, you are correct, my apologies! I had only seen this kind of error with the quick pipeline, where it was to be expected. I don't remember this happening on the full dataset, and I can't imagine why it would happen. Anyway, it was fixed later on. Closing this now!