Closed by pokrovskyy 4 years ago
Hi serg,
The pipeline works with the two script changes you described, when using the corrected paths.
I will also note that I appreciate the summaries of the train/test outputs that you print to the command line!
However, when I try to run the long-form train_model.py (without the two changes to the script you mention), I get the error ValueError: Found unknown categories ['hennessey'] in column 0 during transform. My understanding is that this is fixed by the second change you indicated: using OneHotEncoder(handle_unknown='ignore'). Unfortunately, the fact that this error occurs even when running train_model.py on the full dataset indicates that it's a stable error, not one that you needed to correct only as part of making the "quick run" version.
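For readers following along, the behaviour described above can be reproduced in isolation. This is a minimal sketch using scikit-learn's OneHotEncoder on made-up toy data; the arrays and variable names are assumptions for illustration and are not taken from train_model.py.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical column of car makes.
X_train = np.array([["ford"], ["bmw"], ["toyota"]], dtype=object)
# "hennessey" never appears in the training data.
X_valid = np.array([["ford"], ["hennessey"]], dtype=object)

# Default behaviour (handle_unknown="error"): transform raises on unseen values.
strict = OneHotEncoder()
strict.fit(X_train)
try:
    strict.transform(X_valid)
    strict_failed = False
    err_msg = ""
except ValueError as err:
    strict_failed = True
    err_msg = str(err)
    print("strict encoder failed:", err_msg)

# With handle_unknown="ignore", unseen values are encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(X_train)
encoded = lenient.transform(X_valid).toarray()
print(encoded)
```

With the lenient encoder, the row for the unseen make contains only zeros, so the model receives no information about that category rather than failing.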
My conclusion is that, for Milestone 2, I cannot consider train_model.py to be correct. Happily, because I was able to confirm that test_model.py worked (given the fix to train_model.py) and that the problems with the other scripts are related to incorrect paths, I can still revise your mark overall.
Thank you for providing such clear instructions to help me revisit this Milestone!
Hello Kate,
Thanks for getting back to us on this!
To your point on handle_unknown='ignore': in fact, we learned about and added this specific parameter to address the quick pipeline issue in Milestone 3. However, this only relates to the quick pipeline and does NOT affect the all pipeline.
The reason is that during quick pipeline training (when testing a hyperparameter set) you may run into such an issue because, with only a tiny portion of the data, some category values may end up in the validation set but not in the training set. However, if you run the all pipeline this would never happen, because the train and validation (as well as test) sets would 100% include all values for all categorical features.
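Whether a given split actually covers every category can be checked by comparing the sets of values on each side. The following is a rough sketch in plain Python with made-up data; split_values and unseen_categories are hypothetical helpers, not functions from the project scripts. Note that a value occurring only once lands in exactly one split, so it can be missing from training regardless of how large the dataset is.

```python
import random

def split_values(values, valid_fraction, seed):
    """Shuffle a copy of `values` and split it into (train, valid) lists."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

def unseen_categories(train_values, valid_values):
    """Return categories present in validation but absent from training."""
    return set(valid_values) - set(train_values)

# Hypothetical column of car makes: two common values and one that occurs once.
makes = ["ford"] * 50 + ["bmw"] * 50 + ["hennessey"]

# Try a few seeds and report which categories the training split never saw.
for seed in range(5):
    train, valid = split_values(makes, valid_fraction=0.2, seed=seed)
    print(seed, sorted(unseen_categories(train, valid)))
```

An empty list for a seed means the training split saw every make; a non-empty list names the values that would trigger the unknown-category error with a default encoder.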
That said, just to summarize: it does affect the quick pipeline but does not affect the all pipeline as you indicated above. Not sure if this helps with the grade, but I would appreciate it if it does :)
Thank you!
Hi Serg - this is Milestone 2, not the Milestone 3 pipeline with make quick. It affects the script pipeline as directly cloned from the repo (without adding the .sample(1000) selection) and run using the paths you described.
Wow, indeed, Kate, you are correct, my apologies! I only saw this kind of error with the quick pipeline, and that was reasonable. I don't remember this happening on the full dataset, and I can't imagine why it would happen. Anyway, it was fixed later on. Closing this now!
Kate,
Following our discussion in class, in order to be able to quickly reproduce the Milestone 2 train / test pipeline, please replace the 2 lines in scripts/train_model.py:
Run these commands from the root directory:
Run these commands from the scripts directory:
Apart from the change to the training script above (to restrict training to a tiny portion of the dataset for reproducibility), all the scripts work as expected given the correct parameters.
Let us know if you have any questions. Thanks!