UBC-MDS / DSCI_522_Group-308_Used-Cars

This project attempts to build a regression model to predict the price of used cars based on numerous features of each car.
MIT License

[URGENT] Pipeline does not run #54

Closed ksedivyhaley closed 4 years ago

ksedivyhaley commented 4 years ago
size.png saved to results/figures/
Traceback (most recent call last):
  File "scripts/eda.py", line 172, in <module>
    main(opt["--DATA_FILE_PATH"], opt["--EDA_FILE_PATH"])
  File "scripts/eda.py", line 34, in main
    make_bars(data, eda_file_path)
  File "scripts/eda.py", line 153, in make_bars
    vehicles_graph = data[['price', categorical_features[i]]].groupby(by = categorical_features[i])\
  File "C:\Users\7ks42\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2986, in __getitem__
    indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
  File "C:\Users\7ks42\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1285, in _convert_to_indexer
    return self._get_listlike_indexer(obj, axis, **kwargs)[1]
  File "C:\Users\7ks42\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1092, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "C:\Users\7ks42\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1185, in _validate_read_indexer
    raise KeyError("{} not in index".format(not_found))
KeyError: "['type'] not in index"
make: *** [Makefile:38: results/figures/condition.png] Error 1

Note also when running make all:

>>> Testing model...
python scripts/test_model.py --TEST_SIZE=1 --MODEL_DUMP_PATH=results/model.pic
ERROR - No model file. Run scripts/train_model.py first

make quick does NOT have this issue (but still fails due to the EDA error above).
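
The KeyError in the EDA traceback above suggests that the loaded data has no `type` column at the point where `make_bars` selects it. A minimal sketch of a check that could run before plotting is shown here. The helper `missing_columns` and the sample header are hypothetical and are not taken from `scripts/eda.py`; only the column names `price` and `type` come from the traceback.

```python
import csv
import io


def missing_columns(header, required):
    """Return the required column names that are absent from header, in order."""
    present = set(header)
    return [name for name in required if name not in present]


# Hypothetical sample standing in for the header row of the vehicles data file.
sample = io.StringIO("price,condition,size\n4500,good,compact\n")
header = next(csv.reader(sample))

# 'type' is the column named in the KeyError; the rest are illustrative.
absent = missing_columns(header, ["price", "condition", "type"])
if absent:
    print("Columns missing from data file:", absent)
```

Reporting the missing names up front would make it easier to tell a stale or differently wrangled data file apart from a dependency problem.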

pokrovskyy commented 4 years ago

Thanks, Kate. The issue with the full model not being considered in the Makefile has been resolved and will be PRed soon.

The EDA issue is still to be investigated. We ran the whole make quick pipeline on our computers before the release and saw no issues on our side. There might be some issues with dependencies, as setting up some of them was troublesome (like Selenium side-by-side). This should be resolved by Docker.

That said, should we care about figuring out this potential environment issue if we will be moving this to Docker anyway?

ksedivyhaley commented 4 years ago

Let me first confirm it's a configuration issue - I'm able to run selenium in other pipelines and don't see any obvious issues with other dependencies.

If this is the case, then Docker should indeed take care of it.

firasm commented 4 years ago

This is what I get when I run make all

▶ make all
>>> Running download script...
python scripts/download.py --DATA_FILE_PATH=data/vehicles.csv --DATA_FILE_URL=http://mds.dev.synnergia.com/uploads/vehicles.csv --DATA_FILE_HASH=06e7bd341eebef8e77b088d2d3c54585
  File "scripts/download.py", line 47
    print("Checking cached data file... ", end='')
                                              ^
SyntaxError: invalid syntax
make: *** [data/vehicles.csv] Error 1

Yes, please fix your Makefile so that it runs and all your scripts are correct.

firasm commented 4 years ago

Ah, I fixed the issue above; it was a configuration issue on my side. No need to fix it.
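
A SyntaxError pointing at `end=''` in a `print(...)` call is what a Python 2 interpreter typically reports for Python 3 style code, so one plausible reading (an assumption, not confirmed in this thread) is that `python` resolved to Python 2 on that machine. A minimal sketch of a version guard follows. The name `is_supported` and the minimum version are hypothetical. Such a check only helps if it lives in a file that itself parses under Python 2 (for example a small launcher), because the SyntaxError is raised when the script is compiled, before any of its lines run.

```python
import sys

# Assumed minimum; the project does not state a required version in this thread.
MIN_VERSION = (3, 0)


def is_supported(version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of version_info meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if not is_supported(sys.version_info):
    # sys.exit with a string prints it to stderr and exits with status 1.
    sys.exit("This pipeline expects Python 3; found %d.%d" % tuple(sys.version_info[:2]))
```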

firasm commented 4 years ago

Still get this issue:

>>> Testing model...
python3 scripts/test_model.py --TEST_SIZE=1 --MODEL_DUMP_PATH=results/model.pic
Loading model...
Loading test data...
Sampled test set to 74795 observations
Running model predictions...
Traceback (most recent call last):
  File "scripts/test_model.py", line 142, in <module>
    float(opt["--TEST_SIZE"]))
  File "scripts/test_model.py", line 106, in main
    test_score = model.score(X_test, y_test)
  File "/usr/local/lib/python3.7/site-packages/sklearn/utils/metaestimators.py", line 116, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sklearn/pipeline.py", line 615, in score
    Xt = transform.transform(Xt)
  File "/usr/local/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 588, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/usr/local/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 457, in _fit_transform
    self._iter(fitted=fitted, replace_strings=True), 1))
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 1007, in __call__
    while self.dispatch_one_batch(iterator):
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 754, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 209, in apply_async
    result = ImmediateResult(func)
  File "/usr/local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 590, in __init__
    self.results = batch()
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 256, in __call__
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 256, in <listcomp>
    for func, args, kwargs in self.items]
  File "/usr/local/lib/python3.7/site-packages/sklearn/pipeline.py", line 707, in _transform_one
    res = transformer.transform(X)
  File "/usr/local/lib/python3.7/site-packages/sklearn/pipeline.py", line 557, in _transform
    Xt = transform.transform(Xt)
  File "/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['hennessey'] in column 0 during transform
make: *** [results/test_results_sample.csv] Error 1

pokrovskyy commented 4 years ago

Firas, you're getting the issue with the pre-trained model.pic because it is not compatible with the latest test script, due to some changes and optimizations to the wrangling / ML preprocessing pipeline. It took us around 4 hours to train this original model on 50% of the data, and we have not had a chance to update it yet (with the revised ML pipeline).

We mainly shifted our focus to make quick, as it is much more reproducible for now, and will re-run the whole make all later on the full dataset before the final release to update the final model (we expect that to run for 15 hours, from Friday evening until Saturday morning).

We made a note in the README that make all takes considerably more time and encouraged using make quick instead. Once again, the make all target will be revised and tested before the final release.

Thanks!
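
For context, the ValueError in the traceback above means the encoder met a category at test time (`hennessey`) that was not present when it was fitted. A minimal sketch of a diagnostic that lists such categories before scoring is shown below. The helper `unseen_categories` and the sample values are hypothetical, and treating column 0 as a manufacturer column is an assumption; only `hennessey` comes from the error message.

```python
def unseen_categories(train_values, test_values):
    """Return sorted categories present in test_values but not in train_values."""
    return sorted(set(test_values) - set(train_values))


# Illustrative values; only 'hennessey' appears in the error message.
train_manufacturers = ["ford", "toyota", "honda"]
test_manufacturers = ["ford", "hennessey", "toyota"]

unseen = unseen_categories(train_manufacturers, test_manufacturers)
print("Categories unseen during training:", unseen)
```

scikit-learn's `OneHotEncoder` also accepts `handle_unknown="ignore"`, which encodes unseen categories as all zeros instead of raising. Whether that fits this pipeline is a modelling decision that is not settled in this thread.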

firasm commented 4 years ago

Thanks @pokrovskyy

make all does not work as you mentioned above.

make quick works AFTER you install orca which isn’t listed as a dependency here: https://github.com/UBC-MDS/DSCI_522_Group-308_Used-Cars

firasm commented 4 years ago

Probably safe to close now as it's obsolete with docker for milestone4

pokrovskyy commented 4 years ago

Right, thanks, this has been handled. We also added plotly-orca to the dependencies in case someone wants to run it outside Docker.
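
For anyone running outside Docker, a minimal sketch of a pre-flight check for the orca dependency mentioned above could look like this. The helper `has_executable` is hypothetical, and the assumption that the executable is named `orca` on PATH is not confirmed in this thread.

```python
import shutil


def has_executable(name):
    """Return True if an executable called name can be found on PATH."""
    return shutil.which(name) is not None


# Assumption: plotly-orca provides an executable named 'orca'.
if not has_executable("orca"):
    print("orca not found on PATH; figure export may fail outside Docker")
```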