Closed ksedivyhaley closed 4 years ago
Thanks Kate. The issue with the full model not being considered in the Makefile has been resolved, and a PR will follow soon.
The EDA issue is still to be investigated. We ran the whole make quick pipeline on our computers before the release and saw no issues on our side. There might be some issues with dependencies, as setting up some of them was troublesome (like Selenium side-by-side). This should be resolved by Docker.
That said, should we care about figuring out this potential environment issue if we will be moving this to Docker anyway?
Let me first confirm it's a configuration issue - I'm able to run selenium in other pipelines and don't see any obvious issues with other dependencies.
If this is the case, then Docker should indeed take care of it.
This is what I get when I run make all
▶ make all
>>> Running download script...
python scripts/download.py --DATA_FILE_PATH=data/vehicles.csv --DATA_FILE_URL=http://mds.dev.synnergia.com/uploads/vehicles.csv --DATA_FILE_HASH=06e7bd341eebef8e77b088d2d3c54585
File "scripts/download.py", line 47
print("Checking cached data file... ", end='')
^
SyntaxError: invalid syntax
make: *** [data/vehicles.csv] Error 1
Yes, please fix your Makefile so that it runs and all your scripts are correct.
Ah, I fixed the issue above; it was a configuration issue on my side. No need to fix anything there.
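The `SyntaxError` on `print(..., end='')` in the log above is what a Python 2 interpreter reports for Python 3 print syntax, so one plausible reading of the configuration issue is that `python` resolved to Python 2 on that machine (the thread does not confirm this). Below is a minimal sketch of a guard, assuming a hypothetical helper file `check_python.py` that is not part of the repository. It has to live in its own file and use syntax valid under both interpreters, because a check placed inside `download.py` would never run: the file fails to parse first.

```python
# check_python.py (hypothetical helper, not part of the repository)
import sys

# Assumption: the scripts only need "some Python 3"; tighten if required.
MIN_VERSION = (3, 0)


def version_ok(version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def main():
    """Return 0 if the running interpreter is new enough, 1 otherwise."""
    if not version_ok(sys.version_info):
        # %-formatting keeps this file parseable by Python 2 as well.
        sys.stderr.write(
            "Python %d.%d found, but the scripts need Python %d.%d or newer\n"
            % (sys.version_info[0], sys.version_info[1],
               MIN_VERSION[0], MIN_VERSION[1])
        )
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Note that in the logs the download step is invoked as `python` while the test step is invoked as `python3`, so the two steps may not be using the same interpreter.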
Still get this issue:
>>> Testing model...
python3 scripts/test_model.py --TEST_SIZE=1 --MODEL_DUMP_PATH=results/model.pic
Loading model...
Loading test data...
Sampled test set to 74795 observations
Running model predictions...
Traceback (most recent call last):
File "scripts/test_model.py", line 142, in <module>
float(opt["--TEST_SIZE"]))
File "scripts/test_model.py", line 106, in main
test_score = model.score(X_test, y_test)
File "/usr/local/lib/python3.7/site-packages/sklearn/utils/metaestimators.py", line 116, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/sklearn/pipeline.py", line 615, in score
Xt = transform.transform(Xt)
File "/usr/local/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 588, in transform
Xs = self._fit_transform(X, None, _transform_one, fitted=True)
File "/usr/local/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 457, in _fit_transform
self._iter(fitted=fitted, replace_strings=True), 1))
File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 1007, in __call__
while self.dispatch_one_batch(iterator):
File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 835, in dispatch_one_batch
self._dispatch(tasks)
File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 754, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/usr/local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 209, in apply_async
result = ImmediateResult(func)
File "/usr/local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 590, in __init__
self.results = batch()
File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 256, in __call__
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 256, in <listcomp>
for func, args, kwargs in self.items]
File "/usr/local/lib/python3.7/site-packages/sklearn/pipeline.py", line 707, in _transform_one
res = transformer.transform(X)
File "/usr/local/lib/python3.7/site-packages/sklearn/pipeline.py", line 557, in _transform
Xt = transform.transform(Xt)
File "/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 390, in transform
X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
File "/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 124, in _transform
raise ValueError(msg)
ValueError: Found unknown categories ['hennessey'] in column 0 during transform
make: *** [results/test_results_sample.csv] Error 1
Firas, you're getting the issue with the pre-trained model.pic because it is not compatible with the latest test script, due to some changes and optimizations to the wrangling / ML preprocessing pipeline. It took us around 4 hours to train this original model on 50% of the data, and we have not had a chance to update it yet (with the revised ML pipeline).
We mainly shifted our focus to make quick, as it is much more reproducible for now, and we will re-run the whole make all later on the full dataset before the final release to update the final model (we expect that to run for 15 hours, from Friday evening until Saturday morning).
We made a note in the README that make all takes considerably more time and encouraged using make quick instead. Once again, the make all target will be revised and tested before the final release.
Thanks!
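The `ValueError` in the traceback above is raised by scikit-learn's encoder when the data passed to `transform` contains a category the fitted model never saw (here `'hennessey'`). Below is a minimal sketch of a pre-flight check that lists such values before scoring. The function name and all sample values other than `'hennessey'` are hypothetical, and it uses plain Python sets rather than anything from the project's pipeline.

```python
def find_unseen_categories(seen, incoming):
    """Return the sorted values in `incoming` that are absent from `seen`."""
    return sorted(set(incoming) - set(seen))


# Hypothetical values: categories the stored model was fitted on...
fitted_makes = ["ford", "toyota", "honda"]
# ...and the same column in the freshly wrangled test data.
test_makes = ["ford", "hennessey", "honda", "ford"]

unseen = find_unseen_categories(fitted_makes, test_makes)
if unseen:
    print("Categories missing from the fitted model:", unseen)
```

If the encoder in question is a `OneHotEncoder`, it also accepts `handle_unknown="ignore"`, which encodes unseen categories as all zeros instead of raising. Whether that suits this model is a separate decision; retraining on the revised pipeline, as planned above, addresses the mismatch directly.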
Thanks @pokrovskyy
make all does not work, as you mentioned above.
make quick works AFTER you install orca, which isn’t listed as a dependency here: https://github.com/UBC-MDS/DSCI_522_Group-308_Used-Cars
Probably safe to close now as it's obsolete with docker for milestone4
Right, thanks, this has been handled. We also added plotly-orca to the dependency tree in case someone wants to run it outside Docker.
Note also when running make all:
make quick does NOT have this issue (but still fails due to the EDA error above).