UBC-MDS / data-analysis-review-2023


Submission: <GROUP 16: portugal_white_wine_quality_predictor> #17

Open jokittipong opened 7 months ago

jokittipong commented 7 months ago

Submitting authors: <@jokittipong> <@sho-i98> <@Nicole-Tu97>

Repository: https://github.com/UBC-MDS/portugal_white_wine_quality_predictor_py

Report link: https://rawcdn.githack.com/UBC-MDS/portugal_white_wine_quality_predictor_py/8f098a7da456a3dcbe0863817da5203760776339/report/_build/_page/portugal_white_wine_quality_predictor_report/html/portugal_white_wine_quality_predictor_report.html

Abstract/executive summary: We built a predictive model using polynomial regression with ridge regularization, with hyperparameters tuned by randomized search, to predict the quality rating (on a 0-10 scale) of Portuguese white wine from the physicochemical properties of each wine. The model was trained on the Portuguese white wine data set, which has 4898 observations. The model does not perform well on either the training data or the unseen test data: the test score is about 0.32, with an average train score of 0.36 and an average test score of 0.33, and both the MSE (mean squared error) and the root MSE are high.

We suspect the model predicts poorly because wine quality is judged subjectively and varies with individual taste. Moreover, there is no standard for taste: for example, a high or low level of acidity, alcohol, or sulfur does not by itself indicate whether a wine is of good quality (it can go both ways!). We therefore see this model as a starting point for further study. With more data, it could be used to analyze which combinations of physicochemical properties indicate wine quality, although more research is needed to improve the model's performance and to understand the patterns in its incorrect predictions.
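For readers less familiar with the metrics quoted in the abstract, here is a minimal sketch of how MSE, root MSE, and an R² score are computed. It assumes the reported "score" is R² (the default score of scikit-learn regressors), which the abstract does not state explicitly. The function names and the four ratings are made up for illustration and are not taken from the project.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root MSE, expressed in the same units as the quality rating."""
    return sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """R^2: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical quality ratings and predictions (illustration only).
y_true = [5, 6, 7, 6]
y_pred = [5.5, 6.0, 6.5, 6.0]

print(mse(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Under that reading, a score near 0.32 would mean the model explains roughly a third of the variance in the quality ratings.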

The data set used in this project contains white vinho verde wine samples from the north of Portugal and was created by P. Cortez, A. Cerdeira, Fernando Almeida, Telmo Matos, and J. Reis (2009). It was sourced from the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/dataset/186/wine+quality). The data set records the physicochemical properties of each wine together with its quality rating, which we use to build the quality prediction model.

Editor: @jokittipong Reviewer: <@alanpow> <@joeywwwu> <@srfrew> <@jinyz8888>

jinyz8888 commented 7 months ago

Data analysis review checklist

Reviewer: jinyz8888

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Pretty good report. Suggested improvements: 1) Remove the numbers at the beginning of the report. 2) Tidy the repository structure; for example, remove .cache/matplotlib if possible. 3) Optimize the model in future work.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

srfrew commented 7 months ago

Data analysis review checklist

Reviewer: @srfrew

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

    ============================== warnings summary ===============================
    ../../../opt/conda/lib/python3.11/site-packages/click/core.py:1155
      /opt/conda/lib/python3.11/site-packages/click/core.py:1155: PytestCollectionWarning: cannot collect 'test_and_deploy' because it is not a function.
        def __call__(self, *args: t.Any, **kwargs: t.Any) -> t.Any:

    -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    ============================= 1 warning in 0.77s ==============================

- Unfortunately, I wasn't able to run `eda.py` in Docker Compose; it threw the following error:

      /opt/conda/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
        if pd.api.types.is_categorical_dtype(vector):
      Traceback (most recent call last):
        File "/home/jovyan/work/script/eda.py", line 91, in <module>
          eda_script()
        File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
          return self.main(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1078, in main
          rv = self.invoke(ctx)
               ^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
          return ctx.invoke(self.callback, **ctx.params)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
          return __callback(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/jovyan/work/script/eda.py", line 79, in eda_script
          sns.histplot(white_train[column], kde=True, color='pink')
        File "/opt/conda/lib/python3.11/site-packages/seaborn/distributions.py", line 1438, in histplot
          p.plot_univariate_histogram(
        File "/opt/conda/lib/python3.11/site-packages/seaborn/distributions.py", line 431, in plot_univariate_histogram
          all_data = self.comp_data.dropna()
                     ^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/seaborn/_oldcore.py", line 1119, in comp_data
          with pd.option_context('mode.use_inf_as_null', True):
        File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 478, in __enter__
          self.undo = [(pat, _get_option(pat)) for pat, val in self.ops]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 478, in <listcomp>
          self.undo = [(pat, _get_option(pat)) for pat, val in self.ops]
                             ^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 146, in _get_option
          key = _get_single_key(pat, silent)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 132, in _get_single_key
          raise OptionError(f"No such keys(s): {repr(pat)}")
      pandas._config.config.OptionError: No such keys(s): 'mode.use_inf_as_null'
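The `OptionError` in this traceback is raised inside pandas rather than in `eda.py` itself: seaborn asks pandas for the option `mode.use_inf_as_null`, and pandas 2.0 removed that option. The sketch below flags that combination from two version strings. The seaborn cut-off is an assumption (0.12.2 is believed to be the first release that no longer requests the old option) and should be checked against the seaborn release notes. The function names are hypothetical.

```python
import re

# pandas 2.0 removed the option 'mode.use_inf_as_null'.
PANDAS_REMOVED_OPTION_IN = (2, 0, 0)
# ASSUMPTION: first seaborn release that no longer requests the removed option.
FIRST_SEABORN_WITHOUT_OLD_OPTION = (0, 12, 2)


def parse_version(text):
    """Turn '2.1.3' into (2, 1, 3); suffixes such as 'rc1' are ignored."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def hits_use_inf_as_null_error(pandas_version, seaborn_version):
    """True when the installed pair is expected to raise the OptionError."""
    return (
        parse_version(pandas_version) >= PANDAS_REMOVED_OPTION_IN
        and parse_version(seaborn_version) < FIRST_SEABORN_WITHOUT_OLD_OPTION
    )


# Hypothetical version strings; the versions in the image are not stated here.
print(hits_use_inf_as_null_error("2.1.3", "0.12.1"))
```

If the check fires for the versions in the Docker image, pinning `pandas<2` or upgrading seaborn in the environment file would be the usual remedies.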


- When running `fit_polynomial_regression.py`, I noticed a large number of warnings from ridge about matrix singularity. This may indicate significant multicollinearity or identical columns; removing duplicate features or using lasso regression to select features may help!

#### Attribution

This was derived from the [JOSE review checklist](https://openjournals.readthedocs.io/en/jose/review_checklist.html) and the ROpenSci review checklist.

joeywwwu commented 7 months ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1h

Review Comments:

Strengths

  1. The report is well-structured, providing a clear overview, introduction, methods, discussion, results, and references, which makes it easy to follow.
  2. The use of Polynomial Regression with Ridge Regularization and Randomized Search for Hyperparameters is well-explained.
  3. The project discusses the limitations of the model and suggests future research directions, demonstrating a critical understanding of the work.

Areas for Improvement

  1. The model's performance is moderate (test score around 0.32). If possible, I suggest experimenting with ensemble methods that combine predictions from multiple models to improve accuracy and robustness.
  2. The report acknowledges the subjectivity in wine quality assessment but does not propose how to address this challenge in the modeling approach.
  3. If possible, work with wine experts or sommeliers to gain insights that could influence feature selection or model interpretation. Their expertise may provide valuable context that is not apparent from the data alone.
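As a minimal illustration of the ensemble suggestion above, the sketch below averages the predictions of two models and compares root MSE values. All numbers are made up and the function names are hypothetical. Averaging is not guaranteed to beat the best single model, but its root MSE can never exceed the mean of the individual root MSEs, and it helps most when the models' errors partly cancel.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between ratings and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def average_predictions(*prediction_lists):
    """Simple averaging ensemble: mean prediction per observation."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


# Hypothetical quality ratings and predictions from two hypothetical models.
y_true = [5, 6, 7, 6]
model_a = [5.6, 6.4, 6.2, 5.8]
model_b = [4.6, 5.8, 7.4, 6.4]

blended = average_predictions(model_a, model_b)

print("model A:", round(rmse(y_true, model_a), 3))
print("model B:", round(rmse(y_true, model_b), 3))
print("average:", round(rmse(y_true, blended), 3))
```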

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

alanpow commented 7 months ago

Data analysis review checklist

Reviewer: Alanpow

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 Hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The report format is a little strange: the first section under the title displays 5 boxes with numbers. I am not sure what their purpose is, but they do not look good, and I believe they were left in by accident. I would also include a little more information in the methods section explaining what polynomial regression is, perhaps with a formula or mathematical equation showing what the model is made of. Right now it is described as better than linear regression because it can capture more complex relationships. This is true, but it does not really tell a reader what a polynomial regression model is or the nuance behind it. I would also tie the results back to the overall research question and add a formal conclusion: the results section only references the model scores, with no interpretation of the final model or of what the results mean for the problem at hand.

  2. The report does not include much information about why the question being explored is important or what the motivation is for predicting ratings of Portuguese wine. You did a great job giving background on which aspects of wine contribute to ratings and what merits a good wine; however, there is little reasoning or drive behind the report as a whole. Maybe attach it to a business angle, exploring how these ratings could help sell wine or how wine ratings could be maximized to maximize profit.

  3. I was able to run the Dockerfile and start running your scripts to reproduce your analysis; however, every script I ran after ingesting the raw data errored out. I would also recommend fixing up the README, specifically the instructions for running the scripts, because they are written as regular text rather than as inline code, which is easier to follow. These are the commands that led to the errors:

     522 Peer Review: Script Commands Failing:

     `python script/eda.py data/Processed/white_train.csv` produced the head of the data set as well as the correlation matrix; afterwards, however, I was left with errors:

         (base) jovyan@73c381c0b26d:~/work$ python script/eda.py data/Processed/white_train.csv
         [12 rows x 12 columns]
         /opt/conda/lib/python3.11/site-packages/seaborn/_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
           if pd.api.types.is_categorical_dtype(vector):
         Traceback (most recent call last):
           File "/home/jovyan/work/script/eda.py", line 91, in <module>
             eda_script()
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
             return self.main(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1078, in main
             rv = self.invoke(ctx)
                  ^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
             return ctx.invoke(self.callback, **ctx.params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
             return __callback(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/home/jovyan/work/script/eda.py", line 79, in eda_script
             sns.histplot(white_train[column], kde=True, color='pink')
           File "/opt/conda/lib/python3.11/site-packages/seaborn/distributions.py", line 1438, in histplot
             p.plot_univariate_histogram(
           File "/opt/conda/lib/python3.11/site-packages/seaborn/distributions.py", line 431, in plot_univariate_histogram
             all_data = self.comp_data.dropna()
                        ^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/seaborn/_oldcore.py", line 1119, in comp_data
             with pd.option_context('mode.use_inf_as_null', True):
           File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 478, in __enter__
             self.undo = [(pat, _get_option(pat)) for pat, val in self.ops]
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 478, in <listcomp>
             self.undo = [(pat, _get_option(pat)) for pat, val in self.ops]
                                ^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 146, in _get_option
             key = _get_single_key(pat, silent)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/pandas/_config/config.py", line 132, in _get_single_key
             raise OptionError(f"No such keys(s): {repr(pat)}")
         pandas._config.config.OptionError: No such keys(s): 'mode.use_inf_as_null'
         (base) jovyan@73c381c0b26d:~/work$ python script/fit_polynomial_regression.py data/Processed/white_train.csv data/Processed/white_test.csv
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=7.94006e-17): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=7.61806e-17): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=7.86797e-17): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=7.68219e-17): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=9.65767e-17): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=5.87782e-22): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:239: UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
           warnings.warn(
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:239: UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
           warnings.warn(
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:239: UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
           warnings.warn(
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:239: UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
           warnings.warn(
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:239: UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
           warnings.warn(
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=5.85444e-22): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=1.16842e-21): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         /opt/conda/lib/python3.11/site-packages/sklearn/linear_model/_ridge.py:200: LinAlgWarning: Ill-conditioned matrix (rcond=6.18235e-22): result may not be accurate.
           return linalg.solve(A, Xy, assume_a="pos", overwrite_a=True).T
         Traceback (most recent call last):
           File "/home/jovyan/work/script/fit_polynomial_regression.py", line 80, in <module>
             polynomial_regression()
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
             return self.main(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1078, in main
             rv = self.invoke(ctx)
                  ^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
             return ctx.invoke(self.callback, **ctx.params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
             return __callback(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/home/jovyan/work/script/fit_polynomial_regression.py", line 67, in polynomial_regression
             random_search.fit(x_train_w, y_train_w)
           File "/opt/conda/lib/python3.11/site-packages/sklearn/base.py", line 1152, in wrapper
             return fit_method(estimator, *args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 898, in fit
             self._run_search(evaluate_candidates)
           File "/opt/conda/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 1809, in _run_search
             evaluate_candidates(
           File "/opt/conda/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 845, in evaluate_candidates
             out = parallel(
                   ^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 65, in __call__
             return super().__call__(iterable_with_config)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 1952, in __call__
             return output if self.return_generator else list(output)
                                                         ^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 1595, in _get_outputs
             yield from self._retrieve()
           File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 1699, in _retrieve
             self._raise_error_fast()
           File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 1734, in _raise_error_fast
             error_job.get_result(self.timeout)
           File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 736, in get_result
             return self._return_or_raise()
                    ^^^^^^^^^^^^^^^^^^^^^^^
           File "/opt/conda/lib/python3.11/site-packages/joblib/parallel.py", line 754, in _return_or_raise
             raise self._result
         joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

         The exit codes of the workers are {SIGKILL(-9)}
         (base) jovyan@73c381c0b26d:~/work$
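The `SIGKILL(-9)` exit code is consistent with the memory explanation that joblib gives in its message. A full polynomial expansion of n input features to degree d produces C(n + d, d) columns (bias column included), so the design matrix grows very quickly with the degree. The sketch below estimates the size of a dense float64 matrix for a few degrees. The 11 input features and the degrees shown are assumptions for illustration, since the hyperparameter range the authors searched is not shown in this thread. The row count is the full 4898 observations from the abstract, so it is an upper bound on the training split.

```python
from math import comb

N_INPUTS = 11  # ASSUMPTION: physicochemical input columns in the data set
N_ROWS = 4898  # all observations; the training split is smaller


def n_polynomial_features(n_inputs, degree):
    """Columns in a full polynomial expansion, bias column included."""
    return comb(n_inputs + degree, degree)


def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense matrix of float64 values."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Hypothetical degrees; the searched range is not stated in this thread.
for degree in (2, 3, 5, 8):
    cols = n_polynomial_features(N_INPUTS, degree)
    size = dense_matrix_gib(N_ROWS, cols)
    print(f"degree {degree}: {cols} columns, about {size:.3f} GiB per copy")
```

When the search runs several workers in parallel, each worker may hold its own copy of the expanded matrix, so the total can be a multiple of the per-copy figure. Capping the degree or lowering the number of parallel jobs are two things worth trying.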

Look into this because it is erroring out as well as timing out.
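For the formula suggested in point 1 above, something along the following lines could go in the methods section. This is a sketch for a degree-2 model with $p$ input features and $n$ observations. The symbols are illustrative, and the degree and regularization strength $\alpha$ that the authors actually selected are not stated in this thread.

$$
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \sum_{j=1}^{p} \sum_{k=j}^{p} \beta_{jk}\, x_{ij} x_{ik}
$$

$$
\min_{\beta}\; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \alpha \left( \sum_{j=1}^{p} \beta_j^2 + \sum_{j=1}^{p} \sum_{k=j}^{p} \beta_{jk}^2 \right)
$$

The first expression is ordinary linear regression on the original features plus their squares and pairwise products. The second adds the ridge penalty, which shrinks the coefficients (the intercept $\beta_0$ is left unpenalized) and becomes stronger as $\alpha$ grows.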

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.