UBC-MDS / data-analysis-review-2023


Submission: Group 20: Red Wine Quality Prediction #3

Open AlysenTownsley opened 7 months ago

AlysenTownsley commented 7 months ago

Submitting authors: @alexzhang0825 @sungg888 @AlysenTownsley @nicolebid

Repository: https://github.com/UBC-MDS/Red-Wine-Quality-Prediction

Report link: https://ubc-mds.github.io/Red-Wine-Quality-Prediction/

Abstract/executive summary: In this project our group seeks to use machine learning algorithms to predict wine quality (on a scale of 0 to 10) from the physicochemical properties of the liquid. We use a train-test split and cross-validation to simulate the model encountering unseen data. We tune the hyperparameters of several classification models (logistic regression, decision tree, kNN, and SVM RBF) to see which one has the highest accuracy, and then deploy the winner on the test set. The final test set accuracy is around 62 percent. Depending on the standard, this can be decent or poor. More importantly, the model failed to correctly identify quite a few of the wines of extreme quality (below 5 or above 6), suggesting that it is not very robust to outliers. We include a final discussion section on some of the potential causes of this performance as well as proposed solutions for any future analysis.

Editor: @ttimbers

Reviewers: Shizhe Zhang, Tony Shum, Jake Barnabe, Yiwei Zhang

zywkloo commented 7 months ago

Data analysis review checklist

Reviewer: zywkloo

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The src directory contains only one file. As a result, the test cases that call helpers from src/ would most likely not run, because they would fail at the import stage.

  2. The test files could follow a more consistent naming convention, such as test-word1_word2_word3.py.

  3. Consider referencing environment.yaml in the Dockerfile to improve code reusability. For example:

    FROM quay.io/jupyter/minimal-notebook:2023-11-22
    
    WORKDIR /home/jovyan
    
    COPY environment.yaml .
    
    RUN conda env update --file environment.yaml

Despite some minor naming and importing issues, this project stands out for its comprehensive documentation, code quality, reproducibility, and thorough analysis reporting.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

jbarns14 commented 7 months ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The report is very thorough and explains the analysis in great detail. However, it includes many technical details that may be too complex to help the average reader. For instance, the section at the end of the introduction listing the software packages used in the analysis may not be meaningful to readers without technical training in data science. It may be worth removing that section to keep the report concise and understandable to all readers.

  2. To improve reproducibility, it may be worth adding input validation to the function modules, for example checking that the functions are passed inputs of the correct data type and that those inputs are non-null. One useful check would be to raise an error if the dataframe read in by the data_split.py script does not have the correct number of columns.

  3. The repository is structured in a largely clear and easy-to-navigate manner. However, I noticed that all but one of the scripts are stored in the scripts directory. It may be a good idea to keep every script in the scripts folder, as this would reduce the chance of readers missing the read_view.py script (as I almost did).
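The validation suggested in point 2 could look something like the sketch below. It is only an illustration: the function name `check_columns` and the expected count passed to it are assumptions, not code from the repository, and the check is written against a plain list of column names so that it does not depend on any particular library.

```python
from typing import Iterable


def check_columns(columns: Iterable[str], expected_count: int) -> list[str]:
    """Return the column names as a list, or raise if the input is unusable.

    `check_columns` is a hypothetical helper, not part of the project.
    """
    # Reject a missing input before trying to iterate over it.
    if columns is None:
        raise TypeError("columns must not be None")

    names = list(columns)

    # Every column name should be a string.
    if not all(isinstance(name, str) for name in names):
        raise TypeError("every column name must be a string")

    # Fail early if the table does not have the expected shape.
    if len(names) != expected_count:
        raise ValueError(
            f"expected {expected_count} columns, found {len(names)}"
        )
    return names


# Made-up column names, used only to exercise the check.
print(check_columns(["alcohol", "pH", "quality"], expected_count=3))
```

With a pandas dataframe, the same helper could be called as `check_columns(df.columns, expected_count=...)` right after the file is read, with the expected count set to whatever the raw data file actually contains.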

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

zhang-shizhe commented 7 months ago

Data analysis review checklist

Reviewer: zhang-shizhe

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The README.md file is well-detailed and organized. Great work on this!
  2. There appears to be a typo in the test_set_deployment.py script: it fails at run time with a FileNotFoundError for '../results/tables/x_test.csv'.
    Traceback (most recent call last):
    File "/home/jovyan/work/scripts/test_set_deployment.py", line 72, in <module>
    test_set_deployment()
    File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/jovyan/work/scripts/test_set_deployment.py", line 58, in test_set_deployment
    X_test = pd.read_csv((x_test_folder + 'x_test.csv'))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
    self.handles = get_handle(
                   ^^^^^^^^^^^
    File "/opt/conda/lib/python3.11/site-packages/pandas/io/common.py", line 863, in get_handle
    handle = open(
             ^^^^^
    FileNotFoundError: [Errno 2] No such file or directory: '../results/tables/x_test.csv'
  3. Given the models being used, it might be worth applying a log transformation to the features that show a pronounced skew. Features with high variance or long tails can dominate the distance calculations in the feature space, so transforming these features to be closer to a normal distribution might improve performance.
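As a rough illustration of point 3, the sketch below applies a log(1 + x) transformation to a right-skewed feature and compares the skewness before and after. The feature values are made up for the example and are not taken from the wine data, and the helper names `skewness` and `log1p_transform` are hypothetical.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def log1p_transform(values):
    """Apply log(1 + x), which is defined for x > -1 and so handles zeros."""
    return [math.log1p(v) for v in values]


# Hypothetical right-skewed feature values (illustration only).
feature = [0.5, 0.7, 0.8, 1.0, 1.2, 1.5, 2.0, 3.5, 8.0, 15.0]
transformed = log1p_transform(feature)

print(f"skew before: {skewness(feature):.2f}")
print(f"skew after:  {skewness(transformed):.2f}")
```

In a scikit-learn pipeline the same idea is commonly written as `FunctionTransformer(np.log1p)` applied to the skewed columns before scaling. Whether it actually helps here would need to be checked with cross-validation.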

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

tonyshumlh commented 7 months ago

Data analysis review checklist

Reviewer: @tonyshumlh

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.