imtvwy opened this issue 2 years ago
(Work in progress)
Data analysis review checklist
Reviewer: @stevenleung2018
Conflict of interest
- [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.
Code of Conduct
- [x] I confirm that I read and will adhere to the MDS code of conduct.
General checks
- [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
- [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Documentation
- [1/2] Installation instructions: Is there a clearly stated list of dependencies?
- [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
- [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
- [1/3] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Code quality
- [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
- [x] Style guidelines: Does the code adhere to well known language style guides?
- [x] Modularity: Is the code suitably abstracted into scripts and functions?
- [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
Reproducibility
- [x] Data: Is the raw data archived somewhere? Is it accessible?
- [x] Computational methods: Is all the source code required for the data analysis available?
- [1/2] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
- [1/2] Automation: Can someone other than the authors easily reproduce the entire data analysis?
Analysis report
- [x] Authors: Does the report include a list of authors with their affiliations?
- [x] What is the question: Do the authors clearly state the research question being asked?
- [x] Importance: Do the authors clearly state the importance for this research question?
- [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
- [x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
- [x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
- [x] Conclusions: Are the conclusions presented by the authors correct?
- [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
- [x] Writing quality: Is the writing of good quality, concise, engaging?
Estimated hours spent reviewing:
About 1.5 hours.
Review Comments:
- I like the fact that you have tried multiple models and chose the best model in the report, together with a well-thought-out rationale for the choice.
- The whole data pipeline is present, and the code is written and organized in a readable fashion, allowing me to follow the whole analysis from beginning to end.
There are, however, a few issues I have spotted:

3. Installation instructions:
   a. The list of dependencies is not available.
   b. Installation of the environment fails on my computer. Here are the command and the error message:

```
(base) stevenprivate@StevenMac ~/mds/522/Giant_Pumpkins_Weight_Prediction (main)
$ conda env create -f environment.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
- pyqt5-sip==4.19.18=py39h415ef7b_8 - graphite2==1.3.13=1000 - openjpeg==2.4.0=hb211442_1 - libffi==3.4.2=h8ffe710_5 - selenium==3.141.0=py39hb82d6ee_1003 - intel-openmp==2021.4.0=h57928b3_3556 - pandas==1.3.4=py39h2e25243_1 - win_inet_pton==1.1.0=py39hcbf5309_3 - ucrt==10.0.20348.0=h57928b3_0 - libxgboost==1.3.0=h0e60522_3 - xorg-libx11==1.7.2=hcd874cb_0 - sqlite==3.36.0=h8ffe710_2 - setuptools==59.2.0=py39hcbf5309_0 - zeromq==4.3.4=h0e60522_1 - nodejs==14.17.4=h57928b3_0 - preshed==3.0.6=py39h415ef7b_1 - libsodium==1.0.18=h8d14728_1 - xorg-libice==1.0.10=hcd874cb_0 - murmurhash==1.0.6=py39h415ef7b_2 - cairo==1.16.0=hb19e0ff_1008 - cffi==1.15.0=py39h0878f49_0 - jpeg==9d=h8ffe710_0 - xorg-libxau==1.0.9=hcd874cb_0 - libpng==1.6.37=h1d00b33_2 - libwebp==1.2.1=h57928b3_0 - pyqt==5.12.3=py39hcbf5309_8 - statsmodels==0.13.1=py39h5d4886f_0 - scikit-learn==1.0.1=py39he931e04_2 - zlib==1.2.11=h8ffe710_1013 - gettext==0.19.8.1=ha2e2712_1008 - py-xgboost==1.3.0=py39hcbf5309_3 - pyqtwebengine==5.12.1=py39h415ef7b_8 - tornado==6.1=py39hb82d6ee_2 - psutil==5.8.0=py39hb82d6ee_2 - libclang==11.1.0=default_h5c34c98_1 - libbrotlicommon==1.0.9=h8ffe710_6 - xorg-libxt==1.2.1=hcd874cb_2 - certifi==2021.10.8=py39hcbf5309_1 - m2w64-gcc-libs-core==5.3.0=7 - graphviz==2.49.3=hefbd956_0 - libbrotlienc==1.0.9=h8ffe710_6 - catalogue==2.0.6=py39hcbf5309_0 - libxcb==1.13=hcd874cb_1004 - harfbuzz==3.1.1=hc601d6f_0 - pyzmq==22.3.0=py39he46f08e_1 - zstd==1.5.0=h6255e5f_0 - regex==2021.11.10=py39hb82d6ee_0 - xorg-libxpm==3.5.13=hcd874cb_0 - vega-cli==5.17.0=h0e60522_4 - srsly==2.4.2=py39h415ef7b_0 - pcre==8.45=h0e60522_0 - pyrsistent==0.18.0=py39hb82d6ee_0 - scipy==1.7.2=py39hc0c34ad_0 - ipykernel==6.5.0=py39h832f523_1 - xorg-libsm==1.2.3=hcd874cb_1000
- cython-blis==0.7.5=py39h5d4886f_1 - pthread-stubs==0.4=hcd874cb_1001 - jbig==2.1=h8d14728_2003 - lcms2==2.12=h2a16943_0 - vc==14.2=hb210afc_5 - matplotlib==3.5.0=py39hcbf5309_0 - libwebp-base==1.2.1=h8ffe710_0 - fonttools==4.28.1=py39hb82d6ee_0 - debugpy==1.5.1=py39h415ef7b_0 - importlib-metadata==4.8.2=py39hcbf5309_0 - libbrotlidec==1.0.9=h8ffe710_6 - chardet==4.0.0=py39hcbf5309_2 - python==3.9.7=h7840368_3_cpython - libxml2==2.9.12=hf5bbc77_1 - liblapack==3.9.0=12_win64_mkl - libblas==3.9.0=12_win64_mkl - llvmlite==0.36.0=py39ha0cd8c8_0 - numpy==1.21.4=py39h6635163_0 - libzlib==1.2.11=h8ffe710_1013 - m2w64-gmp==6.1.0=2 - pysocks==1.7.1=py39hcbf5309_4 - tk==8.6.11=h8ffe710_1 - libglib==2.70.1=h3be07f2_0 - lerc==3.0=h0e60522_0 - brotli==1.0.9=h8ffe710_6 - msys2-conda-epoch==20160418=1 - cryptography==36.0.0=py39h7bc7c5c_0 - pywin32==302=py39hb82d6ee_2 - libdeflate==1.8=h8ffe710_0 - numba==0.53.0=py39h69f9ab1_0 - vega-lite-cli==4.17.0=h57928b3_2 - fribidi==1.0.10=h8d14728_0 - catboost==1.0.3=py39hcbf5309_1 - click==8.0.3=py39hcbf5309_1 - m2w64-gcc-libs==5.3.0=7 - xz==5.2.5=h62dcd97_1 - jupyter_core==4.9.1=py39hcbf5309_1 - matplotlib-base==3.5.0=py39h581301d_0 - gts==0.7.6=h7c369d9_2 - jedi==0.18.1=py39hcbf5309_0 - ca-certificates==2021.10.8=h5b45459_0 - brotli-bin==1.0.9=h8ffe710_6 - libgd==2.3.3=h8bb91b0_0 - markupsafe==2.0.1=py39hb82d6ee_1 - lightgbm==3.3.1=py39h415ef7b_1 - libtiff==4.3.0=hd413186_2 - xorg-libxdmcp==1.1.3=hcd874cb_0 - kiwisolver==1.3.2=py39h2e07f2f_1 - m2w64-gcc-libgfortran==5.3.0=6 - qt==5.12.9=h5909a2a_4 - m2w64-libwinpthread-git==5.0.0.4634.697f757=2 - xorg-xextproto==7.3.0=hcd874cb_1002 - pyqt-impl==5.12.3=py39h415ef7b_8 - fontconfig==2.13.1=h1989441_1005 - pango==1.48.10=h33e4779_2 - vs2015_runtime==14.29.30037=h902a5da_5 - spacy==3.2.0=py39hefe7e4c_0 - freetype==2.10.4=h546665d_1 - lz4-c==1.9.3=h8ffe710_1 - xorg-kbproto==1.0.7=hcd874cb_1002 - mkl==2021.4.0=h0e2418a_729 - brotlipy==0.7.0=py39hb82d6ee_1003 - xorg-xproto==7.0.31=hcd874cb_1007
- thinc==8.0.12=py39hefe7e4c_0 - tbb==2021.4.0=h2d74725_1 - pyqtchart==5.12=py39h415ef7b_8 - cymem==2.0.6=py39h415ef7b_2 - libiconv==1.16=he774522_0 - pandoc==2.16.2=h8ffe710_0 - icu==68.2=h0e60522_0 - ipython==7.29.0=py39h832f523_2 - expat==2.4.1=h39d44d4_0 - pixman==0.40.0=h8ffe710_0 - libcblas==3.9.0=12_win64_mkl - pydantic==1.8.2=py39hb82d6ee_2 - pillow==8.4.0=py39h916092e_0 - shap==0.40.0=py39h2e25243_0 - xorg-libxext==1.3.4=hcd874cb_1 - getopt-win32==0.1=h8ffe710_0
```
Thus the subsequent `conda activate pumpkin` also fails. The `bash run_all.sh` runs fine after I `conda install scikit-learn` manually.

I have two suggestions for you to improve on this:

i. You should include a list like the example given by Tiffany (https://github.com/ttimbers/breast_cancer_predictor#dependencies); and
ii. The list should come from the libraries you actually need to import in your Python scripts and your `.Rmd` files. Looking at your `environment.yaml` file, I think it is too long. Please note that you only need to include the packages you explicitly load in your code; you can generate such a list by running `conda env export -f environment.yaml --from-history` (where the `--from-history` flag is the key), and `conda` will check the dependencies of those packages and upgrade them as necessary. After that, you should change the name of the environment to something meaningful, and also remove the last `prefix` line.
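For reference, an environment file produced this way would look roughly like the sketch below; the environment name and package list here are purely illustrative, not the project's actual dependencies:

```yaml
# Illustrative sketch only -- not this project's actual dependency list.
name: pumpkin            # a meaningful environment name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pandas
  - scikit-learn
  - matplotlib
# note: no trailing `prefix:` line, which would tie the file to one machine
```

Because only top-level packages are pinned, `conda` is free to resolve platform-appropriate builds on any operating system.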
- Community Guidelines: The following are missing: "2) Report issues or problems with the software 3) Seek support".
- Final report: The block quotation under the "Data" section about the Great Pumpkin Commonwealth (GPC) is a bit out of place and redundant, since the GPC was already mentioned in the "Introduction" section immediately before it.
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks, Steven, for the detailed review and feedback. We are planning to implement your feedback in phases; here is a tentative list of some of the immediate changes we are working on:
- Documentation: On dependencies, we really appreciate your suggestions. We have improved our environment file and are finalising a clear, shorter list of dependencies. For example, we have made changes to the environment file to make it platform independent and to solve some of the issues that came up in the first release. Here is a link to the specific PR. We will also append a list of dependencies to the README, matching the environment file. We are also working to improve the community guidelines by amending our contributing file.
- Code quality: We are planning to build and improve the tests, which should be ready by next week.
- Reproducibility: We understand your input regarding the environment issue and are working on it as stated above.
Estimated hours spent reviewing: 1.7 hours.
Data analysis review checklist
Reviewer: @RamiroMejia
Conflict of interest
- [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.
Code of Conduct
- [x] I confirm that I read and will adhere to the MDS code of conduct.
General checks
- [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
- [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Documentation
- [x] Installation instructions: Is there a clearly stated list of dependencies?
- [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
- [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
- [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Code quality
- [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
- [x] Style guidelines: Does the code adhere to well known language style guides?
- [x] Modularity: Is the code suitably abstracted into scripts and functions?
- [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
Reproducibility
- [x] Data: Is the raw data archived somewhere? Is it accessible?
- [x] Computational methods: Is all the source code required for the data analysis available?
- [ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
- [ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?
Analysis report
- [ ] Authors: Does the report include a list of authors with their affiliations?
- [x] What is the question: Do the authors clearly state the research question being asked?
- [x] Importance: Do the authors clearly state the importance for this research question?
- [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
- [x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
- [ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
- [x] Conclusions: Are the conclusions presented by the authors correct?
- [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
- [x] Writing quality: Is the writing of good quality, concise, engaging?
Estimated hours spent reviewing: 1.5
Review Comments:
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
- It seems like you have problems with missing values in some of the features. It would be interesting to know how many NAs you have in the data and include this in the EDA.
- You mentioned:

  > For the numeric features, we used a simple imputer to insert the ‘median’ value for any missing or Null values as well as a standard scaler via a pipeline. For categorical features, we similarly used a simple imputer but instead of filling in values with the mean, we filled them in with the value ‘missing’. We then used one hot encoding to encode the categorical features.
- Why did you decide to fill the NAs with 'missing'? Did you consider another imputation strategy, like `most_frequent`?
- It seems like your Random Forest model takes a lot of time to train. Did you try `DecisionTreeRegressor`?
- Use `RandomizedSearchCV` to find the best hyperparameters.
- Also, it would be interesting if you could include a table of model results with the training and test scores.
- How much did your base model improve after hyperparameter tuning?
- It would be great to know the most important features in the model; you could add a table with the coefficients that most affect your target.
- You can add the question of your analysis in the introduction text as part of your motivation.
- Add a flow chart of how the scripts are executed to the `README.md`.
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks, Ramiro, for the inputs; we truly appreciate your detailed observations. The following are our immediate plans to implement your feedback and suggestions:
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
In general, I think the structure of the repo is well done! I would recommend having an .md file for the final report already in the repo for visualization purposes only. Also, I would probably change the name of the final report, just so it is clearer that it is in fact the final results that you got.
I think it would be interesting to dig deeper into the conclusions drawn from the model's results. The report shows that the model performs well and reports its score, but we do not hear any conclusions about the features or whether your assumptions held true.
I feel like the list of dependencies that we need to install is too long, and it should be limited to the packages that are actually needed for running the analysis.
I guess this will be added later, but it would be a good idea to have tests in your code as well, just to make sure that any manual parts work properly.
I liked the introduction of the report. It was fun to read, and it gives a very good idea on what the purpose of the analysis is.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks, everyone, for your valuable feedback to help us improve our project. We have made the following changes in response to your comments:
- Regarding comment 3 in this issue, we have
- Regarding comment 5 in this issue, we have updated the EDA document to add figure captions and reposition the plots below the reasoning: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/commit/01f0083e34ff1af0313482a38af613458d1fbf0c
- Regarding the comments in this issue about the final report, we have
- Regarding comment 5 in this issue, we have
- Regarding comment 2 in this issue, we have added a "Critique, Limitations and Future Improvements" section in the final report: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/commit/dfafe4f9c1cddc308d2153c9880acc27d04352cd
- Regarding the comments about adding a process flow in this issue, as well as the other issue, we have added a Makefile dependency diagram in the README file: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/commit/66509148343ec1800c0d7d70c7b0e958cb76ba00
Submitting authors: @mahsasarafrazi, @shivajena, @Rowansiv, @imtvwy
Repository: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction
Report link: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/blob/main/doc/pumpkin.html
Abstract/executive summary: This project is an attempt to build a prediction model using regression-based machine learning models to estimate the weight of giant pumpkins from features such as year of cultivation, place, and over-the-top (OTT) size, in order to predict the next year's winner of the GP competition. Different regression-based prediction models, such as Linear, Ridge, and Random Forest, were used for training and cross-validation on the training data. For the Ridge model, the hyperparameter (α) was optimised to return the best cross-validation score. This model performed fairly well in predicting on the test data, which led us to finalise it as our prediction model. The best score on the cross-validation sets is 0.6666134 and the mean test score is 0.6619808. The Random Forest model had similar cross-validation and test scores, but due to its high fit times it was not chosen for this report. Therefore, for the purpose of reproducibility, we have decided to utilise the Ridge model as our prediction model. For better performance and precision, other models may also be tried on the data.
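The α optimisation described in the abstract is typically done with a cross-validated grid search. A minimal scikit-learn sketch, using synthetic data and an assumed alpha grid rather than the project's actual setup, might look like this:

```python
# Minimal sketch of tuning Ridge's alpha by cross-validation. The data and
# the alpha grid are synthetic placeholders, not the project's actual setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                     # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,                # 5-fold cross-validation on the training data
    scoring="r2",        # R^2, the default score reported in the abstract
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The resulting `search.best_estimator_` can then be scored once on the held-out test split to report the final test score.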
The data used for this project comes from BigPumpkins.com. The dataset is a public-domain resource describing the attributes of giant pumpkins grown in around 20 countries across different regions of the world. The raw data used in this project for the analysis can be found here: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-19/pumpkins.csv
Editor: @mahsasarafrazi, @shivajena, @Rowansiv, @imtvwy
Reviewers: @RamiroMejia, @riddhisansare, @stevenleung2018, @ruben1dlg