imtvwy opened this issue 2 years ago
(Work in progress)
Data analysis review checklist
Reviewer: @stevenleung2018
Conflict of interest
- [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.
Code of Conduct
- [x] I confirm that I read and will adhere to the MDS code of conduct.
General checks
- [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
- [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Documentation
- [1/2] Installation instructions: Is there a clearly stated list of dependencies?
- [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
- [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
- [1/3] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Code quality
- [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
- [x] Style guidelines: Does the code adhere to well known language style guides?
- [x] Modularity: Is the code suitably abstracted into scripts and functions?
- [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
Reproducibility
- [x] Data: Is the raw data archived somewhere? Is it accessible?
- [x] Computational methods: Is all the source code required for the data analysis available?
- [1/2] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
- [1/2] Automation: Can someone other than the authors easily reproduce the entire data analysis?
Analysis report
- [x] Authors: Does the report include a list of authors with their affiliations?
- [x] What is the question: Do the authors clearly state the research question being asked?
- [x] Importance: Do the authors clearly state the importance for this research question?
- [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
- [x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
- [x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
- [x] Conclusions: Are the conclusions presented by the authors correct?
- [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
- [x] Writing quality: Is the writing of good quality, concise, engaging?
Estimated hours spent reviewing:
About 1.5 hours.
Review Comments:
- I like the fact that you have tried multiple models and chose the best model in the report, together with a well-thought-out rationale for the choice.
- The whole data pipeline is present, and the code is written and organized in a readable fashion, allowing me to follow the whole analysis from beginning to end.
There are, however, a few issues I have spotted:

3. Installation instructions:
   a. The list of dependencies is not available.
   b. Installation of the environment fails on my computer. Here are the command and the error message:

```
(base) stevenprivate@StevenMac ~/mds/522/Giant_Pumpkins_Weight_Prediction (main)
$ conda env create -f environment.yaml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
- pyqt5-sip==4.19.18=py39h415ef7b_8 - graphite2==1.3.13=1000 - openjpeg==2.4.0=hb211442_1 - libffi==3.4.2=h8ffe710_5 - selenium==3.141.0=py39hb82d6ee_1003 - intel-openmp==2021.4.0=h57928b3_3556 - pandas==1.3.4=py39h2e25243_1 - win_inet_pton==1.1.0=py39hcbf5309_3 - ucrt==10.0.20348.0=h57928b3_0 - libxgboost==1.3.0=h0e60522_3 - xorg-libx11==1.7.2=hcd874cb_0 - sqlite==3.36.0=h8ffe710_2 - setuptools==59.2.0=py39hcbf5309_0 - zeromq==4.3.4=h0e60522_1 - nodejs==14.17.4=h57928b3_0 - preshed==3.0.6=py39h415ef7b_1 - libsodium==1.0.18=h8d14728_1 - xorg-libice==1.0.10=hcd874cb_0 - murmurhash==1.0.6=py39h415ef7b_2 - cairo==1.16.0=hb19e0ff_1008 - cffi==1.15.0=py39h0878f49_0 - jpeg==9d=h8ffe710_0 - xorg-libxau==1.0.9=hcd874cb_0 - libpng==1.6.37=h1d00b33_2 - libwebp==1.2.1=h57928b3_0 - pyqt==5.12.3=py39hcbf5309_8 - statsmodels==0.13.1=py39h5d4886f_0 - scikit-learn==1.0.1=py39he931e04_2 - zlib==1.2.11=h8ffe710_1013 - gettext==0.19.8.1=ha2e2712_1008 - py-xgboost==1.3.0=py39hcbf5309_3 - pyqtwebengine==5.12.1=py39h415ef7b_8 - tornado==6.1=py39hb82d6ee_2 - psutil==5.8.0=py39hb82d6ee_2 - libclang==11.1.0=default_h5c34c98_1 - libbrotlicommon==1.0.9=h8ffe710_6 - xorg-libxt==1.2.1=hcd874cb_2 - certifi==2021.10.8=py39hcbf5309_1 - m2w64-gcc-libs-core==5.3.0=7 - graphviz==2.49.3=hefbd956_0 - libbrotlienc==1.0.9=h8ffe710_6 - catalogue==2.0.6=py39hcbf5309_0 - libxcb==1.13=hcd874cb_1004 - harfbuzz==3.1.1=hc601d6f_0 - pyzmq==22.3.0=py39he46f08e_1 - zstd==1.5.0=h6255e5f_0 - regex==2021.11.10=py39hb82d6ee_0 - xorg-libxpm==3.5.13=hcd874cb_0 - vega-cli==5.17.0=h0e60522_4 - srsly==2.4.2=py39h415ef7b_0 - pcre==8.45=h0e60522_0 - pyrsistent==0.18.0=py39hb82d6ee_0 - scipy==1.7.2=py39hc0c34ad_0 - ipykernel==6.5.0=py39h832f523_1 - xorg-libsm==1.2.3=hcd874cb_1000
- cython-blis==0.7.5=py39h5d4886f_1 - pthread-stubs==0.4=hcd874cb_1001 - jbig==2.1=h8d14728_2003 - lcms2==2.12=h2a16943_0 - vc==14.2=hb210afc_5 - matplotlib==3.5.0=py39hcbf5309_0 - libwebp-base==1.2.1=h8ffe710_0 - fonttools==4.28.1=py39hb82d6ee_0 - debugpy==1.5.1=py39h415ef7b_0 - importlib-metadata==4.8.2=py39hcbf5309_0 - libbrotlidec==1.0.9=h8ffe710_6 - chardet==4.0.0=py39hcbf5309_2 - python==3.9.7=h7840368_3_cpython - libxml2==2.9.12=hf5bbc77_1 - liblapack==3.9.0=12_win64_mkl - libblas==3.9.0=12_win64_mkl - llvmlite==0.36.0=py39ha0cd8c8_0 - numpy==1.21.4=py39h6635163_0 - libzlib==1.2.11=h8ffe710_1013 - m2w64-gmp==6.1.0=2 - pysocks==1.7.1=py39hcbf5309_4 - tk==8.6.11=h8ffe710_1 - libglib==2.70.1=h3be07f2_0 - lerc==3.0=h0e60522_0 - brotli==1.0.9=h8ffe710_6 - msys2-conda-epoch==20160418=1 - cryptography==36.0.0=py39h7bc7c5c_0 - pywin32==302=py39hb82d6ee_2 - libdeflate==1.8=h8ffe710_0 - numba==0.53.0=py39h69f9ab1_0 - vega-lite-cli==4.17.0=h57928b3_2 - fribidi==1.0.10=h8d14728_0 - catboost==1.0.3=py39hcbf5309_1 - click==8.0.3=py39hcbf5309_1 - m2w64-gcc-libs==5.3.0=7 - xz==5.2.5=h62dcd97_1 - jupyter_core==4.9.1=py39hcbf5309_1 - matplotlib-base==3.5.0=py39h581301d_0 - gts==0.7.6=h7c369d9_2 - jedi==0.18.1=py39hcbf5309_0 - ca-certificates==2021.10.8=h5b45459_0 - brotli-bin==1.0.9=h8ffe710_6 - libgd==2.3.3=h8bb91b0_0 - markupsafe==2.0.1=py39hb82d6ee_1 - lightgbm==3.3.1=py39h415ef7b_1 - libtiff==4.3.0=hd413186_2 - xorg-libxdmcp==1.1.3=hcd874cb_0 - kiwisolver==1.3.2=py39h2e07f2f_1 - m2w64-gcc-libgfortran==5.3.0=6 - qt==5.12.9=h5909a2a_4 - m2w64-libwinpthread-git==5.0.0.4634.697f757=2 - xorg-xextproto==7.3.0=hcd874cb_1002 - pyqt-impl==5.12.3=py39h415ef7b_8 - fontconfig==2.13.1=h1989441_1005 - pango==1.48.10=h33e4779_2 - vs2015_runtime==14.29.30037=h902a5da_5 - spacy==3.2.0=py39hefe7e4c_0 - freetype==2.10.4=h546665d_1 - lz4-c==1.9.3=h8ffe710_1 - xorg-kbproto==1.0.7=hcd874cb_1002 - mkl==2021.4.0=h0e2418a_729 - brotlipy==0.7.0=py39hb82d6ee_1003 - xorg-xproto==7.0.31=hcd874cb_1007
- thinc==8.0.12=py39hefe7e4c_0 - tbb==2021.4.0=h2d74725_1 - pyqtchart==5.12=py39h415ef7b_8 - cymem==2.0.6=py39h415ef7b_2 - libiconv==1.16=he774522_0 - pandoc==2.16.2=h8ffe710_0 - icu==68.2=h0e60522_0 - ipython==7.29.0=py39h832f523_2 - expat==2.4.1=h39d44d4_0 - pixman==0.40.0=h8ffe710_0 - libcblas==3.9.0=12_win64_mkl - pydantic==1.8.2=py39hb82d6ee_2 - pillow==8.4.0=py39h916092e_0 - shap==0.40.0=py39h2e25243_0 - xorg-libxext==1.3.4=hcd874cb_1 - getopt-win32==0.1=h8ffe710_0
```
Thus the subsequent `conda activate pumpkin` also fails. The `bash run_all.sh` runs fine after I `conda install scikit-learn` manually.

I have two suggestions for you to improve on this:

i. You should include a list like the example given by Tiffany (https://github.com/ttimbers/breast_cancer_predictor#dependencies); and
ii. The list should come from the libraries you actually need to import in your Python scripts and your `.Rmd` files. Looking at your `environment.yaml` file, I think it is too long. Please note that you only need to include the packages you explicitly load in your code; you can generate such a list by running `conda env export -f environment.yaml --from-history` (where the `--from-history` flag is the key), and `conda` will check the dependencies of those packages and upgrade them as necessary. After that, you should change the name of the environment to something meaningful, and also remove the last `prefix` line.
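For reference, an environment file produced this way would look roughly like the sketch below; the environment name and package list here are purely illustrative, not the project's actual dependencies:

```yaml
# Illustrative sketch only -- not this project's actual dependency list.
name: pumpkin            # a meaningful environment name
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pandas
  - scikit-learn
  - matplotlib
# note: no trailing `prefix:` line, which would tie the file to one machine
```

Because only top-level packages are pinned, `conda` is free to resolve platform-appropriate builds on any operating system.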
- Community Guidelines: The following are missing: "2) Report issues or problems with the software 3) Seek support".
- Final report: The block quotation under the "Data" section about the Great Pumpkin Commonwealth (GPC) is a bit out of place and redundant, since the GPC was already mentioned in the "Introduction" section immediately before it.
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks, Steven, for the detailed review and feedback. We are planning to implement your feedback in phases; here is a tentative list of some of the immediate changes we are working on:
- Documentation: On dependencies, we really appreciate your suggestions. We have improved our environment file and are finalising a clear, shorter list of dependencies. For example, we have made changes to the environment file to make it platform independent and to solve some of the issues that came up in the first release. Here is a link to the specific PR. We will also append a list of dependencies to the README, matching the environment file. We are also working to improve the community guidelines by amending our contributing file.
- Code quality: We are planning to build and improve the tests, which should be ready by next week.
- Reproducibility: We understand your input regarding the environment issue and are working on it as stated above.
Estimated hours spent reviewing: 1.7 hours.
Data analysis review checklist
Reviewer: @RamiroMejia
Conflict of interest
- [x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.
Code of Conduct
- [x] I confirm that I read and will adhere to the MDS code of conduct.
General checks
- [x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
- [x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
Documentation
- [x] Installation instructions: Is there a clearly stated list of dependencies?
- [x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
- [x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
- [x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
Code quality
- [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
- [x] Style guidelines: Does the code adhere to well known language style guides?
- [x] Modularity: Is the code suitably abstracted into scripts and functions?
- [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
Reproducibility
- [x] Data: Is the raw data archived somewhere? Is it accessible?
- [x] Computational methods: Is all the source code required for the data analysis available?
- [ ] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
- [ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?
Analysis report
- [ ] Authors: Does the report include a list of authors with their affiliations?
- [x] What is the question: Do the authors clearly state the research question being asked?
- [x] Importance: Do the authors clearly state the importance for this research question?
- [x] Background: Do the authors provide sufficient background information so that readers can understand the report?
- [x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
- [ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
- [x] Conclusions: Are the conclusions presented by the authors correct?
- [x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
- [x] Writing quality: Is the writing of good quality, concise, engaging?
Estimated hours spent reviewing: 1.5
Review Comments:
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
- It seems like you have problems with missing values in some of the features. It would be interesting to know how many NAs you have in the data and include this in the EDA.
- You mentioned:

  > For the numeric features, we used a simple imputer to insert the ‘median’ value for any missing or Null values as well as a standard scaler via a pipeline. For categorical features, we similarly used a simple imputer but instead of filling in values with the mean, we filled them in with the value ‘missing’. We then used one hot encoding to encode the categorical features.
- Why did you decide to fill the NAs with 'missing'? Did you consider another imputation strategy, like `most_frequent`?
- It seems like your Random Forest model takes a lot of time to train. Did you try `DecisionTreeRegressor`?
- Use `RandomizedSearchCV` to find the best hyperparameters.
- Also, it would be interesting if you could include a table of model results with the training and test scores.
- How much did your base model improve after hyperparameter tuning?
- It would be great to know the most important features in the model; you could add a table with the coefficients that most affect your target.
- You can add the question of your analysis in the introduction text as part of your motivation.
- Add a flow chart of how the scripts are executed to the `README.md`.
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks, Ramiro, for the inputs; we truly appreciate your detailed observations. The following are our immediate plans to implement your feedback and suggestions:
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
In general, I think the structure of the repo is well done! I would recommend having an .md file for the final report already in the repo for visualization purposes only. Also, I would probably change the name of the final report, just so it is clearer that it is in fact the final results that you got.
I think it would be interesting to dig deeper into the conclusions drawn from the model's results. The report shows that the model performs well and reports its score, but we do not hear any conclusions about the features or whether your assumptions held true.
I feel like the list of dependencies that we need to install is too long, and it should be limited to the packages that are actually needed for running the analysis.
I guess this will be added later, but it would be a good idea to have tests in your code as well, just to make sure that any manual parts work properly.
I liked the introduction of the report. It was fun to read, and it gives a very good idea on what the purpose of the analysis is.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Thanks, everyone, for your valuable feedback to help us improve our project. We have made the following changes in response to your comments:
- Regarding comment 3 in this issue, we have
- Regarding comment 5 in this issue, we have updated the EDA document to add figure captions and reposition the plots below the reasoning: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/commit/01f0083e34ff1af0313482a38af613458d1fbf0c
- Regarding the comments in this issue about the final report, we have
- Regarding comment 5 in this issue, we have
- Regarding comment 2 in this issue, we have added a "Critique, Limitations and Future Improvements" section in the final report: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/commit/dfafe4f9c1cddc308d2153c9880acc27d04352cd
- Regarding the comments about adding a process flow in this issue, as well as the other issue, we have added a Makefile dependency diagram in the README file: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/commit/66509148343ec1800c0d7d70c7b0e958cb76ba00
Submitting authors: @mahsasarafrazi, @shivajena, @Rowansiv, @imtvwy
Repository: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction
Report link: https://github.com/UBC-MDS/Giant_Pumpkins_Weight_Prediction/blob/main/doc/pumpkin.html
Abstract/executive summary: This project is an attempt to build a prediction model using regression-based machine learning models to estimate the weight of giant pumpkins from features such as year of cultivation, place, and over-the-top (OTT) size, in order to predict the next year's winner of the GP competition. Different regression-based prediction models, such as Linear, Ridge, and Random Forest, were used for training and cross-validation on the training data. For the Ridge model, the hyperparameter (α) was optimised to return the best cross-validation score. This model performed fairly well in predicting on the test data, which led us to finalise it as our prediction model. The best score on the cross-validation sets is 0.6666134 and the mean test score is 0.6619808. The Random Forest model had similar cross-validation and test scores, but due to its high fit times it was not chosen for this report. Therefore, for the purpose of reproducibility, we have decided to utilise the Ridge model as our prediction model. For better performance and precision, other models may also be tried on the data.
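The α optimisation described in the abstract is typically done with a cross-validated grid search. A minimal scikit-learn sketch, using synthetic data and an assumed alpha grid rather than the project's actual setup, might look like this:

```python
# Minimal sketch of tuning Ridge's alpha by cross-validation. The data and
# the alpha grid are synthetic placeholders, not the project's actual setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                     # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,                # 5-fold cross-validation on the training data
    scoring="r2",        # R^2, the default score reported in the abstract
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The resulting `search.best_estimator_` can then be scored once on the held-out test split to report the final test score.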
The data used for this project comes from BigPumpkins.com. The dataset is a public-domain resource describing the attributes of giant pumpkins grown in around 20 countries across different regions of the world. The raw data used in this project for the analysis can be found here: https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-19/pumpkins.csv
Editor: @mahsasarafrazi, @shivajena, @Rowansiv, @imtvwy
Reviewers: @RamiroMejia, @riddhisansare, @stevenleung2018, @ruben1dlg