Group project completed as part of the UBC Master of Data Science program. The project involved building and analyzing a machine learning model that predicts the quality rating a wine will receive from a critic based on a variety of physicochemical factors.
Great work on the project! I found it really interesting to read, and the final result looks polished. I will share a few minor suggestions that might be helpful to you.
Documentation:
I think having the author of each script and the date it was created at the top of the file might be helpful: if in the future there is a need to change the code, or a question comes up about one part of it, the person responsible for that code could easily be found.
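For example, a short header at the top of each script could look something like this (the name, date, and usage line are placeholders for illustration, not taken from your repo):

```python
# Author: Jane Doe (placeholder)
# Date: 2020-11-28 (placeholder)
"""Cleans and splits the raw wine data (example description).

Usage: preprocess.py --input=<input_file> --out_dir=<out_dir>
"""
```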
Communication:
Although it is possible to use the Dockerfile to rebuild everything, having all the files present in the repository might also be a good idea for someone who only wants to see the results. While navigating the latest version of the repository and the milestone, I could not find the raw data or the processed data in a data folder.
After running the pipeline, I could see the files used in the final report being created. Still, just by looking at the repository, I couldn't find the feather files used in the final report; it might be useful to have them in the repository so that it becomes apparent to the reader that some files are created specifically to be used in the final report.
I tried to understand where the misc/process files are used in the pipeline, as I didn't see them referenced in the Makefile. If they are not being used, consider removing them; otherwise, consider giving them better names, because names like part1_code and part2_code don't help the reader understand what those files do.
Code:
In the src/preprocess.py script, cross_val_score and cross_validate are imported but not used. Removing unused imports from the scripts helps with readability. When I reviewed the script and saw cross_validate imported, I assumed cross-validation was being performed in that script, but that was not the case.
Analysis and reasoning:
I think the two points below are the most important ones to address. In the model_fitting.py file, cv=3 was used. Usually the standard is cv=5 or more, because a higher cv tends to reduce the variance of the reported error; often cv=10 or even more is chosen. I think cv=3 is a bit low and might not return the best results. It might be a good idea to either increase it or to mention in the final report why cv=3 was chosen, as in the sketch below.
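As a rough illustration (the synthetic data and the pipeline here are placeholders, not your actual objects from model_fitting.py), increasing the number of folds is a one-argument change:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine training data
X_train, y_train = make_regression(n_samples=500, n_features=11, random_state=123)

# Placeholder pipeline; substitute the one built in model_fitting.py
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=123))

# cv=5 (or 10) gives a lower-variance error estimate than cv=3,
# at the cost of proportionally more fitting time
scores = cross_validate(pipe, X_train, y_train, cv=5,
                        scoring="neg_mean_absolute_error")
print(scores["test_score"].mean())
```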
The hyperparameter optimization used was RandomizedSearchCV. Something that I think should be addressed is why the hyperparameters weren't specified as ranges or distributions, for example "randomforestregressor__n_estimators": [300, 600, 900] --> a distribution over 300 to 900. Listing only a few discrete values instead of a range somewhat defeats the purpose of using a random search, because the search space is heavily constrained. Maybe try changing the hyperparameters to ranges and see if a better result can be found.
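A minimal sketch of what that could look like with scipy.stats distributions (the synthetic data, pipeline, and the extra max_depth parameter are assumptions for illustration, not your actual setup):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine training data
X_train, y_train = make_regression(n_samples=500, n_features=11, random_state=123)

pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=123))

# Distributions let the random search sample any integer in the range
# instead of only the three listed values
param_dist = {
    "randomforestregressor__n_estimators": randint(300, 900),
    "randomforestregressor__max_depth": randint(2, 20),
}

search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5,
                            scoring="neg_mean_absolute_error",
                            random_state=123, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
```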
Altogether, I think the project was done very well, and the things I mentioned are minor. Great job!