UBC-MDS / DSCI522_group17

Group work for DSCI522. Our group number is 17.

Peer Feedback from Rafael #23

Closed · rtaph closed 3 years ago

rtaph commented 3 years ago

Hello Pan, Chun, and Sakshi!

Here is some peer feedback on commit `3f0c4a9`.

On the whole, the project looks cohesive and professional. Nicely done! I did not run into any technical issues running your project in a minimal environment with the listed dependencies.

Here are some thoughts:

1. Reproducibility

Your Make code runs on my computer from start to finish (hooray!), but your results are not 100% reproducible because you do not set `random_state=123` or `np.random.seed(123)`. I noticed this affects the accuracy scores for your `RandomForest()` and `DummyClassifier()`. I would parametrize the docopt script to take a default seed that is then set before the cross-validation begins. I would also suggest removing the score time and fit time from your report: I don't think they are that relevant, and they will always differ from one run to another (and therefore register as a git diff). If you do keep them in your report, consider putting the fit/score times in a separate table, so it is easier to compare (with git) whether you get the same CV scores from one call of `make all` to another.
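As a sketch of what I mean (the script name and docstring here are hypothetical; only the `--seed` option pattern matters):

```python
"""Fit models and save cross-validation results.

Usage:
  fit_model.py [--seed=<seed>]

Options:
  --seed=<seed>  Random seed, set before any cross-validation [default: 123].
"""
import numpy as np
from docopt import docopt

def main(seed):
    # One global seed, set once before any model fitting or CV,
    # makes the accuracy scores identical from run to run.
    np.random.seed(seed)
    # ... build RandomForestClassifier(random_state=seed), run cross_validate, etc.

if __name__ == "__main__":
    opt = docopt(__doc__)
    main(int(opt["--seed"]))
```

Passing the seed through to `random_state=` on the estimators themselves (rather than relying on the global seed alone) is the more robust habit with scikit-learn.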

2. DRY out your Makefile

Instead of writing out almost the same code and dependencies for each report (md/HTML), you could merge them into one rule and use the argument `output_format = 'all'` in `render()`.

```make
# render the report
docs/report.md : docs/report.Rmd results/boxplot.png results/densityplot.png results/cv_results.csv results/Bestmodel.csv
	Rscript -e "rmarkdown::render('docs/report.Rmd', output_format = 'github_document')"
docs/report.html : docs/report.Rmd results/boxplot.png results/densityplot.png results/cv_results.csv results/Bestmodel.csv
	Rscript -e "rmarkdown::render('docs/report.Rmd')"
```

This could be collapsed down to:

```make
# render the reports
docs/report.md docs/report.html : docs/report.Rmd results/boxplot.png results/densityplot.png results/cv_results.csv results/Bestmodel.csv
	Rscript -e "rmarkdown::render('docs/report.Rmd', output_format = 'all')"
```
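One caveat I would double-check: GNU Make treats a rule with two targets as two rules sharing a recipe, so under `make -j` the render could fire twice. If you are on GNU Make 4.3+, grouped targets (`&:` instead of `:`) declare explicitly that a single invocation produces both files. Not a blocker, just worth knowing.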

This would require you to also add `github_document` to your report YAML:

```yaml
output:
  github_document:
    toc: true
  html_document:
    toc: true
```

3. Overfitting or not overfitting?

In your report you say "From our cross validation results, we can see that we are overfitting with Random_Forest". But your test-set score is in line with your CV score, so I would not call this overfitting. Of course, your train score shows the model memorizing the data, but that is trivial, since no one in their right mind would use training performance as an estimate of generalization error for a complex model.

This is also mentioned in your discussion section.
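To make the comparison concrete, here is a minimal sketch of the three numbers I would put side by side (the data here is synthetic via `make_classification`, standing in for your preprocessed features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic stand-in for the real features/target
X, y = make_classification(n_samples=1000, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

model = RandomForestClassifier(random_state=123)
cv = cross_validate(model, X_train, y_train, cv=5, return_train_score=True)
model.fit(X_train, y_train)

print("mean train score:", cv["train_score"].mean())     # near 1.0: memorization, expected
print("mean CV score:   ", cv["test_score"].mean())      # honest estimate of generalization
print("test-set score:  ", model.score(X_test, y_test))  # in line with CV => not overfitting
```

If the CV and test scores track each other, the near-perfect train score on its own is not evidence of a problem.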

4. Minor issues and suggestions


Great work! Keep it up!

jachang0628 commented 3 years ago

Thank you, Rafael, for your feedback!