SimonDedman / gbm.auto

Machine-learning Boosted Regression Tree software suite for species distribution modelling in R
https://doi.org/10.1371/journal.pone.0188955

Gbm.Report #57

Open SimonDedman opened 4 years ago

SimonDedman commented 4 years ago

Gbm.report function using markdown to create a Word document with the full results section and model interpretation, plus data units and sources (get via gbm.call, have the user fill them in, generate a template csv file to look for), bfcheck and hyperparameter choices and options (gbm.call), and loop.

Compile all papers' reviewers' questions and criticisms into a doc and address them for the FI paper, but also here as structured questions / headings. Comparative analysis against GLMs and GAMs (& PCA?).

SimonDedman commented 4 years ago

Variable interactions (varint), correlations between variables. Pre- and post-modelling sections.

SimonDedman commented 3 years ago

Output as a full pdf (& Word doc?) for the SM which contains everything (including a link to this methodology as a DOI?), plus a pdf/Word doc for the MS methods subsection with only the required info.

SimonDedman commented 3 years ago

Speak to Hugo Flávio, who's done a lovely job on this with actel, e.g. https://cran.r-project.org/web/packages/actel/vignettes/a-0_workspace_requirements.html

Edit: now at https://hugomflavio.github.io/actel-website/index.html

SimonDedman commented 3 years ago

see https://yihui.org/knitr/ and https://kbroman.org/knitr_knutshell/ and https://github.com/hugomflavio/actel/blob/c899647d61183467d807c6277fa6e61de6066663/R/explore.R#L652

SD: very briefly, about the html report formatting: it looks like it's just sink to html, then cat(paste0(...)) with the various chunks you want, via knitr or R directly?

Hugo Flávio: So, for some background, the very first actel reports (back when it wasn't even called actel) were pdf files compiled from LaTeX. What I did there was sink the LaTeX code into a file, and then use an R command to compile said file to build a pdf. When I decided to move to rmarkdown, I kept the mechanics. So, I open the rmd file with line 639, cat the whole content into that file, and finally exit it in line 749. This all happens inside the function printExploreRmd(), which is called by explore() in line 519 of the same code file.

SD: cheers. Reading more about knitr, I wonder if spin and knit2html would be the preferred route if starting fresh (obviously you already had the working code).

HF: Then rmarkdown::render() comes in at line 531 to compile the html file from the rmd we created before. Ah, one last note: CRAN is very picky about files being written into the user's directory, so all the writing/compiling happens in tempdir(), and only the final html file is brought in, at line 535, with a final touch of automatically opening the report file with browseURL(). I haven't tried any other route, so I don't know, perhaps there is a better way. I find that having an rmd makes it easy and straightforward to edit things: just go to the right place in the cat() call and edit as desired. Oh, also, you can check lines 751 to 837, which contain the code for the index on the left side of the report.
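Distilled into a minimal sketch (function and file names here are hypothetical, and this is not actel's actual code): sink an Rmd into tempdir(), render it there, and copy only the final html back.

```r
# Minimal sketch of the sink/cat-an-Rmd-then-render workflow described above.
# report_sketch(), gbm_report.Rmd, and the report_csv argument are placeholders.
report_sketch <- function(report_csv, outfile = "gbm_report.html") {
  rmd <- file.path(tempdir(), "gbm_report.Rmd")
  sink(rmd)
  cat("---\ntitle: \"gbm.auto report\"\noutput: html_document\n---\n\n")
  cat("## Model evaluation\n\n")
  cat("```{r echo = FALSE}\n",
      "report <- read.csv(\"", normalizePath(report_csv, winslash = "/"), "\")\n",
      "knitr::kable(report)\n",
      "```\n", sep = "")
  sink()
  html <- rmarkdown::render(rmd, quiet = TRUE)  # compiles inside tempdir()
  file.copy(html, outfile, overwrite = TRUE)    # bring only the final html back
  utils::browseURL(outfile)                     # open the report, as actel does
}
```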

SimonDedman commented 3 years ago

See "gbm.auto team assemble!" email from me, 2021-10-07, and /home/simon/Dropbox/Galway/Analysis/R/gbm.auto/Gbm.auto_extras/Report_Statistics_Explainer_Improvements.ods

Chuck:

I've actually found the eval metrics as they exist right now to be really helpful (especially now that I have to communicate with people who came up on the industry side - they seem much more comforted by the true/false positive/negative metrics than, say, CV and AUC). I do like the idea of including training vs. CV AUC in there, since it's becoming a more regularly reported thing and I don't know that I've ever had a reviewer not ask me about overfitting in a BRT paper. I think seeing the CV statistics is really interesting and could be helpful in explaining how the model "picked" the best performing set of parameters.

I wonder if the idea of compiling all the reported eval metrics is worth putting a paper together? I'm sure none of us have a ton of time to devote to such a thing right now, but it would definitely be handy to have a paper on the best practices in evaluating BRT performance available to cite.

Bonnie:

The other thing I guess I am fuzzy on is making it clear that you want models with the lowest deviance; however, this isn't the same as deviance explained relative to the null. So in report.csv I see that the CV mean deviance is lowest in the best model (yay!), but in the MLEvalMetrics at the bottom there is "dev" and I don't know if I need to do anything with that number.

It might be good to note if certain metrics aren't available for an abundance-only model (an issue I had before; I still need to run that analysis). I couldn't find anywhere that explicitly said you wouldn't get an AUC/ROC for an abundance model, but I have the email thread where we agreed we didn't think you could get it if you didn't run the binary model.

That spreadsheet now here: https://docs.google.com/spreadsheets/d/1I43q-PAEGY97Ho_xZpZ5gbBjGgJodLzZGcaZuwaNUuQ/edit?usp=sharing

SimonDedman commented 3 years ago

see /home/simon/Dropbox/PostDoc Work/Machine Learning & Evaluation/ML&E Notes.docx

This file is now in the gbm.auto extras subfolder in this GitHub repo.

SimonDedman commented 3 years ago

see also https://github.com/adamlilith/enmSdm#model-evaluation & more sections

SimonDedman commented 3 years ago

and https://rspatial.org/raster/sdm/5_sdm_models.html#model-evaluation

ChuckBangley commented 3 years ago

RE: training vs. CV AUC. Is there a hard and fast rule as to how much of a difference between the two constitutes overfitting? Or is it something we can tease out of the literature? If so, it may be worth establishing some sort of threshold to make it obvious when overfitting is likely to be occurring.
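I'm not aware of a fixed threshold, but both values are available from the fitted object, so a user-settable cutoff could be flagged. A hedged sketch (element names as in dismo::gbm.step for binomial models; the 0.05 default is an arbitrary placeholder, not a literature-derived threshold):

```r
# Flag a possible-overfitting gap between training AUC and cross-validated AUC
auc_gap <- function(model, cutoff = 0.05) {
  train_auc <- model$self.statistics$discrimination     # AUC on the training data
  cv_auc    <- model$cv.statistics$discrimination.mean  # mean AUC across CV folds
  gap <- train_auc - cv_auc
  if (isTRUE(gap > cutoff)) {
    warning("Training AUC exceeds CV AUC by ", round(gap, 3),
            "; possible overfitting")
  }
  c(train_auc = train_auc, cv_auc = cv_auc, gap = gap)
}
```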

BonnieAhr commented 3 years ago

clarification in the documentation that gbm.loop is only used when using the predictive mapping option (i.e. grid = TRUE). (Correct me if I'm wrong - still learning).

SimonDedman commented 3 years ago

@BonnieAhr you don't need to run predictive mapping to use gbm.loop; it'll generate the summaries of the various report outputs regardless.

BonnieAhr commented 3 years ago

Under the Best Binary BRT column in report.csv, add the number of trees as well. Actually, maybe put all of the info for the best BRT in one column (like deviance and CV correlation) so I don't have to scroll over to find it and check 100 times that I'm looking at the right one. (Potentially even a flag if under 1000 trees.) If not in the output, maybe also include a warning that pops up at the end of the run if the best model has under 1000 trees.
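A minimal sketch of that end-of-run warning, assuming a fitted dismo::gbm.step object (which stores the chosen tree count in gbm.call$best.trees):

```r
# Warn if the best model settled on fewer than 1000 trees
if (model$gbm.call$best.trees < 1000) {
  warning("Best model used only ", model$gbm.call$best.trees,
          " trees; Elith et al. (2008) recommend at least 1000. ",
          "Consider lowering the learning rate (lr).")
}
```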

SimonDedman commented 3 years ago

<1000 trees warning: might be easier to put that in the report document, i.e. the notes/context section.

Bin/Gaus best extra stats: L934, 942, 959, 974: edit these to feature basically everything (single values only) from Col B of https://docs.google.com/spreadsheets/d/1I43q-PAEGY97Ho_xZpZ5gbBjGgJodLzZGcaZuwaNUuQ/edit#gid=141061300 ?

Edit: Self_CV_Statistics.csv has all the CV & self-statistics table contents for each model. Report has the number of trees. Maybe just add to bin_best: trees: 2250, CV Mean Deviance: 0.660950842944151, CV Deviance SE: 0.0253367448140445, CV Mean Correlation: 0.671531063164227, CV Correlation SE: 0.0159719292986984, and training AUC − CV AUC as an overfitting indicator.

Then explain everything in the separate document.

SimonDedman commented 2 years ago

[image: Elith et al. 2008 working guide SM] All elements now encapsulated in the Google Docs table.

In the gbm.step outputs, they list (presumably based on importance/relevance):

SimonDedman commented 2 years ago

How to understand/read the ML evaluation plots, what to look for. Have all the explanation in a document that lives in gbm.auto somewhere, maybe as a dataframe that people can call with "data"?? Or in the vignettes?

SimonDedman commented 2 years ago

Lit review started here: https://docs.google.com/document/d/1DybTZs6j4rUWaIBlbIcobN8839kK53nP873-mi3wt9k/edit?usp=sharing Please add your stuff. I'll finish adding papers then populate out their sections.

SimonDedman commented 2 years ago

2019.10.25 Lies, Damned Lies, and Accuracy Metrics in Machine Learning.odt
2021-03-02_Ashley_Jester_Stats_Consultation.odt
Evaluating Machine Learning Models: https://learning.oreilly.com/library/view/evaluating-machine-learning/9781492048756/

SimonDedman commented 2 years ago

Conventional machine learning vs deep learning: classifying goliath grouper inertial measurement unit data into behaviours. [Matthews Correlation Coefficient, Cohen's Kappa]. 25 https://bls.econference.io/public//main/sessions/3679 Lauran Brewster.

Ask Lauran about this.

SimonDedman commented 2 years ago

[image: my image from 2021-03-02_Ashley_Jester_Stats_Consultation.odt, referenced above]

SimonDedman commented 2 years ago

https://www.statology.org/balanced-accuracy/

SimonDedman commented 2 years ago

The correlation between predicted habitat values (averaged to 5° x 5° spatial resolution) and CPUE was calculated using a modified t-test from the R package ‘SpatialPack’ (Osorio et al., 2020), which accounts for the inflation of correlation significance caused by spatial autocorrelation.

Jon Dale paper
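A hedged sketch of that test, assuming pred (mean predicted habitat value per 5° cell), cpue (observed CPUE per cell), and coords (a two-column matrix of cell-centre coordinates):

```r
# Dutilleul's modified t-test for correlation under spatial autocorrelation,
# via SpatialPack; the variable names here are placeholders.
library(SpatialPack)
mt <- modified.ttest(x = pred, y = cpue, coords = coords)
mt  # prints the correlation and the autocorrelation-corrected p-value
```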

SimonDedman commented 2 years ago

https://mlr.mlr-org.com/articles/tutorial/measures.html

SimonDedman commented 2 years ago

https://scikit-learn.org/stable/modules/model_evaluation.html see also https://scikit-learn.org/stable/model_selection.html#model-selection & generally https://scikit-learn.org/stable/index.html

SimonDedman commented 2 years ago

Elith 2008: “Predictive performance should not be estimated on training data, but results are provided in Table 3 to show that BRT overfits the data, regardless of careful model development. [...] Statistics on predictive performance can be estimated from the subsets of data excluded from model fitting”, so: CV Mean Correlation | cv.statistics$correlation.mean = cv.cor ? Even then: how does this compare to e.g. GAMs?

Abeare thesis: “For each of the fitted models, the pseudo-R2, or D2, was calculated for comparison, where: D2 = 1 – (residual deviance/total deviance)”

"Lastly, the predictive performance of the four statistical modeling techniques was evaluated. Root mean square errors (RMSE) were calculated for predictions of log(CPUEyft) made by the fitted models, using the predictor values of the test dataset. Paired t-tests were then used to test the actual versus predicted log(CPUEyft), whereas F-tests were used to test for differences in the variance of predictions between the modeling techniques." pdf p57 barplots of RMSE & D2 for GAM GLM BRT, but p90 p91 no code for these plots.

RMSE: https://www.statology.org/how-to-calculate-rmse-in-r/ Calculated manually; needs predictions vs observations, i.e. a manual test/train split. Could it be calculated internally by gbm.step, per k-fold CV?
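A short sketch tying those together, assuming obs/pred are observed and predicted values on held-out data and model is a dismo::gbm.step output (element names as in dismo; gbm.step.sd may name them slightly differently):

```r
# Abeare-style pseudo-R2 (D2) from the CV deviance and the null (total) deviance
d_squared <- 1 - (model$cv.statistics$deviance.mean / model$self.statistics$mean.null)

# RMSE on a held-out test set, calculated manually
rmse <- sqrt(mean((obs - pred)^2))
# equivalently: Metrics::rmse(obs, pred)
```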

SimonDedman commented 2 years ago

Hastie, T., R. Tibshirani, and J.H. Friedman, 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York

gbm.step metrics: self.statistics$correlation = self.cor: runs predict.gbm using the gbm object created from all user values including all data, and predicts back to all the data. cv.statistics$correlation.mean = cv.cor: mean of the correlation (cor()) across each of the k CV folds. Should definitely use cv.cor, NOT self.cor.

cv.statistics$deviance.mean = cv.dev = dismo::calc.deviance() for each of the k folds: "Function to calculate deviance given two vectors of observed and predicted values. Requires a family argument which is set to binomial by default." Not very helpful. The number can be >1.

Added rmse to gbm.step.sd; could also add loads of other Metrics package metrics.
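For example, a few candidate additions on a fold's observed (obs) and predicted (pred) values, sketched with the Metrics package plus dismo's own deviance function:

```r
library(Metrics)
rmse(obs, pred)  # root mean square error (now added to gbm.step.sd)
mae(obs, pred)   # mean absolute error
auc(obs, pred)   # AUC; only meaningful for a 0/1 response

# dismo's per-fold deviance, binomial family by default
dismo::calc.deviance(obs, pred, family = "binomial", calc.mean = TRUE)
```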

SimonDedman commented 2 years ago

Abeare: D2 = 1 − (residual deviance / total deviance). Here cv.statistics$deviance.mean = cv.dev and self.statistics$null = total.deviance; this is now cv.stats$d.squared in gbm.step.sd.

SimonDedman commented 2 years ago

Need to work out how to do these for GLM & GAM still. And re-run the models, e.g. for FI paper.

SimonDedman commented 2 years ago

Already added deviance explained relative to null % in report. gbm.auto v1.5.7

SimonDedman commented 1 year ago

Report to add:

Add package version used somewhere. Not report, separate file?

packageVersion("gbm.auto") [1] ‘2023.3.12’ packageVersion("dismo") [1] ‘1.3.10’

Similarly sample sizes for bin & gaus?

Could put both in MLEvalMetricsBin.csv, but that relies on running Bin AND MLEvaluate == TRUE. Just add another file? Best place is in the Report: just before you save the report (L1184), add a final column with gbm.auto version, dismo version, sample size fam1 {sample size fam2}.

could add ZI% under ZI T/F?

DONE

SimonDedman commented 1 year ago

If doing predictions, once we have the predictions, also predict back to the input data, then run the ML eval metrics on those results: RMSE & % deviance explained (possible?) for continuous predictions, AUC & TSS for binary.
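A hedged sketch of those checks, assuming obs_bin/pred_bin are the 0/1 response and predicted probabilities, obs_abund/pred_abund the continuous response and predictions, and a placeholder 0.5 threshold for the confusion matrix (threshold choice is a separate question):

```r
# continuous predictions
rmse <- sqrt(mean((obs_abund - pred_abund)^2))

# binary predictions
auc  <- Metrics::auc(obs_bin, pred_bin)
cm   <- table(observed = obs_bin, predicted = as.integer(pred_bin >= 0.5))
sens <- cm["1", "1"] / sum(cm["1", ])  # sensitivity (true positive rate)
spec <- cm["0", "0"] / sum(cm["0", ])  # specificity (true negative rate)
tss  <- sens + spec - 1                # true skill statistic
```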

mhpob commented 11 months ago

Just chiming in here with an implementation suggestion: I've found great utility in putting a template rmd/qmd file in inst and providing all of the info via quarto::quarto_render(..., execute_params = ...) or rmarkdown::render(..., params = ...). @hugomflavio did an excellent job putting the raw RMarkdown into a function; it just wound up being tough for me to follow/debug in my own code!

Example boilerplate: https://github.com/mhpob/otndo/tree/main/inst/qmd_template Example rendering function/call: https://github.com/mhpob/otndo/blob/main/R/make_receiver_push_summary.R#L163-L189
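A minimal sketch of that pattern for gbm.auto, assuming a hypothetical parameterised template shipped at inst/rmd_template/gbm_report.Rmd with a report_csv parameter declared in its YAML header:

```r
# Locate the installed template, copy it to tempdir() (never write inside the
# package library), then render it with run-specific parameters.
template  <- system.file("rmd_template", "gbm_report.Rmd", package = "gbm.auto")
work_copy <- file.path(tempdir(), basename(template))
file.copy(template, work_copy, overwrite = TRUE)
rmarkdown::render(work_copy,
                  params = list(report_csv = "Report.csv"),
                  quiet  = TRUE)
```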

SimonDedman commented 11 months ago

https://jmlondon.github.io/pathroutr/articles/akharborseal_demo.html — another great Josh London example.