SimonDedman / gbm.auto

Machine-learning Boosted Regression Tree software suite for species distribution modelling in R
https://doi.org/10.1371/journal.pone.0188955

Gbm.Report #57

Open SimonDedman opened 4 years ago

SimonDedman commented 4 years ago

Gbm.report function using markdown to create a Word document with the full results section and model interpretation, plus data units and sources (get via gbm.call, have the user fill them in, generate a template csv file to look for), bfcheck and hyperparameter choices and options (gbm.call), and loop.

Compile all papers' reviewers' questions and criticisms into a doc and address them for the FI paper, but also here as structured questions / headings. Comparative analysis against GLMs and GAMs (& PCA?).

SimonDedman commented 4 years ago

Variable interactions (varint), correlations between variables. Pre- and post-modelling sections.

SimonDedman commented 3 years ago

Output as a full pdf (& Word doc?) for the SM which contains everything (including a link to this methodology as a DOI?), plus a pdf/Word doc for the MS methods subsection with only the required info.

SimonDedman commented 3 years ago

Speak to Hugo Flávio, who's done a lovely job on this with actel, e.g. https://cran.r-project.org/web/packages/actel/vignettes/a-0_workspace_requirements.html

Edit: now at https://hugomflavio.github.io/actel-website/index.html

SimonDedman commented 3 years ago

see https://yihui.org/knitr/ and https://kbroman.org/knitr_knutshell/ and https://github.com/hugomflavio/actel/blob/c899647d61183467d807c6277fa6e61de6066663/R/explore.R#L652

SD: very briefly, about the html report formatting: it looks like it's just sink to html, then cat(paste0(...)) with the various chunks you want, via knitr or R directly?

Hugo Flávio: So, for some background, the very first actel reports (back when it wasn't even called actel) were pdf files compiled from LaTeX. What I did there was sink the LaTeX code into a file, and then use an R command to compile said file to build a pdf. When I decided to move to rmarkdown, I kept the mechanics. So, I open the rmd file with line 639, cat the whole content into that file, and finally exit it in line 749. This all happens inside the function printExploreRmd(), which is called by explore() in line 519 of the same code file.

SD: cheers. Reading more about knitr, I wonder if spin and knit2html would be the preferred route if starting fresh (obviously you already had the working code).

HF: Then rmarkdown::render() comes in at line 531 to compile the html file from the rmd we created before. Ah, one last note: CRAN is very picky about files being written into the user's directory, so all the writing/compiling happens in tempdir(), and only the final html file is brought in, at line 535, with a final touch of automatically opening the report file with browseURL(). I haven't tried any other route, so I don't know, perhaps there is a better way. I find that having an rmd makes it easy and straightforward to edit things: just go to the right place in the cat() call and edit as desired. Oh, also, you can check lines 751 to 837, which contain the code for the index on the left side of the report.
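Distilled into a minimal sketch (function and file names here are hypothetical, and this is not actel's actual code): sink an Rmd into tempdir(), render it there, and copy only the final html back.

```r
# Minimal sketch of the sink/cat-an-Rmd-then-render workflow described above.
# report_sketch(), gbm_report.Rmd, and the report_csv argument are placeholders.
report_sketch <- function(report_csv, outfile = "gbm_report.html") {
  rmd <- file.path(tempdir(), "gbm_report.Rmd")
  sink(rmd)
  cat("---\ntitle: \"gbm.auto report\"\noutput: html_document\n---\n\n")
  cat("## Model evaluation\n\n")
  cat("```{r echo = FALSE}\n",
      "report <- read.csv(\"", normalizePath(report_csv, winslash = "/"), "\")\n",
      "knitr::kable(report)\n",
      "```\n", sep = "")
  sink()
  html <- rmarkdown::render(rmd, quiet = TRUE)  # compiles inside tempdir()
  file.copy(html, outfile, overwrite = TRUE)    # bring only the final html back
  utils::browseURL(outfile)                     # open the report, as actel does
}
```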

SimonDedman commented 3 years ago

See "gbm.auto team assemble!" email from me, 2021-10-07, and /home/simon/Dropbox/Galway/Analysis/R/gbm.auto/Gbm.auto_extras/Report_Statistics_Explainer_Improvements.ods

Chuck:

I've actually found the eval metrics as they exist right now to be really helpful (especially now that I have to communicate with people who came up on the industry side - they seem much more comforted by the true/false positive/negative metrics than, say, CV and AUC). I do like the idea of including training vs. CV AUC in there, since it's becoming a more regularly reported thing and I don't know that I've ever had a reviewer not ask me about overfitting in a BRT paper. I think seeing the CV statistics is really interesting and could be helpful in explaining how the model "picked" the best performing set of parameters.

I wonder if the idea of compiling all the reported eval metrics is worth putting a paper together? I'm sure none of us have a ton of time to devote to such a thing right now, but it would definitely be handy to have a paper on the best practices in evaluating BRT performance available to cite.

Bonnie:

The other thing I guess I am fuzzy on is making it clear that you want models with the lowest deviance; however, this isn't the same as deviance explained relative to the null. So in report.csv I see that the CV mean deviance is lowest in the best model (yay!), but in the MLEvalMetrics at the bottom there is "dev" and I don't know if I need to do anything with that number.

It might be good to note if certain metrics aren't available for an abundance-only model (an issue I had before; I still need to run that analysis). I couldn't find anywhere that explicitly said you wouldn't get an AUC/ROC for an abundance model, but I have the email thread where we agreed we didn't think you could get it if you didn't run the binary model.

That spreadsheet now here: https://docs.google.com/spreadsheets/d/1I43q-PAEGY97Ho_xZpZ5gbBjGgJodLzZGcaZuwaNUuQ/edit?usp=sharing

SimonDedman commented 3 years ago

see /home/simon/Dropbox/PostDoc Work/Machine Learning & Evaluation/ML&E Notes.docx

This file is now in the gbm.auto extras subfolder in this GitHub repo.

SimonDedman commented 3 years ago

see also https://github.com/adamlilith/enmSdm#model-evaluation & more sections

SimonDedman commented 3 years ago

and https://rspatial.org/raster/sdm/5_sdm_models.html#model-evaluation

ChuckBangley commented 3 years ago

RE: training vs. CV AUC. Is there a hard and fast rule as to how much of a difference between the two constitutes overfitting? Or is it something we can tease out of the literature? If so, it may be worth establishing some sort of threshold to make it obvious when overfitting is likely to be occurring.
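I'm not aware of a fixed threshold, but both values are available from the fitted object, so a user-settable cutoff could be flagged. A hedged sketch (element names as in dismo::gbm.step for binomial models; the 0.05 default is an arbitrary placeholder, not a literature-derived threshold):

```r
# Flag a possible-overfitting gap between training AUC and cross-validated AUC
auc_gap <- function(model, cutoff = 0.05) {
  train_auc <- model$self.statistics$discrimination     # AUC on the training data
  cv_auc    <- model$cv.statistics$discrimination.mean  # mean AUC across CV folds
  gap <- train_auc - cv_auc
  if (isTRUE(gap > cutoff)) {
    warning("Training AUC exceeds CV AUC by ", round(gap, 3),
            "; possible overfitting")
  }
  c(train_auc = train_auc, cv_auc = cv_auc, gap = gap)
}
```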

BonnieAhr commented 3 years ago

clarification in the documentation that gbm.loop is only used when using the predictive mapping option (i.e. grid = TRUE). (Correct me if I'm wrong - still learning).

SimonDedman commented 3 years ago

@BonnieAhr you don't need to run predictive mapping to use gbm.loop; it'll generate the summaries of the various report outputs regardless.

BonnieAhr commented 3 years ago

Under the Best Binary BRT column in report.csv, add the number of trees as well. Actually, maybe put all of the info for the best BRT in one column (like deviance and CV correlation) so I don't have to scroll over to find it and check 100 times that I'm looking at the right one. (Potentially even a flag if under 1000 trees.) If not in the output, maybe also include a warning that pops up at the end of the run if the best model has under 1000 trees.
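A minimal sketch of that end-of-run warning, assuming a fitted dismo::gbm.step object (which stores the chosen tree count in gbm.call$best.trees):

```r
# Warn if the best model settled on fewer than 1000 trees
if (model$gbm.call$best.trees < 1000) {
  warning("Best model used only ", model$gbm.call$best.trees,
          " trees; Elith et al. (2008) recommend at least 1000. ",
          "Consider lowering the learning rate (lr).")
}
```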

SimonDedman commented 3 years ago

<1000 trees warning: might be easier to put that in the report document, i.e. the notes/context section.

Bin/Gaus best extra stats: L934, 942, 959, 974: edit these to feature basically everything (single values only) from Col B of https://docs.google.com/spreadsheets/d/1I43q-PAEGY97Ho_xZpZ5gbBjGgJodLzZGcaZuwaNUuQ/edit#gid=141061300 ?

Edit: Self_CV_Statistics.csv has all the CV & self-statistics table contents for each model. Report has the number of trees. Maybe just add to bin_best: trees: 2250, CV Mean Deviance: 0.660950842944151, CV Deviance SE: 0.0253367448140445, CV Mean Correlation: 0.671531063164227, CV Correlation SE: 0.0159719292986984, and training AUC − CV AUC as an overfitting indicator.

Then explain everything in the separate document.

SimonDedman commented 2 years ago

[image: Elith et al. 2008 working guide SM] All elements now encapsulated in the Google Docs table.

In the gbm.step outputs, they list (presumably based on importance/relevance):

SimonDedman commented 2 years ago

How to understand/read the ML evaluation plots, what to look for. Have all the explanation in a document that lives in gbm.auto somewhere, maybe as a dataframe that people can call with "data"?? Or in the vignettes?

SimonDedman commented 2 years ago

Lit review started here: https://docs.google.com/document/d/1DybTZs6j4rUWaIBlbIcobN8839kK53nP873-mi3wt9k/edit?usp=sharing Please add your stuff. I'll finish adding papers then populate out their sections.

SimonDedman commented 2 years ago

2019.10.25 Lies, Damned Lies, and Accuracy Metrics in Machine Learning.odt
2021-03-02_Ashley_Jester_Stats_Consultation.odt
Evaluating Machine Learning Models: https://learning.oreilly.com/library/view/evaluating-machine-learning/9781492048756/

SimonDedman commented 2 years ago

Conventional machine learning vs deep learning: classifying goliath grouper inertial measurement unit data into behaviours. [Matthews Correlation Coefficient, Cohen's Kappa]. 25 https://bls.econference.io/public//main/sessions/3679 Lauran Brewster.

Ask Lauran about this.

SimonDedman commented 2 years ago

[image: my image from 2021-03-02_Ashley_Jester_Stats_Consultation.odt, referenced above]

SimonDedman commented 2 years ago

https://www.statology.org/balanced-accuracy/

SimonDedman commented 2 years ago

The correlation between predicted habitat values (averaged to 5° x 5° spatial resolution) and CPUE was calculated using a modified t-test from the R package ‘SpatialPack’ (Osorio et al., 2020), which accounts for the inflation of correlation significance caused by spatial autocorrelation.

Jon Dale paper
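A hedged sketch of that test, assuming pred (mean predicted habitat value per 5° cell), cpue (observed CPUE per cell), and coords (a two-column matrix of cell-centre coordinates):

```r
# Dutilleul's modified t-test for correlation under spatial autocorrelation,
# via SpatialPack; the variable names here are placeholders.
library(SpatialPack)
mt <- modified.ttest(x = pred, y = cpue, coords = coords)
mt  # prints the correlation and the autocorrelation-corrected p-value
```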

SimonDedman commented 2 years ago

https://mlr.mlr-org.com/articles/tutorial/measures.html

SimonDedman commented 2 years ago

https://scikit-learn.org/stable/modules/model_evaluation.html see also https://scikit-learn.org/stable/model_selection.html#model-selection & generally https://scikit-learn.org/stable/index.html

SimonDedman commented 2 years ago

Elith 2008: “Predictive performance should not be estimated on training data, but results are provided in Table 3 to show that BRT overfits the data, regardless of careful model development. [...] Statistics on predictive performance can be estimated from the subsets of data excluded from model fitting”, so: CV Mean Correlation | cv.statistics$correlation.mean = cv.cor ? Even then: how does this compare to e.g. GAMs?

Abeare thesis: “For each of the fitted models, the pseudo-R2, or D2, was calculated for comparison, where: D2 = 1 – (residual deviance/total deviance)”

"Lastly, the predictive performance of the four statistical modeling techniques was evaluated. Root mean square errors (RMSE) were calculated for predictions of log(CPUEyft) made by the fitted models, using the predictor values of the test dataset. Paired t-tests were then used to test the actual versus predicted log(CPUEyft), whereas F-tests were used to test for differences in the variance of predictions between the modeling techniques." pdf p57 barplots of RMSE & D2 for GAM GLM BRT, but p90 p91 no code for these plots.

RMSE: https://www.statology.org/how-to-calculate-rmse-in-r/ Calculated manually; needs predictions vs observations, i.e. a manual test/train split. Could it be calculated internally by gbm.step, per k-fold CV?
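A short sketch tying those together, assuming obs/pred are observed and predicted values on held-out data and model is a dismo::gbm.step output (element names as in dismo; gbm.step.sd may name them slightly differently):

```r
# Abeare-style pseudo-R2 (D2) from the CV deviance and the null (total) deviance
d_squared <- 1 - (model$cv.statistics$deviance.mean / model$self.statistics$mean.null)

# RMSE on a held-out test set, calculated manually
rmse <- sqrt(mean((obs - pred)^2))
# equivalently: Metrics::rmse(obs, pred)
```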

SimonDedman commented 2 years ago

Hastie, T., R. Tibshirani, and J.H. Friedman, 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York

gbm.step metrics: self.statistics$correlation = self.cor: runs predict.gbm using the gbm object created from all user values including all data, and predicts back to all the data. cv.statistics$correlation.mean = cv.cor: mean of the correlation (cor()) across each of the k CV folds. Should definitely use cv.cor, NOT self.cor.

cv.statistics$deviance.mean = cv.dev = dismo::calc.deviance() for each of the k folds: "Function to calculate deviance given two vectors of observed and predicted values. Requires a family argument which is set to binomial by default." Not very helpful. The number can be >1.

Added rmse to gbm.step.sd; could also add loads of other Metrics package metrics.
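For example, a few candidate additions on a fold's observed (obs) and predicted (pred) values, sketched with the Metrics package plus dismo's own deviance function:

```r
library(Metrics)
rmse(obs, pred)  # root mean square error (now added to gbm.step.sd)
mae(obs, pred)   # mean absolute error
auc(obs, pred)   # AUC; only meaningful for a 0/1 response

# dismo's per-fold deviance, binomial family by default
dismo::calc.deviance(obs, pred, family = "binomial", calc.mean = TRUE)
```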

SimonDedman commented 2 years ago

Abeare: D2 = 1 − (residual deviance / total deviance). Here cv.statistics$deviance.mean = cv.dev and self.statistics$null = total.deviance; this is now cv.stats$d.squared in gbm.step.sd.

SimonDedman commented 2 years ago

Need to work out how to do these for GLM & GAM still. And re-run the models, e.g. for FI paper.

SimonDedman commented 2 years ago

Already added deviance explained relative to null % in report. gbm.auto v1.5.7

SimonDedman commented 1 year ago

Report to add:

Add package version used somewhere. Not report, separate file?

packageVersion("gbm.auto") [1] ‘2023.3.12’ packageVersion("dismo") [1] ‘1.3.10’

Similarly sample sizes for bin & gaus?

Could put both in MLEvalMetricsBin.csv, but that relies on running Bin AND MLEvaluate == TRUE. Just add another file? Best place is in the Report: just before you save the report (L1184), add a final column with gbm.auto version, dismo version, sample size fam1 {sample size fam2}.

could add ZI% under ZI T/F?

DONE

SimonDedman commented 1 year ago

If doing predictions, once we have the predictions, also predict back to the input data, then run the ML eval metrics on those results: RMSE & % deviance explained (possible?) for continuous predictions, AUC & TSS for binary.
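A hedged sketch of those checks, assuming obs_bin/pred_bin are the 0/1 response and predicted probabilities, obs_abund/pred_abund the continuous response and predictions, and a placeholder 0.5 threshold for the confusion matrix (threshold choice is a separate question):

```r
# continuous predictions
rmse <- sqrt(mean((obs_abund - pred_abund)^2))

# binary predictions
auc  <- Metrics::auc(obs_bin, pred_bin)
cm   <- table(observed = obs_bin, predicted = as.integer(pred_bin >= 0.5))
sens <- cm["1", "1"] / sum(cm["1", ])  # sensitivity (true positive rate)
spec <- cm["0", "0"] / sum(cm["0", ])  # specificity (true negative rate)
tss  <- sens + spec - 1                # true skill statistic
```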

mhpob commented 11 months ago

Just chiming in here with an implementation suggestion: I've found great utility in putting a template rmd/qmd file in inst and providing all of the info via quarto::quarto_render(..., execute_params = ...) or rmarkdown::render(..., params = ...). @hugomflavio did an excellent job putting the raw RMarkdown into a function; it just wound up being tough for me to follow/debug in my own code!

Example boilerplate: https://github.com/mhpob/otndo/tree/main/inst/qmd_template Example rendering function/call: https://github.com/mhpob/otndo/blob/main/R/make_receiver_push_summary.R#L163-L189
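A minimal sketch of that pattern for gbm.auto, assuming a hypothetical parameterised template shipped at inst/rmd_template/gbm_report.Rmd with a report_csv parameter declared in its YAML header:

```r
# Locate the installed template, copy it to tempdir() (never write inside the
# package library), then render it with run-specific parameters.
template  <- system.file("rmd_template", "gbm_report.Rmd", package = "gbm.auto")
work_copy <- file.path(tempdir(), basename(template))
file.copy(template, work_copy, overwrite = TRUE)
rmarkdown::render(work_copy,
                  params = list(report_csv = "Report.csv"),
                  quiet  = TRUE)
```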

SimonDedman commented 11 months ago

https://jmlondon.github.io/pathroutr/articles/akharborseal_demo.html — another great Josh London example.