NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Seed number in base learner fitting #244

Closed: sigmafelix closed this issue 8 months ago

sigmafelix commented 8 months ago

I am working on the base learners, and fitting random forests (with ranger) and XGBoost (with xgboost) is nearly complete. Since these algorithms rely on randomization and we will reuse a fitted model object to predict at more than 8 million points, I think we need to set a seed number in a config file or a pipeline setting so that the fitted models are reproducible.
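A minimal sketch of what I mean, assuming the seed lives in a hypothetical `config` list (it could equally come from a YAML file or a pipeline option) and using `mtcars` as stand-in data. `fit_rf()` is an illustrative helper, not an existing beethoven function. I fix `num.threads = 1` here to keep the comparison simple:

```r
library(ranger)

# Hypothetical config entry; the name and value are placeholders.
config <- list(seed = 202401L)

# Illustrative helper: fit a random forest with an explicit seed.
fit_rf <- function(data, seed) {
  ranger(
    mpg ~ .,
    data = data,
    num.trees = 100,
    num.threads = 1, # single thread to keep the comparison simple
    seed = seed      # ranger's own seed argument
  )
}

fit_a <- fit_rf(mtcars, config$seed)
fit_b <- fit_rf(mtcars, config$seed)

# With the same seed and data, the out-of-bag predictions should agree.
same_fit <- isTRUE(all.equal(fit_a$predictions, fit_b$predictions))
print(same_fit)
```

For xgboost, my understanding is that the R package draws on R's own RNG when no seed is given, so calling `set.seed()` right before training is the usual approach. That is worth confirming against the version we have installed.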

kyle-messier commented 8 months ago

@sigmafelix Coincidentally I just came across this LinkedIn post yesterday about tuning XGBoost models.

It's great that you are starting on some base learner models. However, beginning next week, I think we should prioritize (1) renaming/refactoring function names as we have done for the download.R functions, and (2) setting up the targets-package pipeline. targets should handle your seed number and config file concerns.

sigmafelix commented 8 months ago

@Spatiotemporal-Exposures-and-Toxicology The post will be very helpful for streamlining the base learner fitting process. Thank you for sharing it. For renaming/refactoring, I think we already follow an implicit naming convention in which the functions in an R file (i.e., in ./R) have names starting with the R file name. Perhaps we need to make it explicit to everyone at next week's meeting.

sigmafelix commented 8 months ago

My comment in #191 includes an example of _targets.R where a seed number is set for the entire pipeline. We should check whether the seed setting also applies throughout the multithreaded calculations (i.e., covariate calculation).
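For reference, a pipeline-wide seed could look roughly like the sketch below. This is not the example from #191. It assumes a targets version whose `tar_option_set()` accepts a `seed` argument, and the target names, data, and seed value are all hypothetical:

```r
# _targets.R (sketch; target names and the seed value are placeholders)
library(targets)

tar_option_set(
  packages = c("ranger"),
  seed = 202401L # pipeline-wide seed; targets derives a per-target seed from it
)

list(
  # Stand-in for the real training data target.
  tar_target(train_data, mtcars),
  # No explicit seed here, so ranger takes its seed from R's RNG,
  # which targets sets before running each target.
  tar_target(
    fit_rf,
    ranger::ranger(mpg ~ ., data = train_data, num.trees = 100, num.threads = 1)
  )
)
```

The open question is code that starts its own workers inside a target. Those worker processes may not inherit the per-target seed, so the covariate calculation may need its own parallel-safe RNG setup.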