kyle-messier opened 3 months ago
The current solution, and the one I recommend, is running the entire pipeline on the GEO cluster. With ~2TB of scratch, we can read in large dataframes and fit the models. While this is not an elegant solution, I think it will get the job done.
GEO does have native Spark and Hadoop installed, and R has the sparklyr package (https://spark.posit.co/), which handles reads and writes more intelligently in a distributed manner.
This approach will likely take a bit more code and infrastructure development on our part, which is why I suggest we hold off for now. This is a future extension, and we can get additional support to help extend it in a truly more scalable fashion.
I also used to work with the foreach package for core dispatching. It may be useful for parallelizing over cross-validation sets.
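The core-dispatching pattern foreach enables can be sketched as below. To keep the example self-contained, this uses base R's parallel package (which ships with R); the same loop could be written with foreach's `%dopar%` and a registered backend. The data and model here are toy placeholders, not the pipeline's.

```r
# Parallel dispatch of CV folds across cores; base R's parallel package
# stands in for a foreach backend. Toy data and model.
library(parallel)

set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)
folds <- sample(rep(1:5, length.out = nrow(dat)))

cl <- makeCluster(2)
clusterExport(cl, c("dat", "folds"))
cv_rmse <- parSapply(cl, 1:5, function(k) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  fit   <- lm(y ~ x, data = train)
  sqrt(mean((predict(fit, test) - test$y)^2))
})
stopCluster(cl)
cv_rmse  # one RMSE per held-out fold
```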
@kyle-messier I agree on prioritizing the model and pipeline building now. Another consideration in dealing with large feature data -- When the base models are fitted, I think that space-time cross validation set generation should be based on space-time coordinates only, rather than using the full dataset.
@eva0marques We are currently relying on the future framework for parallelization since it has a variety of branched-out packages for automatic SLURM job submission. I think using foreach is a convenient and straightforward way to parallelize heavy workloads. We can discuss more about parallelization backends throughout the pipeline soon.
@sigmafelix When you say "When the base models are fitted, I think that space-time cross validation set generation should be based on space-time coordinates only, rather than using the full dataset." Do you mean subsampled models or something else? We could also limit ourselves to purely spatial cross-validation as opposed to space-time CV.
@kyle-messier I meant we will use a compact three-variable data.frame, sf, or SpatVector (I believe sf is what we will use, as spatialsample uses sf objects) to get an rset object. I think the full 2M*3.8K sf object is too big to handle even on GEO. An rset created by rsample::vfold_cv is a nested tibble object, which will end up creating a massive tibble or sf with 20M effective rows. That said, we can utilize a reduced sf object to create an rset object containing row indices for each training-test split, then read the necessary rows out of it for model fitting.
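The index-only idea above can be sketched in base R (rsample/spatialsample are the intended tools, but the principle is the same): build folds from a compact coordinate table and carry only row indices, so the split object's size is independent of the 3.8K feature columns. Coordinates here are simulated placeholders.

```r
# Index-only CV splits built from a compact (lon, lat, time) table.
# The fold object stores row indices, never the 2M x 3.8K feature matrix.
set.seed(42)
coords <- data.frame(
  lon  = runif(1000, -124, -67),
  lat  = runif(1000, 25, 49),
  time = sample(seq(as.Date("2018-01-01"), as.Date("2022-12-31"), by = "day"),
                1000, replace = TRUE)
)

k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(coords)))
splits <- lapply(1:k, function(i) {
  list(analysis = which(fold_id != i), assessment = which(fold_id == i))
})

# At fit time, the heavy feature columns are read from disk for just the
# rows in splits[[i]]$analysis.
print(object.size(splits), units = "Kb")  # a few KB regardless of P
```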
Thanks @sigmafelix. It looks like rset is basically bootstrapping within the rsample package. In the end, if getting the 20M+ rows plus large P into GEO to model is the main computational bottleneck - i.e. RAM - then I'm not opposed to splitting the base-learner-to-meta-learner process into smaller, overlapping regions. The spatialsample package could be helpful in that respect. Taking the ensemble approach we are adopting even further, we could fit, say, 10-50 spatial subsamples that are each about 50 percent of the spatial domain. These are still large models that should be relatively stable, but will hopefully bring us within the RAM capacity of GEO.
@sigmafelix If we take a divide-and-conquer strategy to model fitting, we could likely use the par_grid family of functions you made in chopin. Is there a way to introduce some randomness or variation if we want to make, say, 5 or 10 grid sets such that a given region has multiple submodels? The padding will help with that too.
@kyle-messier Is randomness for variations in grid configuration across models or irregular grid generation? The latter is possible with mode="grid_quantile" in the current version. For the former, I think it requires a new function since par_make_grid takes any sf/SpatVector input and uses its extent to generate grids.
@sigmafelix I was thinking of irregular grid generation - perhaps like k-means. But now that I think about it more, I think a spatial-block-CV approach would take care of it. For example, if the US domain were partitioned into 10 sections, then each section is fit and predicted on 9 models. This CV is at the highest level, and I would not really consider it CV, but more of a partitioning to deal with computation. The spatial or temporal CV we develop will be within each of these. Let me know what you think about this approach.
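One reading of the 10-section scheme sketched above, in base R: fit one model per section, then predict each section by averaging the 9 models trained on the other sections. Data and model are toy placeholders; the real partitions would be spatial blocks.

```r
# Leave-one-partition-out: 10 sections, one toy model per section;
# each section is predicted by the 9 models not trained on it.
set.seed(3)
n <- 1000
dat <- data.frame(x = rnorm(n))
dat$y <- 1.5 * dat$x + rnorm(n)
section <- sample(rep(1:10, length.out = n))

models <- lapply(1:10, function(s) lm(y ~ x, data = dat[section == s, ]))

pred <- numeric(n)
for (s in 1:10) {
  idx <- which(section == s)
  # predictions from the 9 out-of-section models, averaged:
  preds <- sapply(setdiff(1:10, s), function(j) predict(models[[j]], dat[idx, ]))
  pred[idx] <- rowMeans(preds)
}
sqrt(mean((pred - dat$y)^2))  # out-of-partition RMSE
```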
Sorry @sigmafelix - that wouldn't do much for dealing with the RAM computational issues. We'd want the partitioned training sets to be ~50% of the size. The general idea was to ensure that every location still gets multiple partitioned models trained and validated on it. Perhaps it could be done with the regular grid, plus variations of the grid_merge.
@kyle-messier I agree on generating irregular grids since the site locations are unequally distributed. We could make grids partially overlapping such that many locations will get multiple partitioned models; however, some will get a single model unless we set a large overlapping distance. If we only consider spatial partitioning, the sample size per grid will be substantially reduced, given that one site has 3.8K*1.9K data elements.
@sigmafelix In chopin, does your par_cut_coords function have a randomness component to it? Looking at the code, it does not appear so; thus, running on the same group of coordinates will always produce the same set.
There is a package, anticlust (https://cran.r-project.org/web/packages/anticlust/index.html), that can create roughly equal-sized clustering groups. We could potentially calculate, say, 10-50 groupings of 2 (i.e. 50% of the data) and pick the top 5 or 10 that ensure all of the AQS points are in, say, 4/5 or 9/10 of the groups.
In terms of partitioning, do you think cutting our overall spatial sample size in half will allow us to run the model on GEO relatively easily?
@kyle-messier anticlust seems to operate conversely to k-means. Do we want to maximize between-group similarity?
For balancing the group sizes, I think we can consider making a "seed" set of clusters from k-means or k-medoids then rotate the membership along the boundary of two adjacent clusters.
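The "seed clusters, then rotate membership" idea above can be sketched with a greedy pass in base R: run k-means, then repeatedly move the boundary point of an oversized cluster into its nearest undersized cluster until all sizes are even. This is a simplification, not a full implementation.

```r
# k-means seed clusters rebalanced to perfectly even sizes by greedily
# reassigning boundary points from oversized to undersized clusters.
set.seed(11)
pts <- cbind(runif(300), runif(300))
k <- 6
target <- nrow(pts) / k  # 50 points per cluster

km   <- kmeans(pts, centers = k, nstart = 5)
memb <- km$cluster
# point-to-center distance matrix (rows = points, cols = centers):
d2 <- as.matrix(dist(rbind(km$centers, pts)))[-(1:k), 1:k]

repeat {
  sizes <- tabulate(memb, k)
  over  <- which(sizes > target)
  under <- which(sizes < target)
  if (length(over) == 0) break
  # pick the point in an oversized cluster closest to an undersized center
  cand <- which(memb %in% over)
  best <- cand[which.min(apply(d2[cand, under, drop = FALSE], 1, min))]
  memb[best] <- under[which.min(d2[best, under])]
}
tabulate(memb, k)  # all clusters now have exactly `target` members
```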
My calculation is that the full dataset will be ~30GB (if the precision is limited to 4 bytes) to ~60GB (8 bytes), which is well below GEO's memory limit. Memory consumption of each model will be the key factor in the feasibility of model fitting. I will test some with my laptop, then get an estimation.
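For reference, the arithmetic behind those figures, using the row and covariate counts quoted in this thread:

```r
# Back-of-the-envelope memory estimate for the full feature matrix.
n_obs <- 1058 * 365.2 * 5   # sensors * days/year * years = 1,931,908 rows
p     <- 3844               # covariates
gb_4  <- n_obs * p * 4 / 1e9  # 4-byte (single precision) storage
gb_8  <- n_obs * p * 8 / 1e9  # 8-byte (double precision) storage
round(c(four_byte_GB = gb_4, eight_byte_GB = gb_8), 1)  # ~29.7 and ~59.4
```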
@sigmafelix Ok, that back-of-the-envelope calc is not as big as I thought. Nonetheless, yes, my thought was that something like anticlust would make it easy to create 10 sets of roughly equal partitions of the whole US dataset into 33% or 50% of the total size. That would ensure that every point is still included an equal number of times in the high-level ensemble. Perhaps just doing some high-level k-means/k-medoids as you suggest is easier and good enough.
anticlust::balanced_clustering creates perfectly even clusters. As you mentioned, it maximizes between-cluster similarity. It is also deterministic, as it always starts with the center cluster. So an easy way to improve computational efficiency with a divide-and-conquer approach would be to:
So it's a bit ad hoc, but it would add some non-stationarity to the model while also reducing the RAM burden (it may take longer, but that is okay). We can discuss more when @sigmafelix is back in a couple of weeks.
@dzilber Do you know a way we can approximate the memory footprint of the model fitting process?
@kyle-messier One part of memory complexity is just a count of all the parameters of the model. The trickier part is keeping track of the auxiliary terms you need, like gradient or error vectors or Hessian matrices. The implementation can have a big effect on the memory cost of fitting the model. For example, you can keep track of all of the gradients for a neural network with each iteration, or you can just save the gradients from the last layer that was updated. Since we are using off the shelf packages, we might have to check their documentation or run some experiments and extrapolate.
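The "run some experiments and extrapolate" route suggested above can be sketched as: fit the model at several small sample sizes, measure the fitted object with object.size, and fit a line to estimate per-row cost. lm is a stand-in for the actual base learners (RF/XGB/ANN), whose scaling would need the same treatment with their own packages.

```r
# Empirical memory extrapolation: measure fitted-object size at small n,
# then extrapolate the linear trend to the full dataset. lm as a stand-in.
sizes  <- c(1000, 2000, 4000, 8000)
mem_mb <- sapply(sizes, function(n) {
  d <- data.frame(x = rnorm(n))
  d$y <- d$x + rnorm(n)
  as.numeric(object.size(lm(y ~ x, data = d))) / 2^20  # MB
})
fit <- lm(mem_mb ~ sizes)
coef(fit)  # intercept (fixed overhead) + slope (MB per row), to extrapolate
```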
@kyle-messier Got it. We could implement that approach in the pipeline. I think I missed something in 2. Is 5 randomly chosen from 10, or does it mean something else? We have three base learners (RF/XGB/ANN; maybe +GP = 4 base learners) and ten equal-size clusters, which made me a little confused about how to interpret the combination step.
If we take the bootstrap resampling strategy (where each bootstrap is M samples with M << N), then each bootstrap can also be passed to the meta-learner. Whether we use the stacks package directly or not, a similar strategy could be employed:
https://stacks.tidymodels.org/index.html
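The M-out-of-N bootstrap plus meta-learner flow can be sketched in base R. Here lm stands in for the base learners and a plain least-squares combination for the meta-learner (stacks actually uses a non-negative, penalized fit); data are simulated placeholders.

```r
# m-out-of-n bootstrap ensemble feeding a simple stacking meta-learner.
set.seed(5)
N <- 10000; M <- 1000; P <- 8
dat <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
dat$y <- dat$x1 - 0.5 * dat$x2 + rnorm(N)

# Fit P base models, each on an M-out-of-N bootstrap sample:
models <- lapply(seq_len(P), function(i) {
  idx <- sample(N, M, replace = TRUE)
  lm(y ~ x1 + x2, data = dat[idx, ])
})

# Held-out base-model predictions form the meta-learner's design matrix:
hold <- sample(N, 2000)
Z    <- sapply(models, function(m) predict(m, newdata = dat[hold, ]))
meta <- lm(dat$y[hold] ~ Z)  # stacks would use a constrained/penalized fit
sqrt(mean(residuals(meta)^2))  # stacked out-of-bootstrap RMSE
```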
```mermaid
graph TB;
style P1 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
style P2 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
style P3 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
style P4 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
style P5 fill:#91bcfd, stroke:#333, stroke-width:2px, rounded:true;
P1[Model Input] --> |Fit P bootstrap samples| P2[MLP Models];
P1[Model Input] --> |Fit P bootstrap samples| P3[xgboost Models];
P1[Model Input] --> |Fit P bootstrap samples| P4[elastic net Models];
P2 --> P5[Meta Learner];
P3 --> P5;
P4 --> P5;
```
Hi All, @eva0marques @sigmafelix @mitchellmanware @dzilber @larapclark @MAKassien @Sanisha003
Following up on our discussion from today on a sampling strategy to alleviate memory pressure and runtime. @eva0marques, @sigmafelix, and myself outlined a strategy that embraces the multiple-learners approach and fits in with the S-T cross-validation strategies. The multiple models will also take advantage of dynamic branching in targets.
Two pictures show our schematic - the second one with some of our random notes wiped away. In summary,
@kyle-messier @eva0marques @mitchellmanware
Related to cross-validation strategies -- I added a function extending the previous generate_cv_index approach by employing anticlust::balanced_clustering and various preprocessing options (raw, normalization, and standardization) to get partially overlapping space-time groups. The result with the standardization option on the full dataset looks like the figure below:
Some distant subclusters or "seeds" are grouped into the same cluster because I am using the mean space-time coordinates of each seed to pick the top 10 closest pairs. We could discuss more about this implementation.
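The seed-pairing step described above (mean space-time coordinates per seed, then pairing the closest centroids) can be sketched in base R with toy data; the actual implementation works on anticlust seeds over standardized coordinates.

```r
# Greedily pair seed clusters by distance between their mean space-time
# coordinates. Toy centroids stand in for the anticlust seed summaries.
set.seed(9)
n_seeds  <- 20
centroid <- cbind(x = runif(n_seeds), y = runif(n_seeds), t = runif(n_seeds))
D <- as.matrix(dist(centroid))
diag(D) <- Inf  # a seed cannot pair with itself

pairs <- list()
left  <- 1:n_seeds
while (length(left) > 1) {
  sub <- D[left, left, drop = FALSE]
  ij  <- which(sub == min(sub), arr.ind = TRUE)[1, ]  # closest remaining pair
  pairs[[length(pairs) + 1]] <- left[ij]
  left <- left[-ij]
}
length(pairs)  # 20 seeds -> 10 space-time groups of paired seeds
```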
@kyle-messier So do we have an exact P value for model fitting? As mentioned in my comments above, the spatiotemporal grouping with overlaps is implemented, so the other two ways need to be added to the package; then we're good to go for the model fitting phase.
Some discussion of big data considerations for the beethoven pipeline. As @eva0marques, @sigmafelix, and others have pointed out: "The problem: we have 1058 sensors * 365.2 days * 5 years = 1,931,908 observations with 3844 covariates. I unsurprisingly get an error message 'Error: vector memory exhausted (limit reached?)'"