NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

Big Data Considerations #325

Open kyle-messier opened 3 months ago

kyle-messier commented 3 months ago

Some discussion of big data considerations for the beethoven pipeline. As @eva0marques, @sigmafelix, and others have pointed out: "The problem: we have 1058 sensors × 365.2 days × 5 years = 1,931,908 observations with 3844 covariates. I unsurprisingly get the error message 'Error: vector memory exhausted (limit reached?)'."

kyle-messier commented 3 months ago

Current Solution: GEO cluster

The current solution, and the one I recommend, is running the entire pipeline on the GEO cluster. With ~2 TB of scratch space, we can read in large dataframes and fit the models. While this is not the most elegant solution, I think it will get the job done.

Spark and Sparklyr

GEO has native Spark and Hadoop installations, and R has the sparklyr package (https://spark.posit.co/). This will handle reads and writes more intelligently in a distributed manner.
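
As a rough illustration of what the sparklyr route could look like (the Spark master, file path, and column names here are placeholders, not part of the current pipeline):

```r
# Minimal sparklyr sketch: keep the big feature table in Spark and only
# collect small summaries into R. Path and column names are hypothetical.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")   # or "local" for testing

# Read the covariate table lazily into Spark instead of into R memory
feat <- spark_read_parquet(sc, name = "features",
                           path = "/scratch/beethoven/features.parquet")

# Heavy wrangling happens on the Spark side; collect() pulls only the result
feat %>%
  group_by(site_id) %>%
  summarise(n_obs = n()) %>%
  collect()

spark_disconnect(sc)
```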

This approach will likely take a bit more code and infrastructure development on our part, which is why I suggest we hold off for now. It is a future extension, and we can get additional support to help extend the pipeline in a truly more scalable fashion.

eva0marques commented 3 months ago

I have also worked with the foreach package to handle core dispatching. It may be useful for parallelizing over the cross-validation sets.
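
For illustration, a minimal foreach/doParallel sketch of dispatching CV folds across cores (the per-fold body is a stand-in, not actual beethoven code):

```r
# Dispatch 10 CV folds across 4 local cores with foreach
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

cv_scores <- foreach(fold = 1:10, .combine = c) %dopar% {
  # stand-in for fitting and scoring one fold
  fold^2
}

stopCluster(cl)
```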

sigmafelix commented 3 months ago

@kyle-messier I agree on prioritizing the model and pipeline building for now. Another consideration in dealing with large feature data: when the base models are fitted, I think that space-time cross validation set generation should be based on space-time coordinates only, rather than using the full dataset.

@eva0marques We are currently relying on the future framework for parallelization, since it has a variety of branched-out packages for automatic SLURM job submission. I think using foreach is a convenient and straightforward way to parallelize heavy workloads as well. We can discuss parallelization backends throughout the pipeline soon.
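
A minimal sketch of the future-based setup described here (the SLURM template file name is an assumption; nothing like it exists in the repo yet):

```r
# future-based parallelism: the same code can run locally or via SLURM
library(future)
library(future.apply)

plan(multisession, workers = 4)   # local/interactive parallelism

# On GEO via SLURM (uncomment; needs future.batchtools and a template file):
# library(future.batchtools)
# plan(batchtools_slurm, template = "batchtools.slurm.tmpl")

res <- future_lapply(1:10, function(i) i^2)   # stand-in for a heavy per-fold task
```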

kyle-messier commented 2 months ago

@sigmafelix When you say "When the base models are fitted, I think that space-time cross validation set generation should be based on space-time coordinates only, rather than using the full dataset." Do you mean subsampled models or something else? We could also limit ourselves to purely spatial cross-validation as opposed to space-time CV.

sigmafelix commented 2 months ago

@kyle-messier I meant that we will use a compact three-variable data.frame, sf, or SpatVector (I believe sf is what we will use, since spatialsample uses sf objects) to build the rset object. I think the full 2M × 3.8K sf object is too big to handle even on GEO. An rset created by rsample::vfold_cv is a nested tibble, which would end up creating a massive tibble or sf with 20M effective rows. That said, we can use a reduced sf object to create an rset containing the row indices for each training-test split, then read the necessary rows out of the full data for model fitting.
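
A sketch of the reduced-sf idea (full_df and its columns are placeholders): build the splits on coordinates only, keep the row indices, and pull covariates per split:

```r
# Build CV splits on a compact coordinates-only object, then use the stored
# row indices to subset the full covariate table. full_df is hypothetical.
library(sf)
library(rsample)

coords_sf <- st_as_sf(
  data.frame(lon = full_df$lon, lat = full_df$lat, time = full_df$time),
  coords = c("lon", "lat"), crs = 4326
)

cv <- vfold_cv(coords_sf, v = 10)

# Row indices of the analysis (training) set of the first split;
# complement(cv$splits[[1]]) gives the assessment-row indices
train_idx  <- cv$splits[[1]]$in_id
train_data <- full_df[train_idx, ]   # read only the needed rows of the big table
```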

kyle-messier commented 2 months ago

Thanks @sigmafelix. It looks like rset is basically the resampling (e.g., bootstrapping) construct within the rsample package. In the end, if getting the 20M+ rows plus large P into GEO to model is the main computational bottleneck - i.e. RAM - then I'm not opposed to splitting the base-learner-to-meta-learner process into smaller, overlapping regions. The spatialsample package could be helpful in that respect. Taking the ensemble approach we are adopting even further, we could fit, say, 10-50 spatial subsamples that each cover about 50 percent of the spatial domain. These would still be large models that should be relatively stable, but this would hopefully bring us within the RAM capacity of GEO.

kyle-messier commented 2 months ago

@sigmafelix If we take a divide-and-conquer strategy to model fitting, we could likely use the par_grid family of functions you made in chopin. Is there a way to introduce some randomness or variation if we want to make, say, 5 or 10 grid sets such that a given region has multiple submodels? The padding will help with that too.

sigmafelix commented 2 months ago

@kyle-messier Is the randomness for variation in grid configuration across models, or for irregular grid generation? The latter is possible with mode = "grid_quantile" in the current version. For the former, I think it requires a new function, since par_make_grid takes any sf/SpatVector input and uses its extent to generate grids.


kyle-messier commented 2 months ago

@sigmafelix I was thinking of irregular grid generation - perhaps something like k-means. But now that I think about it more, a spatial-block-CV approach would take care of it. For example, if the US domain were partitioned into 10 sections, then each section is fit and predicted on 9 models. This CV is at the highest level, and I would not really consider it CV so much as partitioning to deal with computation. The spatial or temporal CV we develop will sit within each of these. Let me know what you think about this approach.
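
One way to express this partitioning with spatialsample, assuming a leave-one-section-out reading (sites_sf is a placeholder sf of monitor locations):

```r
# 10 spatial sections: each model trains on 9 sections, with the remaining
# section held out; the inner S-T CV we design would run inside analysis().
library(spatialsample)
library(rsample)

partitions <- spatial_block_cv(sites_sf, v = 10)

split_1   <- partitions$splits[[1]]
train_1   <- analysis(split_1)     # 9 sections used for fitting
heldout_1 <- assessment(split_1)   # the remaining section
```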

kyle-messier commented 2 months ago

Sorry @sigmafelix - that wouldn't do much for the RAM issues. We'd want the partitioned training sets to be ~50% of the full size. The general idea was to ensure that every location still gets multiple partitioned models trained and validated on it. Perhaps it could be done with the regular grid, plus variations of grid_merge.

sigmafelix commented 2 months ago

@kyle-messier I agree on generating irregular grids, since the site locations are unequally distributed. We could make the grids partially overlapping so that many locations get multiple partitioned models; however, some will only get a single model unless we set a large overlap distance. If we only consider spatial partitioning, the sample size per grid will be substantially reduced, given that a single site carries 3.8K × 1.9K data elements.

kyle-messier commented 2 months ago

@sigmafelix In chopin, does your par_cut_coords function have a randomness component to it? Looking at the code, it does not appear so, so running it on the same group of coordinates will always produce the same set. There is a package, anticlust (https://cran.r-project.org/web/packages/anticlust/index.html), that can create clustering groups of roughly equal size. We could potentially compute, say, 10-50 groups of 2 (i.e. 50% of the data each) and pick the top 5 or 10 that ensure all of the AQS points appear in, say, 4/5 or 9/10 of the groups.

In terms of partitioning, do you think cutting our overall spatial sample size in half will allow us to run the model on GEO relatively easily?

sigmafelix commented 2 months ago

@kyle-messier anticlust seems to operate conversely to k-means. Do we want to maximize between-group similarity? For balancing the group sizes, I think we can consider making a "seed" set of clusters from k-means or k-medoids and then rotating the membership along the boundary of two adjacent clusters.

My calculation is that the full dataset will be ~30 GB (if precision is limited to 4 bytes) to ~60 GB (8 bytes), which is well below GEO's memory limit. Memory consumption of each model will be the key factor in the feasibility of model fitting. I will run some tests on my laptop and then get an estimate.
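
For reference, the arithmetic behind those numbers (a quick back-of-the-envelope in R, using the observation and covariate counts from the top of the thread):

```r
# ~1.93M observations x 3,844 covariates, in single vs. double precision
n_rows <- 1058 * 365.2 * 5        # ~1,931,908 observations
n_cols <- 3844                    # covariates

n_rows * n_cols * 4 / 1e9         # ~30 GB at 4 bytes per value
n_rows * n_cols * 8 / 1e9         # ~59 GB at 8 bytes per value
```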

kyle-messier commented 2 months ago

@sigmafelix OK, that back-of-the-envelope calculation is not as big as I thought. Nonetheless, my thought was that something like anticlust would make it easy to create 10 sets of roughly equal partitions of the whole US dataset at 33% or 50% of the total size. That would ensure that every point is still included an equal number of times in the high-level ensemble. Perhaps just doing some high-level k-means/medoids, as you suggest, is easier and good enough.

kyle-messier commented 2 months ago

anticlust::balanced_clustering creates perfectly even clusters. As you mentioned, it maximizes between-cluster similarity. It is also deterministic, since it always starts with the center cluster. So an easy way to improve computational efficiency with a divide-and-conquer approach would be to:

  1. Create K = 10 equal clusters
  2. Create 5 choose 3 (or maybe 2) models = 10 submodels. Each K group will be in 6 models (4 choose 2).
  3. Each of these partitioned submodels then goes through the base learner to meta learner
  4. Combine everything

So it is a bit ad hoc, but it would add some non-stationarity to the model while also reducing the RAM burden (it may take longer to run, but that is okay). We can discuss more when @sigmafelix is back in a couple of weeks.
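
To make the combination scheme concrete, here is a rough sketch under one reading of step 2 (the step @sigmafelix asks about below): assume the 10 balanced clusters are paired into 5 groups and each submodel trains on 3 of the 5 groups, so any given group appears in choose(4, 2) = 6 submodels. st_coords is a placeholder for the space-time coordinates.

```r
# Illustrative only: one reading of the cluster-then-combine scheme above.
library(anticlust)

cluster_id <- as.integer(balanced_clustering(st_coords, K = 10))  # 10 equal-size clusters

group_id  <- ceiling(cluster_id / 2)   # assumed pairing: 5 groups of 2 clusters each
submodels <- combn(5, 3)               # choose(5, 3) = 10 submodels; each group is in 6 of them

# Rows used by the first submodel (3 of the 5 groups)
rows_submodel_1 <- which(group_id %in% submodels[, 1])
```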

@dzilber Do you know of a way we can approximate the memory footprint of the model fitting process?

dzilber commented 2 months ago

@kyle-messier One part of the memory complexity is just a count of all the parameters of the model. The trickier part is keeping track of the auxiliary terms you need, like gradient or error vectors or Hessian matrices. The implementation can have a big effect on the memory cost of fitting the model. For example, you can keep track of all of the gradients for a neural network at each iteration, or you can just save the gradients from the last layer that was updated. Since we are using off-the-shelf packages, we might have to check their documentation or run some experiments and extrapolate.
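
One way to "run some experiments and extrapolate", as a rough sketch (the xgboost call, toy data, and subsample sizes here are illustrative, not the pipeline's actual setup):

```r
# Fit a base learner on increasing subsample sizes and record peak memory,
# then extrapolate Peak_RAM_Used_MiB versus n to the full data size.
library(peakRAM)
library(xgboost)

mem_profile <- lapply(c(1e4, 5e4, 1e5), function(n) {
  x <- matrix(rnorm(n * 30), ncol = 30)   # toy 30-covariate data
  y <- rnorm(n)
  peakRAM(
    xgb.train(params = list(objective = "reg:squarederror"),
              data = xgb.DMatrix(x, label = y), nrounds = 50)
  )
})

do.call(rbind, mem_profile)
```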

sigmafelix commented 2 months ago

@kyle-messier Got it. We could implement the approach in the pipeline. I think I missed something in step 2, though. Is the 5 randomly chosen from the 10, or does it mean something else? We have three base learners (RF/XGB/ANN; maybe +GP = 4 base learners) and ten equal-size clusters, which makes the combination step a little confusing to interpret.


kyle-messier commented 2 months ago

Some links on data and model memory usage:

  1. https://blogs.sas.com/content/iml/2014/04/28/how-much-ram-do-i-need-to-store-that-matrix.html
  2. https://blog.eleuther.ai/transformer-math/
sigmafelix commented 3 weeks ago

  1. Test run with the entire dataset on GEO: assess memory pressure, etc.
  2. Determine a data-splitting strategy
  3. Combine and ensemble

kyle-messier commented 3 weeks ago

  1. Try fitting the base learners on the full data set (brulee MLP, xgboost, elastic net)
  2. If needed, discuss a partitioning strategy

kyle-messier commented 3 weeks ago

If we take the bootstrap resampling strategy (where each bootstrap draws M samples with M << N), then each bootstrap can also be passed to the meta-learner. Whether or not we use the stacks package directly, a similar strategy could be employed: https://stacks.tidymodels.org/index.html

graph TB;

    style P1 fill:#91bcfd , stroke:#333, stroke-width:2px, rounded:true;
    style P2 fill:#91bcfd , stroke:#333, stroke-width:2px, rounded:true;
    style P3 fill:#91bcfd , stroke:#333, stroke-width:2px, rounded:true;
    style P4 fill:#91bcfd , stroke:#333, stroke-width:2px, rounded:true;
    style P5 fill:#91bcfd , stroke:#333, stroke-width:2px, rounded:true;

    P1[Model Input] --> |Fit P bootstrap samples| P2[MLP Models];
    P1[Model Input] --> |Fit P bootstrap samples| P3[xgboost Models];
    P1[Model Input] --> |Fit P bootstrap samples| P4[elastic net Models];
    P2  -->  P5[Meta Learner];
    P3  --> P5;
    P4  -->  P5;
kyle-messier commented 3 weeks ago

[Photos of the whiteboard schematic: IMG-0484, IMG-0485]

Hi All, @eva0marques @sigmafelix @mitchellmanware @dzilber @larapclark @MAKassien @Sanisha003

Following up on our discussion from today on a sampling strategy to alleviate memory pressure and runtime: @eva0marques, @sigmafelix, and I outlined a strategy that embraces the multiple-learners approach and fits in with the S-T cross-validation strategies. The multiple models will also take advantage of dynamic branching in targets.

The two pictures show our schematic - the second one with some of our random notes wiped away. In summary (a rough code sketch follows the list):

  1. P row bootstrap subsamples are taken for each of the base learners
  2. The bootstrap fraction M is fixed at 30%, so ~300K S/T samples per subsample
  3. This yields P x 3 base-learner outputs
  4. Each bootstrap model employs a space-time cross-validation strategy
  5. We think ~3 different S-T CV strategies would be good, so a given model has a 1/3 chance of getting a given CV strategy
  6. The 3 strategies have different focuses: 1 spatial, 1 temporal (e.g. season, month), and 1 edge case such as a quantile of imperviousness or rural/urban
  7. The meta learner is essentially just another bootstrap resample of the base learners
  8. The meta-learner prediction is the prediction of Q bootstrap-sampled learners (i.e. posterior predictions can easily provide mean, variance, or percentiles)
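
A rough rsample sketch of items 1-5 (names and values are illustrative; full_df stands in for the full covariate table, and each subsample would become a dynamic targets branch):

```r
# Draw P row subsamples at a fixed 30% fraction and randomly assign each
# one of three S-T CV strategies. Strategy labels are placeholders.
library(rsample)

P <- 50   # assumed number of subsamples per base learner

subsamples <- mc_cv(full_df, prop = 0.30, times = P)

cv_strategy <- sample(c("spatial", "temporal", "edge_case"),
                      size = P, replace = TRUE)

# Pair each subsample with its assigned CV strategy
head(data.frame(id = subsamples$id, cv_strategy = cv_strategy))
```
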
sigmafelix commented 2 weeks ago

@kyle-messier @eva0marques @mitchellmanware

Related to cross-validation strategies -- I added a function extending the previous generate_cv_index approach by employing anticlust::balanced_clustering and various preprocessing options (raw, normalization, and standardization) to get partially overlapping space-time groups. The result with the standardization option on the full dataset looks like the figures below:

Some distant subclusters or "seeds" are grouped into the same cluster because I am using the mean space-time coordinates of each seed to pick the top 10 closest pairs. We could discuss this implementation further.

[Figure: clustering result, viewing angle = 40 (roughly from the southeastern edge)]

[Figure: clustering result, viewing angle = 215 (roughly from the southwestern edge)]
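
For reference, a rough sketch of the standardization + balanced-clustering step described above (not the actual generate_cv_index code; full_df and its columns are placeholders):

```r
# Standardize the space-time coordinates, then form equal-size groups
library(anticlust)

st_mat <- scale(cbind(full_df$lon, full_df$lat, as.numeric(full_df$time)))

st_group <- balanced_clustering(st_mat, K = 10)  # equal-size space-time groups
table(st_group)                                  # identical group sizes by construction
```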

sigmafelix commented 2 weeks ago

@kyle-messier So do we have an exact value of P for model fitting? As mentioned in my comments above, the spatiotemporal grouping with overlaps is implemented, so the other two ways need to be added to the package, and then we're good to go for the model-fitting phase.