USEPA / spmodel

spmodel: Spatial Statistical Modeling and Prediction in R
https://usepa.github.io/spmodel/
GNU General Public License v3.0

seeking recommendations for modeling with big data #21

Closed. ptitle closed this issue 3 months ago

ptitle commented 3 months ago

Hi,

In some experimentation with splm() and the local argument, I am seeing stochasticity in the fitted model parameters, which is expected, since there will be variation in how the large dataset is partitioned.

Do you have recommendations for how to reduce that stochasticity and increase stability of the fitted model?

For instance, increasing the size argument seems to lead to less variation in the fitted parameter values. So perhaps increase this argument as much as is computationally feasible? Are there other arguments to change, or some combination? Or perhaps take means of the fitted parameter values across multiple repeats of model fitting?
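For concreteness, this is roughly the kind of comparison I have in mind (the data frame dat and the variable names below are just placeholders):

library(spmodel)

# fit the same spatial linear model with two different partition sizes
# to see how much the estimates move between them
fit_size_50 <- splm(y ~ x1 + x2, data = dat, spcov_type = "exponential",
                    xcoord = "x", ycoord = "y",
                    local = list(method = "kmeans", size = 50))
fit_size_200 <- splm(y ~ x1 + x2, data = dat, spcov_type = "exponential",
                     xcoord = "x", ycoord = "y",
                     local = list(method = "kmeans", size = 200))

# compare the covariance parameter estimates and the fixed effects
coef(fit_size_50, type = "spcov")
coef(fit_size_200, type = "spcov")
tidy(fit_size_50)
tidy(fit_size_200)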

Thank you - spmodel seems like a fantastic addition to the spatial modeling toolkit!

michaeldumelle commented 3 months ago

Thanks for the inquiry @ptitle and for the kind words about the software! I apologize for the delayed response; I have been out for the past few weeks on work-related travel!

First, you could select your own spatial index and pass it to local so that your results don't vary, but this doesn't really address your fundamental question. The nuances of the local argument are covered in detail in this publication and this publication (and in some detail in the splm() documentation linked here). In short, increasing the size of the partitions will reduce variability in the covariance parameter estimates. However, while the covariance parameter estimates can vary from partitioning to partitioning, it is rare for the fixed effect (i.e., slope) estimates (and their standard errors) to change much between partitionings. The reason for this has to do with the fact that the covariance parameters actually aren't uniquely estimable, but the covariance they imply (which is used to compute the fixed effect estimates and their standard errors) generally is uniquely estimable (Zimmerman and Ver Hoef, 2024)! And in fact, in those previously linked articles, we show that while it is possible covariance parameter estimates vary from partitioning to partitioning, the statistical properties of the fixed effect estimators behave as intended (are unbiased, have proper confidence interval coverage, etc.). My recommendation is to first see if your fixed effect estimates (and their standard errors) notably change across different partitionings, and if they don't, proceed with one of the model fits. If they do notably change across different partitionings, please follow up here. Also, if you have any other questions, please don't hesitate to reach out!

I think your "model averaging" idea is an interesting one, but we haven't explicitly studied this type of question, so at this moment I can't speak to whether or not it would be effective.

References

Zimmerman, D.L., & Ver Hoef, J.M. (2024). Spatial Linear Models for Environmental Data (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429060878

ptitle commented 3 months ago

Thanks! I'll follow up if I find that fixed effect parameters are insufficiently stable.

ptitle commented 3 months ago

Hi @michaeldumelle ,

I did an experiment with an empirical dataset. This dataset has 8100 observations, and in this case I'm including a response and 3 predictors.

I was able to fit the model without the big data approximation (took a little while), and this is what I get:

Call:
splm(formula = het_sizeNN ~ log(epmReduced) + log(het_topoComplexity) +
    log(het_habitat), data = het_cellDF, spcov_type = "exponential",
    xcoord = "x", ycoord = "y", local = FALSE)

Residuals:
     Min       1Q   Median       3Q      Max
-0.27116 -0.01011  0.02647  0.07755  3.57455

Coefficients (fixed):
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)              0.456453   0.090852   5.024 5.06e-07 ***
log(epmReduced)         -0.231962   0.005892 -39.366  < 2e-16 ***
log(het_topoComplexity)  0.004141   0.002153   1.923   0.0544 .
log(het_habitat)         0.014591   0.007391   1.974   0.0484 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Pseudo R-squared: 0.1613

Coefficients (exponential spatial covariance):
       de        ie     range
8.449e-02 2.245e-07 5.981e+05

I then ran 1000 replicate runs of the same model, but with local = list(method = "kmeans", size = 100, var_adjust = "theoretical", parallel = TRUE, ncores = 6).
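A sketch of that replication (the loop mechanics here are illustrative; the model call and local settings are the ones above):

# refit the model 1000 times; each fit draws a fresh k-means partitioning
# of the 8100 locations, then the fixed effect summaries are collected
reps <- lapply(seq_len(1000), function(i) {
  fit_i <- splm(het_sizeNN ~ log(epmReduced) + log(het_topoComplexity) +
                  log(het_habitat), data = het_cellDF,
                spcov_type = "exponential", xcoord = "x", ycoord = "y",
                local = list(method = "kmeans", size = 100,
                             var_adjust = "theoretical",
                             parallel = TRUE, ncores = 6))
  tidy(fit_i)  # estimate, std.error, statistic, p.value per fixed effect
})
rep_coefs <- do.call(rbind, reps)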

Here I am showing parameter estimates across the 1000 runs as histograms, and the red vertical bars represent those parameter estimates from the local = FALSE model.

[Screenshot: histograms of the fixed effect estimates, standard errors, and p-values across the 1000 kmeans-partitioned fits, with red vertical bars marking the local = FALSE estimates]

It's important to keep the scale in mind: in an absolute sense, the variation in the slopes and standard errors is relatively small. But the variation in the p-values is fairly wide, and perhaps more importantly, the red bar often falls in the tails of these histograms.

Is this simply to be expected, given what the approximation is doing? I suppose it makes sense that the standard errors would be smaller when the full dataset is fitted all at once.

Thanks for your time!

michaeldumelle commented 2 months ago

Thanks so much @ptitle for looking into this more! As you mentioned, the estimates and standard errors are very close to one another (in an absolute sense), and, as previously noted, we generally know that, on average, spatial indexing tends to yield confidence intervals for the slope parameters that have proper coverage. However, it is interesting that the full data estimates tend to fall in the tails of these histograms; that could be an artifact of this particular data set, or it could be suggestive of underlying behavior that requires further thought. Notably, the standard errors from spatial indexing were larger than those from the full data set for all three slope parameters. I will think on this some more, point my collaborators to this, and reach back out if necessary.

michaeldumelle commented 1 month ago

Hi @ptitle, I have had some time to think about this more and chat with collaborators.

First, it is worth mentioning again (for future readers of this issue) that the slope estimates and standard errors from the spatial indexing models are very close to those of the full data model (in an absolute sense).

Second, the lower standard errors for the slopes in the full data model make sense to us, as using all the data should yield a model with more precise (i.e., lower standard error) parameter estimates. Spatial indexing can be thought of as reducing the sample size used to estimate the covariance parameters, which increases uncertainty. So while the spatial indexing slope estimates have proper coverage (as previously discussed), their standard errors should tend to be larger than the slope standard errors from the full data model.

Third, we think the bias in the spatial indexing parameter estimates may have to do with 1) this being a single data set and 2) using the "kmeans" partitioning. The 1000 random trials use 1000 different k-means starting values with 100 clusters. Because your spatial locations are not changing, we can't envision these k-means partitions being wildly different between trials, so the trials are essentially resampling the difference between the full data model and spatial indexing for this particular dataset, and there will always be some difference between the two. However, maybe the "random" method would yield slope estimate distributions whose centers fall closer to the red lines?
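To try that, something along these lines (your settings from before, just swapping the partitioning method) should do it:

# same model and settings, but observations are assigned to groups at
# random rather than by k-means clustering on the coordinates
fit_rand <- splm(het_sizeNN ~ log(epmReduced) + log(het_topoComplexity) +
                   log(het_habitat), data = het_cellDF,
                 spcov_type = "exponential", xcoord = "x", ycoord = "y",
                 local = list(method = "random", size = 100,
                              var_adjust = "theoretical",
                              parallel = TRUE, ncores = 6))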

In short, the takeaway is that spatial indexing yields models that have proper behavior (unbiased, proper coverage for slope parameters and predictions) but they are going to be slightly less precise/informative than the corresponding quantities from the full data model.

ptitle commented 1 month ago

Hi @michaeldumelle, thanks so much for this follow-up, and I appreciate that you brought this to your coauthors as well!

I did try redoing this with the "random" method. No real improvement, as far as I can tell; if anything, the parameter estimate distributions are even further from the values estimated all at once?

[Screenshot: the same histograms for the "random" partitioning method runs]

But regardless, one obviously should not benchmark the performance of a method on a single dataset. It could be that if this procedure were repeated across a number of datasets, the overall picture would look different. I was just reporting what I was seeing and making sure I was thinking about this correctly.

Thanks again for your time!