Stratified partitioning

damondamondamon commented 9 months ago

Dear spmodel-Team,

thanks a lot for this helpful package!

I was using your package in an inference context to estimate the effect of auxiliary variables on my target variable in a spatial setting. Some of these auxiliary variables are categorical. Something like:

"Yield ~ Soil_Type (Categorical) + Elevation (Continuous) + etc"

When working with large datasets (n > 10,000 with partition size > 500), I sometimes get the warning message:

At least one partition's inverse covariance matrix is singular. Redjusting using var_adjust = "none". (I guess it should be readjusting?)

While this could be due to a genuine singularity in the covariance matrix, I suspect it is caused by the unbalanced distribution of my categorical variables (e.g. two soil types with a 90%/10% split).
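A quick way to probe this assumption is to tabulate the factor levels per partition and look at the rank of each partition's design matrix. The sketch below uses simulated data and base R only; the column names (`x`, `y`, `soil_type`), the number of partitions, and the corner-shaped clustering of the rare level are all made up for illustration. It mimics the mechanism (a level missing from a partition makes that partition's design matrix rank deficient) and is not a reproduction of spmodel's internal checks.

```r
set.seed(42)
n <- 2000

# Hypothetical coordinates and an imbalanced, spatially clustered factor:
# level "B" (roughly 9% of rows) only occurs in one corner of the region.
dat <- data.frame(x = runif(n), y = runif(n))
dat$soil_type <- factor(ifelse(dat$x > 0.7 & dat$y > 0.7, "B", "A"),
                        levels = c("A", "B"))

# Partition on the coordinates with kmeans(); 4 partitions is an arbitrary choice.
part <- kmeans(dat[, c("x", "y")], centers = 4, nstart = 10)$cluster

# Count of each soil type per partition; a zero marks a missing level.
tab <- table(partition = part, soil_type = dat$soil_type)
print(tab)

# Rank of the design matrix (intercept + soil_type dummy) within each partition.
# A rank below 2 means the partition cannot separate the two soil types.
ranks <- sapply(split(dat, part), function(sub) {
  qr(model.matrix(~ soil_type, data = sub))$rank
})
print(ranks)
```

Partitions whose row in `tab` contains a zero are exactly the ones with a rank below 2.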

Related questions / suggestions:

1. What actually happens when this warning is shown? Do you simply ignore that partition's result and average over the remaining partitions? Or do you switch the corresponding partition's regression to a non-spatial one?
2. I would suggest adding an option for stratified sampling so that all categorical levels are represented in each partition, even though this case is not covered in your paper; I am not sure whether this conflicts with the statistical properties that you present there.

Minor side question: When using spmodel for inference (in my case n = 6,000) with spcov_type set to "none" (just as a non-spatial reference) and local explicitly set to FALSE, I would expect a routine that is equivalent to lm. Still, the computation time is exceptionally high (lm ~1 sec, splm > 500 sec). I will try to add a reproducible example.

michaeldumelle commented 9 months ago

Thanks @damondamondamon for the issue and for the kind words about the software! A few thoughts below:

Yes, it should read "readjusting."
You are correct that the message has to do with the unbalanced distribution in your categorical variables. If one of the partitions does not have both soil types, you will receive the message you did. You can supply your own partitions via index to local, and this is what I recommend you do to avoid the message. By default, spmodel uses kmeans() to select the indexes, so you could do this on your own, and if you have a partition without both soil types, either select a new set of partitions or rearrange a few observations. Also see my answer to "Related question 2" below.
Related question 1: This paper outlines our big data methods for spatial models. Equation 13 is the default variance formula we use (var_adjust = "theoretical" to local) to compute the variances of the explanatory variable slope estimates. Notice that there are $V_{i,i}^{-1}$ terms, which end up being singular when both soil types are not observed in the corresponding $X_i$ matrix. When var_adjust = "none" to local, only $T_{xx}^{-1}$ from Equation 13 is used to compute the variances of the explanatory variable slope estimates. The important takeaway is that we use a "shortcut" to fit the model to large spatial data and then must use Equation 13 to get the theoretically correct variances of the explanatory variable slope estimates. If we don't apply this adjustment (or we can't because one of the $V_{i,i}$ terms is singular), the variances of the explanatory variable slope estimates tend to be a little too small, leading to narrower confidence intervals. Regardless of the var_adjust type, the spatial covariance parameter estimates and explanatory variable estimates are the same (all that changes is the variance of the explanatory variable slope estimates).
Related question 2: This is a great suggestion, but we are unlikely to implement this directly in spmodel in the near future, as there are already some pieces of software in the tidymodels ecosystem (rsample and spatialsample) designed for this. You can use these pieces of software to partition your data and use the partitions to create the vector that is the index argument to local.
Minor side question: The team is aware of this and plans to fix it. When there are no random effects specified via the random argument to splm(), the model should be equivalent to lm() and fit much faster. However, this fix may take a bit of time to implement as we need to create a custom routine for it that works with everything else. We did not do this originally because we believed people would primarily use lm() to fit non-spatial models, but we now see the utility in making model comparisons between spatial and non-spatial models directly using spmodel's existing architecture, which necessitates fitting a model with splm(..., spcov_type = "none").
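As a rough illustration of supplying your own partitions, the sketch below builds a stratified index vector in base R. The helper `stratified_index()`, the simulated data, and the column names are hypothetical and not part of spmodel. Note that it assigns partitions at random within each soil type, which is a different scheme from the kmeans()-based default, so treat it as one possible starting point rather than a recommended method.

```r
set.seed(1)
n <- 10000

# Hypothetical data with an imbalanced two-level factor (about 90% / 10%).
dat <- data.frame(
  x = runif(n),
  y = runif(n),
  soil_type = factor(sample(c("A", "B"), n, replace = TRUE, prob = c(0.9, 0.1)))
)

# Hypothetical helper: shuffle the rows of each level and deal them out
# round-robin, so every level is spread across all partitions.
stratified_index <- function(strata, size) {
  n_part <- ceiling(length(strata) / size)
  index <- integer(length(strata))
  for (lev in unique(strata)) {
    rows <- which(strata == lev)
    rows <- rows[sample.int(length(rows))]
    index[rows] <- rep_len(seq_len(n_part), length(rows))
  }
  index
}

idx <- stratified_index(dat$soil_type, size = 500)

# Check that every partition contains every soil type.
tab <- table(partition = idx, soil_type = dat$soil_type)
print(all(tab > 0))

# Sketch of passing the vector to spmodel (not run here; the formula and
# the data object 'dat_sf' are placeholders):
# fit <- splm(yield ~ soil_type + elevation, data = dat_sf,
#             spcov_type = "exponential", local = list(index = idx))
```

The round-robin assignment only guarantees full coverage when every level has at least as many observations as there are partitions, so the `all(tab > 0)` check is worth keeping for rare levels.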

Please let us know if you have any additional questions!

michaeldumelle commented 8 months ago

@damondamondamon we have improved the efficiency of splm() when there are no random effects (see here).

You can download the development version of spmodel by running

remotes::install_github("USEPA/spmodel", ref = "develop")

This fix will be part of the next CRAN update (the current version on CRAN is 0.5.1).
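To see whether the change helps in a given setup, a small comparison along the following lines may be useful. It is only a sketch: it assumes spmodel is installed, uses the `moss` example data that ships with the package rather than the data from this issue, and the chosen formula is arbitrary. With only a few hundred rows the timing gap will be much smaller than for n = 6,000.

```r
ran_comparison <- FALSE

if (requireNamespace("spmodel", quietly = TRUE)) {
  library(spmodel)

  # Non-spatial reference fit with lm().
  lm_time <- system.time(
    lm_fit <- lm(log_Zn ~ log_dist2road, data = moss)
  )

  # Same mean structure fitted through splm() without spatial covariance.
  sp_time <- system.time(
    sp_fit <- splm(log_Zn ~ log_dist2road, data = moss, spcov_type = "none")
  )

  # Elapsed seconds for each fit.
  print(c(lm = lm_time[["elapsed"]], splm = sp_time[["elapsed"]]))

  # The coefficient estimates are expected to agree up to numerical tolerance.
  print(all.equal(unname(coef(lm_fit)), unname(coef(sp_fit)), tolerance = 1e-6))

  ran_comparison <- TRUE
}
```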

I will go ahead and close the issue but please reach out if anything else comes up!

USEPA / spmodel

Stratified partitioning #15