creare-com / podpac

Pipeline for Observational Data Processing Analysis and Collaboration
https://podpac.org
Apache License 2.0

Efficient Computation of Statistics for a Non-linear N-dimensional Function with Dependent Variables #508

Open CFoye-Creare opened 1 year ago

CFoye-Creare commented 1 year ago

Objective: Develop an efficient method to compute the statistics s(c) for a given non-linear n-dimensional function f(x, y) with dependent variables x and y, such that F(f(s(x), s(y))) = s(c), while minimizing computational expense.

Background: The non-linear n-dimensional function f(x, y) has dependent variables x and y, and directly calculating s(f(x, y)) to obtain s(c) is computationally expensive. Instead, we want an efficient approach that computes F(f(s(x), s(y))) to derive s(c) at reduced computational cost.

Input:

  1. A non-linear n-dimensional function f(x, y)
  2. Statistics on x and y, denoted s(x) and s(y)

Output: Statistics on the function output, denoted s(c), such that F(s(x), s(y)) = s(c). For example, given two PDFs P(x) and P(y), we want to approximate F such that F(P(x), P(y)) = PDF(c).

Constraints: The proposed method should significantly reduce computational cost compared to directly calculating s(f(x, y)) to obtain s(c).
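For concreteness, here is a minimal NumPy sketch of the brute-force baseline the method should beat: estimating s(c) by evaluating f on a large number of correlated (x, y) samples directly. The toy function, the correlation structure, and the sample size are illustrative, not from the codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, y):
    # Toy non-linear function standing in for the expensive model
    return np.exp(-x**2) * np.sin(3 * y) + x * y

# Dependent inputs: x and y drawn from a correlated bivariate normal
cov = [[1.0, 0.6], [0.6, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000)
c = f(xy[:, 0], xy[:, 1])

# Direct statistics s(f(x, y)): accurate, but requires many evaluations of f
print("mean(c) =", c.mean(), "var(c) =", c.var())
```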

Evaluation Metrics: The efficiency of the proposed method will be evaluated based on the following criteria:

  1. Accuracy: The computed s(c) should be accurate and comparable to the result obtained from s(f(x, y)).
  2. Computational cost: The proposed method should demonstrate a significant reduction in computational cost compared to calculating s(f(x, y)) directly.
  3. Scalability: The method should be able to handle large-scale problems with high-dimensional functions and large datasets for x and y.
  4. Robustness: The method should be robust to variations in the function and input data.

Deliverables

CFoye-Creare commented 1 year ago

Some possible approaches include:

  1. Surrogate Modeling: Use surrogate models, such as Gaussian Process Regression or Radial Basis Function networks, to approximate the function f(x, y) and then compute F(f(s(x), s(y))). These models can reduce the computational cost by providing a fast approximation of the function (a minimal sketch follows this list). Relevant Resource
  2. Sparse Grid Techniques: Use sparse grid techniques to approximate the function f(x, y) and then compute F(f(s(x), s(y))). Sparse grids can reduce the computational cost by handling high-dimensional functions with far fewer grid points. Relevant Resource
  3. Machine Learning: Leverage machine learning techniques, such as neural networks or support vector machines, to learn the underlying relationship between the inputs and output of f(x, y) and then compute F(f(s(x), s(y))). These techniques can reduce the computational cost by learning an approximation of the function. Relevant Resource
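As a rough illustration of the surrogate-modeling option, the sketch below fits a Gaussian Process to a small design of expensive evaluations and then queries the cheap surrogate to estimate s(c). It assumes scikit-learn is available; the toy function, design size, and input distribution are placeholders rather than anything from PODPAC.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def f_expensive(x, y):
    # Stand-in for the expensive fine-scale model
    return np.exp(-x**2) * np.sin(3 * y) + x * y

# Fit the surrogate on a small design of expensive evaluations
X_train = rng.uniform(-2, 2, size=(200, 2))
z_train = f_expensive(X_train[:, 0], X_train[:, 1])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, z_train)

# Query the cheap surrogate at many samples of the joint (x, y) distribution
cov = [[1.0, 0.6], [0.6, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
c_approx = gp.predict(xy)
print("surrogate mean(c) =", c_approx.mean(), "var(c) =", c_approx.var())
```

Here the expensive model is evaluated only 200 times while the surrogate is queried 100,000 times; the accuracy of s(c) then depends on how well the surrogate captures f over the support of (x, y).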
CFoye-Creare commented 1 year ago

Also notable is the application of dimensionality reduction techniques before building surrogate models or approximations.

  1. Principal Component Analysis (PCA): PCA is a linear dimensionality reduction technique that transforms the input data into a new coordinate system by finding orthogonal axes (principal components) that capture the most variance in the data. The principal components are linear combinations of the original features, and the transformed data can be represented with fewer dimensions by retaining only the components that account for the most significant variance (a minimal sketch follows this list). Relevant paper: Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202. (https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202)
  2. t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a non-linear dimensionality reduction technique that maps high-dimensional data to a lower-dimensional space while preserving local structures. It measures pairwise similarities between data points in the high-dimensional space and the lower-dimensional space, and minimizes the divergence between these similarity distributions using a gradient descent approach. t-SNE is particularly useful for visualizing complex data structures. Relevant paper: van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605. (https://jmlr.org/papers/v9/vandermaaten08a.html)
  3. Uniform Manifold Approximation and Projection (UMAP): UMAP is a non-linear dimensionality reduction technique based on manifold learning and topology. It approximates the high-dimensional manifold structure by constructing a graph representation of the data and then optimizes an embedding in the lower-dimensional space to preserve both local and global structures. UMAP is computationally efficient and scalable, making it suitable for large-scale data analysis and visualization. Relevant paper: McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. (https://arxiv.org/abs/1802.03426)
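For the PCA option, here is a minimal sketch of reducing a high-dimensional input before fitting a surrogate. It assumes scikit-learn; the 10-dimensional toy input and the 95% variance threshold are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# High-dimensional inputs, e.g. stacked features describing x and y per grid cell
X = rng.normal(size=(5000, 10))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]  # introduce redundancy between features

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print("reduced from", X.shape[1], "to", X_reduced.shape[1], "dimensions")

# A surrogate model (e.g. the GP sketch above) would then be fit on X_reduced,
# and new inputs would be passed through pca.transform before prediction.
```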
CFoye-Creare commented 1 year ago

Since opening this issue, we have evolved our approach. We are now using the Law of the Unconscious Statistician (LOTUS) and a change of variables to solve this problem.

Our approach became:

  1. Calculating a PDF with n_bins bins for each coarse-resolution grid square
  2. Sampling the center of each bin and evaluating the fine-scale soil moisture there
  3. Multiplying the evaluated soil moisture by the PDF to weight it correctly (a minimal sketch is included below).

This commit shows the approach using toy data: ab6618a51819ba216dbdba134572193edc762f79. This commit shows it using real data: 27140d217985e3437e22d65c2eaee723aa981959
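For reference, here is a minimal sketch of the weighted-bin-center idea on toy data. It is not the committed code: the histogram-based PDF, the bin count, and the sigmoid stand-in for the fine-scale soil-moisture function are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 32

def fine_scale_model(x):
    # Stand-in for the expensive fine-scale soil-moisture function
    return 1.0 / (1.0 + np.exp(-3.0 * (x - 0.4)))

# 1. PDF with n_bins bins for one coarse-resolution grid square
x_fine = rng.beta(2.0, 5.0, size=100_000)        # toy fine-scale input samples
pdf, edges = np.histogram(x_fine, bins=n_bins, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
weights = pdf * np.diff(edges)                   # probability mass per bin

# 2. Evaluate the model only at the bin centers (n_bins evaluations, not 100k)
c_at_centers = fine_scale_model(centers)

# 3. Weight by the PDF to recover the coarse-cell statistics (LOTUS)
mean_lotus = np.sum(weights * c_at_centers)
var_lotus = np.sum(weights * (c_at_centers - mean_lotus) ** 2)

# Reference: direct (expensive) statistics from evaluating every fine sample
c_direct = fine_scale_model(x_fine)
print(mean_lotus, c_direct.mean())
print(var_lotus, c_direct.var())
```

The bin count trades accuracy against the number of fine-scale model evaluations, which is what the MSE-vs-bin-size figures below explore.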

Here are some figures. MSE vs. bin size for calculating mean soil moisture: [image]

And for calculating variance: [image]