NIEHS / PrestoGP

Penalized Regression on Spatiotemporal Outcomes using Gaussian Processes a.k.a. PrestoGP
https://niehs.github.io/PrestoGP/
0 stars 0 forks source link

Max-min ordering and nearest neighbor calculation could be optimized #40

Open ericbair-sciome opened 6 months ago

ericbair-sciome commented 6 months ago

I believe that our current functions for computing the max-min ordering and nearest neighbors could be improved to reduce running time. Specifically, I believe our current implementations are O(n^2) but O(n*log(n)) algorithms exist. See this paper for details:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6707751/

I think there are other optimizations that would give us more bang for our buck, so this is low priority for now. But I wanted to log it so we don't forget about it.

kyle-messier commented 6 months ago

@ericbair-sciome why are we not able to utilize the NN max-min ordering from GPvecchia? or something from the FNN cran package?

ericbair-sciome commented 6 months ago

We actually had an internal discussion about this at Sciome a week or two ago. Deepak said that at some point before I started working on this project, there was a desire to calculate the max-min ordering and nearest neighbors for non-Euclidean distances. The GPvecchia functions have Euclidean distance hard coded, and there is no way to override that default.

My current plan is to use the GPvecchia functions when we are using Euclidean distance and then use our custom functions when a user specifies a non-Euclidean distance. I have wondered if even that is worth the trouble. I'm tempted to just always use Euclidean distance to calculate (using the GPvecchia functions) the max-min ordering/nearest neighbors and call it a day. (Users would still be able to pas a non-Euclidean distance to the Matern model. Only the ordering and nearest neighbors would use Euclidean distance.) I have found that for reasonably large m, the choice of nearest neighbors doesn't seem to make much difference (provided that the choice is not completely horrible). And I don't think an O(n*log(n)) algorithm for computing nearest neighbors even exists in general for non-Euclidean distances. So, I've wondered if it's even worth giving the user this option.

However, if you can imagine any circumstance at all where a user might want to use a non-Euclidean distance function for nearest neighbors, then I will leave it in there. The code is already written; it's just a little slow. The only reason to take it out is if we decided that there is never a good reason to do that and we wanted to avoid giving users the proverbial rope with which to hang themselves.