cloudsci / cloudmetrics

Toolkit for computing 15+ metrics characterising 2D cloud patterns

Refactor Radial Distribution Function (RDF) #43

Open leifdenby opened 2 years ago

leifdenby commented 2 years ago

@martinjanssens I'm a bit confused about what the RDF function should be returning. I guess what it is calculating internally is the probability distribution function as a density (i.e. the probability of finding a mask at a given distance). Currently that is given as a function of radial distance r; call the distribution f(r). The scalar values that this function can return are then different modes of this distribution, is that right?

Is there any reason why the distance couldn't be given in pixels? I realise that fractional indices don't make a lot of sense since pixel indices are integers, but that shouldn't matter for the distribution, no? What I am suggesting is that the grid resolution becomes an optional argument: if provided, it is assumed to be in metres and all the returned values are scaled by this number. Does that make sense?
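To make this concrete, here's a minimal sketch of the interface I have in mind (the function below is just for illustration, not the PR's actual pair_correlation_2d, and the normalisation is only one plausible choice):

```python
import numpy as np
from scipy.spatial.distance import pdist


def rdf(positions, domain_size, dx=None, dr=1.0):
    """Pairwise-distance RDF; everything is computed in pixel units.

    If the optional grid spacing dx is given (assumed metres), only the
    returned radii are rescaled -- f(r) itself stays dimensionless.
    """
    dists = pdist(positions)  # unique pairwise distances, in pixels
    edges = np.arange(0.0, dists.max() + dr, dr)
    counts, edges = np.histogram(dists, bins=edges)
    # normalise by the expected count per annulus for a uniform point
    # density rho over a square domain (boundary effects ignored here)
    n = len(positions)
    rho = n / domain_size**2
    annulus = np.pi * (edges[1:] ** 2 - edges[:-1] ** 2)
    f_r = 2 * counts / (n * rho * annulus)
    radii = 0.5 * (edges[:-1] + edges[1:])
    if dx is not None:
        radii = radii * dx  # pixels -> metres
    return radii, f_r
```

So e.g. `rdf(pos, domain_size=512, dx=100.0)` would return radii in metres for a 512-pixel domain at 100 m resolution, while `f_r` itself is unchanged.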

martinjanssens commented 2 years ago

Yes, I think your analysis is completely right and your proposal makes sense. As you say, f(r) is in principle dimensionless in and of itself, which shouldn't change whether dx is given in pixels or physical units (as long as L = n*dx is consistently updated). It might be useful to be able to specify a variable dr for this function, say, as that's essentially the bin width used to construct f(r), but again that could be in pixels. The metrics I computed from f(r) were simply its maximum, the difference between the maximum and the minimum, and its integral over r; the first of these proved the most expressive for the fields I was looking at. It is, I think, already a non-dimensional number, which wouldn't change with dx anyway.
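For reference, those three scalars are one-liners given a sampled f(r) (array names as in the sketch above):

```python
import numpy as np

# radii, f_r: bin centres and f(r) values, e.g. from the sketch above
rdf_max = f_r.max()                    # most expressive in practice
rdf_diff = f_r.max() - f_r.min()       # max-minus-min difference
rdf_integral = np.trapz(f_r, x=radii)  # integral of f(r) over r
```

Note that, unlike the maximum, the integral does pick up a factor of dx if the radii are rescaled to physical units.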

leifdenby commented 2 years ago

Thanks for your feedback @martinjanssens!

I've spent this morning working on this and here are a few thoughts:

  1. I'm a bit stuck with the S (domain size) argument to the rdf functions. I think the way it works is that for periodic domains we have copied the objects that wrap across the boundary (?) and so the size of the domain is actually smaller than the shape of the object_labels 2D array. But I'm a bit confused about the normalisation constant in pair_correlation_2d (https://github.com/cloudsci/cloudmetrics/pull/43/files#diff-9ad196211d6f5181f667aef506ed9fcf4396dc697121d952f007681da1e9123aR63). Could you maybe explain how using periodic vs non-periodic domains should affect the RDF calculation in your mind?
  2. I've come across a package that computes the RDF from point coordinates: rdfpy. It has a nice interface, so at least for non-periodic data I think we can use it (I don't think it assumes the point coordinates live in a periodic domain). It's also parallelised, which is neat, so it should be quite a bit faster :) (a rough usage sketch follows the figure below)
  3. I've refactored some code I wrote to do Poisson disc sampling, modified so that I can control the nearest-neighbour distance by sampling a distribution. I've created a notebook demonstrating its use. I was thinking we could use this to create a test, since we should know the first peak of the RDF when we synthesise the data this way. If you have a chance, maybe you could play around with it? I'm a bit confused that the original pairwise-distance method you implemented has non-zero values below the first peak, but maybe that is to be expected?

[figure: rdf_example]
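To make points 2 and 3 concrete, here's a rough sketch of what such a test could look like, with scipy's qmc.PoissonDisk standing in for my sampler (it enforces a hard minimum separation rather than sampling a distribution) and assuming rdfpy's documented (g_r, radii) return order (worth double-checking against its docs):

```python
import numpy as np
from scipy.stats import qmc  # PoissonDisk needs scipy >= 1.10
from rdfpy import rdf        # pip install rdfpy

# synthesise points with a known minimum separation r_min
r_min = 0.05
engine = qmc.PoissonDisk(d=2, radius=r_min, seed=42)
points = engine.fill_space()  # fills the unit square

g_r, radii = rdf(points, dr=0.005)

# with a hard minimum separation, g(r) should be ~0 below r_min and
# the first peak should sit at (or just above) r_min
first_peak = radii[np.argmax(g_r)]
print(f"first RDF peak at r = {first_peak:.3f} (expected near {r_min})")
```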

All of this is making me think that maybe we should push RDF to v0.3.0. I think it will take quite a bit more experimentation and refactoring to get it to a state we're happy with. What do you think?

leifdenby commented 2 years ago

> All of this is making me think that maybe we should push RDF to v0.3.0. I think it will take quite a bit more experimentation and refactoring to get it to a state we're happy with. What do you think?

I'm so slow I hadn't even realised we'd already agreed to do this :) sorry

martinjanssens commented 2 years ago

Great, thanks for looking at this in so much detail! I've had a look at the code, and I think you're right to be confused regarding point 1. I think this was my thought:

  1. pair_correlation_2d would be called inside a function (e.g. compute_rdf) where a mask with shape domain_shape and a bunch of labelled objects' positions (pos) have already been computed.
  2. if periodic_domain==True, compute_rdf would first contain a routine similar to the one we have for iorg, where we i) move centroids that fall outside the original domain back into that domain and ii) set domain_size to half the expanded shape.
  3. Call pair_correlation_2d with the new pos and domain_size; it can then be thought of as operating in the original domain again for both periodic and open BCs, and I think the normalisations make sense then (rough sketch below).
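In code, roughly (compute_rdf is hypothetical and pair_correlation_2d's exact signature is assumed, not taken from the PR):

```python
import numpy as np

def compute_rdf(pos, object_labels, periodic_domain=False, dr=1.0):
    """Sketch of the wrapper described above (hypothetical names)."""
    domain_size = np.array(object_labels.shape, dtype=float)
    if periodic_domain:
        # centroids were found on a domain expanded 2x in each direction
        # (boundary-wrapping objects were copied), so: (i) halve the
        # domain size, (ii) wrap centroids back into the original domain
        domain_size = domain_size / 2
        pos = np.mod(pos, domain_size)
    # pair_correlation_2d then effectively operates in the original
    # domain for both periodic and open BCs (signature assumed)
    return pair_correlation_2d(pos, S=domain_size, dr=dr)
```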

Do you agree? :)

martinjanssens commented 2 years ago

Using an existing package is a great idea, as is using your Poisson disc sampling as a reference case! I can take a look at where the differences with our implementation come from, but they may very well be discretisation artefacts related to binning the RDF, which I think you can theoretically correct for (and we don't).
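One way such binning artefacts can show up: with a finite bin width dr, the bin that straddles the true minimum separation already collects counts, so f(r) appears non-zero below the first peak; refining dr (or correcting for bin placement) localises it. A small self-contained illustration with synthetic distances, unrelated to either implementation:

```python
import numpy as np

# synthetic pairwise distances with a hard minimum separation r_min
rng = np.random.default_rng(0)
r_min = 1.0
dists = r_min + rng.exponential(scale=0.3, size=100_000)

for dr in (0.4, 0.15, 0.02):
    edges = np.arange(0.0, 3.0, dr)
    counts, _ = np.histogram(dists, bins=edges)
    first = edges[np.nonzero(counts)[0][0]]
    print(f"dr={dr:4.2f}: first non-zero bin starts at r={first:.2f}")
```

With dr=0.4 the first non-zero bin starts at r=0.8, well below the true cutoff at r=1.0, and only approaches it as dr shrinks.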

martinjanssens commented 2 years ago

And finally, yes let's pick this up for the next version, I agree we'll probably be playing around with this for a while still!