cloudsci / cloudmetrics

Toolkit for computing 15+ metrics characterising 2D cloud patterns

Refactor cloud size distribution (CSD) #44

Open leifdenby opened 2 years ago

leifdenby commented 2 years ago
martinjanssens commented 2 years ago

Nice that you've started this one, @leifdenby, thanks for that! The CSD metrics are among the more impractical ones to use, I've found, as quite a lot of manual tuning of the binning routine is needed to make it work consistently over a dataset (currently it's tuned to the MODIS data, but that doesn't work very well with LES output, for instance). I'm not really offering solutions, I'm afraid, just making a note of this in case you have a brilliant idea :)

leifdenby commented 2 years ago

> The CSD metrics are among the more impractical ones to use, I've found, as quite a lot of manual tuning of the binning routine is needed to make it work consistently over a dataset (currently it's tuned to the MODIS data, but that doesn't work very well with LES output, for instance). I'm not really offering solutions, I'm afraid, just making a note of this in case you have a brilliant idea :)

Hmm, maybe the domain needs to be quite big for there to be a reasonably sized population of cloud objects. There are techniques for estimating a good bin size given the number of samples. Maybe we could use a Bayesian fit for the model and use the parameter uncertainty to warn when the model is a poor fit?
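For concreteness, a minimal sketch of what a sample-size-aware binning rule could look like (the function name and the choice to bin in log10(area) are just for illustration, not anything currently in cloudmetrics):

```python
import numpy as np

def size_distribution_bins(cloud_areas, estimator="fd"):
    """Log-spaced bin edges, with the bin count chosen from the sample itself."""
    log_areas = np.log10(cloud_areas)
    # numpy's Freedman-Diaconis ("fd") and similar estimators adapt the bin
    # width to the number of samples and their spread, so a small LES domain
    # automatically gets coarser bins than a large MODIS scene.
    edges = np.histogram_bin_edges(log_areas, bins=estimator)
    return 10.0 ** edges

# e.g. with only ~50 cloud objects in a field:
rng = np.random.default_rng(0)
areas = (rng.pareto(a=1.7, size=50) + 1.0) * 4.0   # synthetic areas, arbitrary units
counts, edges = np.histogram(areas, bins=size_distribution_bins(areas))
```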

martinjanssens commented 2 years ago

This definitely works much better when there are many objects and the statistics converge reasonably. The trouble is, for many (smaller) fields, you might actually not have that many objects (<100). In the case of aggregated deep convection, you might even have <10.

Your suggestion is great, I think: it would be really nice to have an automatic binning procedure that has inferable pdfs over its free parameters. Did you have anything concrete in mind? I don't, but am up for some tinkering/googling. :)

leifdenby commented 2 years ago

> Your suggestion is great, I think: it would be really nice to have an automatic binning procedure that has inferable pdfs over its free parameters. Did you have anything concrete in mind? I don't, but am up for some tinkering/googling. :)

Not exactly, and this might be getting into overkill territory here :D Maybe what we need to do is start benchmarking some of these ideas. It would be really cool to know things like:

a) how large does a domain need to be for different metrics to work well?
b) how well do different metrics handle broad vs narrow size distributions?
c) what is the effect of resolution on different metrics?

But maybe I'm getting ahead of myself here. I recently wrote some code based on the idea of Poisson-disc sampling (https://www.jasondavies.com/poisson-disc/) to place objects of varying size at a controlled (prescribed) distance (sampled from a distribution). That could be used to synthesize domains of different size but with a controlled amount of organisation. What do you think?
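Something along these lines, as a very rough sketch (using scipy's built-in Poisson-disc sampler, available from scipy 1.10, as a stand-in for the code I wrote; the parameters are only illustrative):

```python
import numpy as np
from scipy.stats import qmc  # PoissonDisk requires scipy >= 1.10

def synthetic_cloud_mask(n_pixels=256, min_spacing=0.08, n_objects=40, seed=0):
    """Binary 'cloud' mask with a prescribed minimum spacing between objects."""
    rng = np.random.default_rng(seed)
    engine = qmc.PoissonDisk(d=2, radius=min_spacing, seed=seed)
    centres = engine.random(n_objects)                 # (n, 2) points in [0, 1)^2
    radii = rng.lognormal(mean=-3.5, sigma=0.4, size=len(centres))

    yy, xx = np.mgrid[0:n_pixels, 0:n_pixels] / n_pixels
    mask = np.zeros((n_pixels, n_pixels), dtype=bool)
    for (cx, cy), r in zip(centres, radii):
        mask |= (xx - cx) ** 2 + (yy - cy) ** 2 < r ** 2
    return mask
```

Varying `min_spacing` and the radius distribution would then give domains with a controlled degree of clustering, which could feed the benchmarking in (a)-(c) above.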

leifdenby commented 2 years ago

> Not exactly

Well actually, this isn't quite true. I've had some luck applying pystan (https://pystan.readthedocs.io/en/latest/) to these kinds of applications before; that was for fitting a simple model of cloud-parcel rise. We could use pystan to fit a size-distribution model.
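Something like the following hypothetical sketch, for example: a Pareto (power-law) size distribution fitted with pystan 3, where the model, priors and variable names are only illustrative, not the fit cloudmetrics would ship:

```python
import numpy as np
import stan  # pystan 3 is imported as "stan"

model_code = """
data {
  int<lower=1> N;
  real<lower=0> x_min;          // smallest resolvable cloud size
  vector<lower=0>[N] sizes;     // cloud sizes (e.g. equivalent diameters)
}
parameters {
  real<lower=1> alpha;          // power-law exponent of p(x) ~ x^-alpha
}
model {
  alpha ~ normal(2, 1);                 // weak prior around typical values
  sizes ~ pareto(x_min, alpha - 1);     // Stan's Pareto shape = alpha - 1
}
"""

rng = np.random.default_rng(1)
sizes = (rng.pareto(a=0.7, size=80) + 1.0) * 2.0    # synthetic sample, x_min = 2
data = {"N": len(sizes), "x_min": 2.0, "sizes": sizes.tolist()}

posterior = stan.build(model_code, data=data)
fit = posterior.sample(num_chains=4, num_samples=1000)
alpha_samples = fit["alpha"]    # a wide posterior -> flag the fit as unreliable
```

The posterior width on `alpha` is exactly the kind of "warn when the model is a poor fit" signal mentioned above.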

martinjanssens commented 2 years ago

Wow, that looks very cool. If that's a rather stable package, then it might be really cool to use (I wouldn't guess the additional requirement of a C++ compiler is a showstopper for most people?). Do you have a feeling for whether it would be easy to integrate into what we currently have?

martinjanssens commented 2 years ago

A colleague of mine came across a dedicated Python package for fitting power laws: https://arxiv.org/abs/1305.0215. Pasting the link here for future reference in case we want to use it.
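That paper describes the powerlaw package (Alstott, Bussonnier & Plenz 2014). A minimal usage sketch, with the data here being synthetic and the variable names only illustrative:

```python
import numpy as np
import powerlaw  # pip install powerlaw

rng = np.random.default_rng(2)
cloud_sizes = (rng.pareto(a=0.66, size=200) + 1.0) * 4.0   # synthetic cloud sizes

# Fit estimates both xmin (where power-law behaviour starts) and the exponent
fit = powerlaw.Fit(cloud_sizes)
print(fit.power_law.alpha, fit.power_law.xmin)

# Likelihood-ratio test: is a power law actually better than a lognormal?
R, p = fit.distribution_compare("power_law", "lognormal")
```

Since the fit is maximum-likelihood based rather than binned, it might also sidestep some of the manual binning tuning discussed above.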