A more intelligent approach to filtering seeds?

maxf130 commented 5 years ago

We should try to have a more intelligent approach to filtering seeds. We have tried:

Removing seeds that are implausibly close to the ends of the pod
Removing seeds that have an implausibly large diameter relative to the size of the pod at that point

Both of these approaches suffer from the same problem: choosing the threshold. Just how close is implausible? Just how large is implausible?
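For reference, the two threshold filters amount to something like the sketch below. This is not the code in the repository: the `Seed` fields, the `pod_width_at` callback and both default thresholds are hypothetical names and values chosen for illustration. It mainly shows that each filter needs a hand-picked constant.

```python
from dataclasses import dataclass


@dataclass
class Seed:
    position: float  # distance along the pod axis (hypothetical field)
    diameter: float  # seed diameter, same units as the pod width


def filter_by_thresholds(seeds, pod_length, pod_width_at,
                         end_margin=0.05, max_diameter_ratio=0.8):
    """Drop seeds near the pod ends or too large for the pod at that point.

    end_margin and max_diameter_ratio are arbitrary illustrative defaults.
    """
    margin = end_margin * pod_length
    kept = []
    for seed in seeds:
        too_close = (seed.position < margin
                     or seed.position > pod_length - margin)
        too_large = seed.diameter > max_diameter_ratio * pod_width_at(seed.position)
        if not (too_close or too_large):
            kept.append(seed)
    return kept
```

Changing either default moves seeds in or out of the result, which is exactly the threshold problem.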

We should try:

[x] ~~Model the distribution of grains as a sum of two gaussians (I think this is a mixture model). There is some noise at the end of the beak, and then the remaining grains. It should be possible to distinguish between the two groups.~~
[ ] A more complicated model, with uniform distribution in the pod and gaussian noise at beak tips.
[x] ~~K-Means clustering~~
[ ] Kernel density estimation - breaking at minimal kernel density
[x] Jenks natural breaks optimization
[ ] Expectation-maximization algorithm

maxf130 commented 5 years ago

I tried to implement a gaussian mixture model to predict whether a seed belongs to the set of pseudo seeds segmented at the beak, or the set of real seeds (see SHA: 18017ee5c64cc6949aef4de17eebccf9bb5b7760).

It did not work. Invariably, lots of real seeds were classified as belonging to the set of pseudo seeds. Thinking about it, this is not surprising. The pseudo seeds are effectively some noise around a point source (the tip of the beak). Treating them as belonging to a normal distribution is not unreasonable. The rest of the pod, however, is most certainly not Gaussian. It is therefore quite unreasonable to expect this to ever work.
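For anyone reading along, a two-component Gaussian mixture on 1-D seed positions can be sketched as below. This is a dependency-free illustration fitted with expectation-maximization, not the code at the SHA above. It assumes positions are measured along the pod axis with the beak tip at the low end, so component 0 is taken to be the pseudo seeds. All function names are made up for this example.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(xs, iterations=100, min_std=1e-3):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    # Start one component at each extreme of the data.
    means = [min(xs), max(xs)]
    spread = (max(xs) - min(xs)) / 4.0 or 1.0
    stds = [spread, spread]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in (0, 1)]
            total = sum(p) or 1e-300
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate weight, mean and std of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            stds[k] = max(math.sqrt(var), min_std)
    return weights, means, stds


def pseudo_seed_probability(x, weights, means, stds):
    """Posterior probability that x belongs to component 0 (assumed beak tip)."""
    p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in (0, 1)]
    total = sum(p)
    return p[0] / total if total > 0 else 0.5
```

On cleanly separated toy data this behaves well. The failure described above shows up when the second group is spread evenly along the pod, because a single bell curve is a poor description of it.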

maxf130 commented 5 years ago

I think the Gaussian mixture model approach failed because the seeds in the pod are not normally distributed. They are more likely to follow a uniform distribution bounded by the ends of the pod. We should try a more elaborate model. Sources of things segmented as seeds:

Pseudo seeds at the tip of the beak (normal)
Pseudo seeds due to noise in the beak (uniform)
Pseudo seeds at the pod ends (minor) (normal)
Real seeds distributed uniformly throughout the pod (ish) (uniform)

The model has some external relevant parameters:

Length of the pod
Length of the beak
Seed density in the pod

And lots of parameters that are less interesting:

Standard deviation and coefficient of the beak tip noise
Pseudo seed density in the beak
Standard deviation and coefficient of the pod tip noise

Vary the above parameters to maximise the fit of the model to the observed data (segmented seeds). There may not be enough seeds per pod to have enough resolution. In that case, try the whole thing on the aggregated data per plant or per genotype.
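A rough sketch of what "maximise the fit" could look like, cut down to just two of the four sources: Gaussian noise at the beak tip plus a uniform distribution inside the pod. Everything here is an assumption made for illustration: the function names, the coordinate system (positions along the axis, `tip` being the beak tip), and the coarse grid search, which stands in for a proper optimiser.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, lo, hi):
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def log_likelihood(xs, weight, sigma, tip, pod_start, pod_end):
    """Log-likelihood of the positions under the two-source mixture.

    weight is the fraction of detections that are beak-tip noise,
    sigma the standard deviation of that noise.
    """
    total = 0.0
    for x in xs:
        density = (weight * normal_pdf(x, tip, sigma)
                   + (1.0 - weight) * uniform_pdf(x, pod_start, pod_end))
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def fit_by_grid_search(xs, tip, pod_start, pod_end):
    """Return the (weight, sigma) pair with the highest log-likelihood."""
    weights = [i / 20 for i in range(1, 20)]   # 0.05 .. 0.95
    sigmas = [0.5 * i for i in range(1, 21)]   # 0.5 .. 10.0
    candidates = ((w, s) for w in weights for s in sigmas)
    return max(candidates,
               key=lambda p: log_likelihood(xs, p[0], p[1], tip, pod_start, pod_end))
```

The full model would add the uniform beak noise and the pod-end noise as further terms in `density`, each with its own coefficient, with the coefficients constrained to sum to one.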

To be clear, I have absolutely no idea whether this is a reasonable approach from a statistical analysis perspective. I am well outside of my comfort zone.

maxf130 commented 5 years ago

Looking at scikit-learn for the above, I also came across more direct clustering methods. K-Means clustering might(?) be appropriate: https://scikit-learn.org/stable/modules/clustering.html#k-means
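To make the idea concrete, here is what two-cluster K-means does on 1-D positions, written without scikit-learn so it runs on its own. The function name and the choice to start the centres at the data extremes are assumptions for this sketch.

```python
def kmeans_1d_two_clusters(xs, iterations=50):
    """Lloyd's algorithm for k=2 on 1-D data; returns (centres, labels)."""
    centres = [min(xs), max(xs)]
    labels = [0] * len(xs)
    for _ in range(iterations):
        # Assign each point to the nearest centre.
        labels = [0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
                  for x in xs]
        # Move each centre to the mean of its members.
        new_centres = []
        for k in (0, 1):
            members = [x for x, label in zip(xs, labels) if label == k]
            new_centres.append(sum(members) / len(members) if members else centres[k])
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels
```

In one dimension the result is just a single cut point halfway between the two centres.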

maxf130 commented 5 years ago

According to this answer: https://stats.stackexchange.com/a/40475 K-means clustering on 1-d data is silly. It suggests alternatives.

maxf130 commented 5 years ago

Jenks natural breaks optimization at the level of pods resulted in losing too many real seeds. I should try this at the level of the plant or genotype.

maxf130 commented 5 years ago

Naive Jenks filtering doesn't work very well at the pod, plant or genotype level. Current state is found in SHA: 35ebd5d
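For context, with only two classes Jenks natural breaks reduces to trying every split of the sorted values and keeping the one with the smallest total within-class sum of squared deviations. The sketch below illustrates that idea. It is not the code at the SHA above, and the function names are made up.

```python
def sum_squared_deviations(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def jenks_two_classes(xs):
    """Split xs into (lower, upper) classes minimising within-class variance."""
    ordered = sorted(xs)
    if len(ordered) < 2:
        raise ValueError("need at least two values to find a break")
    best_split, best_cost = 1, float("inf")
    for split in range(1, len(ordered)):
        cost = (sum_squared_deviations(ordered[:split])
                + sum_squared_deviations(ordered[split:]))
        if cost < best_cost:
            best_split, best_cost = split, cost
    return ordered[:best_split], ordered[best_split:]
```

Because the criterion only looks at variance, a long, evenly filled pod can be cut somewhere in the middle rather than at the beak, which would match the loss of real seeds seen here.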

maxf130 commented 5 years ago

I just tried implementing KDE filtering at the genotype level. A first attempt has worked spectacularly well! SHA: 9e4d318
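The general idea of breaking at minimal kernel density can be sketched as follows. This is a dependency-free illustration, not the code in the commit above. The bandwidth, the grid size, the function names and the rule of cutting at the lowest interior local minimum are all assumptions of this sketch.

```python
import math


def make_kde(xs, bandwidth):
    """Return a Gaussian kernel density estimate of xs as a function."""
    norm = len(xs) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in xs) / norm

    return density


def kde_break(xs, bandwidth, grid_points=200):
    """Position of the lowest interior local minimum of the KDE, or None."""
    density = make_kde(xs, bandwidth)
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * i / (grid_points - 1) for i in range(grid_points)]
    values = [density(g) for g in grid]
    minima = [i for i in range(1, grid_points - 1)
              if values[i] < values[i - 1] and values[i] <= values[i + 1]]
    if not minima:
        return None  # density has a single mode, nothing to cut
    return grid[min(minima, key=lambda i: values[i])]


def split_at(xs, cut):
    """Partition xs into values at or below the cut and values above it."""
    return [x for x in xs if x <= cut], [x for x in xs if x > cut]
```

Which side of the cut holds the real seeds depends on how positions are measured. The bandwidth still has to be chosen, but it is a smoothing parameter rather than a direct plausibility threshold.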

maxf130 commented 5 years ago

I submitted and merged pull request #4 that fully implements KDE filtering.

NPPC-UK / ct_scanner_plotting

A more intelligent approach to filtering seeds? #1