NPPC-UK / ct_scanner_plotting

Plotting data from the microCT scanner

A more intelligent approach to filtering seeds? #1

Closed maxf130 closed 5 years ago

maxf130 commented 5 years ago

We should try to have a more intelligent approach to filtering seeds. We have tried:

Both of these approaches suffer from the problem of selecting the threshold: just how close is implausible, and just how large is implausible?

We should try:

maxf130 commented 5 years ago

I tried to implement a Gaussian mixture model to predict whether a seed belongs to the set of pseudo seeds segmented at the beak, or to the set of real seeds (see SHA: 18017ee5c64cc6949aef4de17eebccf9bb5b7760).

It did not work. Invariably, lots of real seeds were classified as belonging to the set of pseudo seeds. Thinking about it, this is not surprising. The pseudo seeds are effectively some noise around a point source (the tip of the beak). Treating them as belonging to a normal distribution is not unreasonable. The rest of the pod, however, is most certainly not Gaussian. It is therefore quite unreasonable to expect this to ever work.

maxf130 commented 5 years ago

I think the Gaussian mixture model approach failed because the seeds of the pod are not normally distributed. They are more likely to follow a uniform distribution specified by the bounds of the pod. We should try a more elaborate model. Sources of things segmented as seeds:

The model has some external relevant parameters:

And lots of parameters that are less interesting:

Vary the above parameters to maximise the fit of the model to the observed data (segmented seeds). There may not be enough seeds per pod to have enough resolution. In that case, try the whole thing on the aggregated data per plant or per genotype.

To be clear, I have absolutely no idea whether this is a reasonable approach from a statistical analysis perspective; I am well outside of my comfort zone.

maxf130 commented 5 years ago

Looking at scikit-learn for the above, I also came across more direct clustering methods. K-Means clustering might(?) be appropriate: https://scikit-learn.org/stable/modules/clustering.html#k-means

maxf130 commented 5 years ago

According to this answer: https://stats.stackexchange.com/a/40475 K-means clustering on 1-D data is silly. It suggests alternatives:

maxf130 commented 5 years ago

Jenks natural breaks optimization at the pod level resulted in losing too many real seeds. I should try this at the plant or genotype level.
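For reference, a two-class Jenks split on 1-D data can be sketched in plain Python as below. This is only an illustration, not the code in this repository: the function name `two_class_jenks_break`, the seed positions, and the choice of feature (something like position along the pod axis) are all assumptions made up for the example.

```python
def two_class_jenks_break(values):
    """Return the break that minimises the summed within-class squared deviation.

    Hypothetical helper: a brute-force two-class Jenks natural break.
    """
    xs = sorted(values)
    if len(xs) < 2:
        raise ValueError("need at least two values")

    def squared_deviation(chunk):
        # Sum of squared deviations from the class mean.
        mean = sum(chunk) / len(chunk)
        return sum((v - mean) ** 2 for v in chunk)

    best_cost, best_index = None, None
    for i in range(1, len(xs)):
        # Class 1 is xs[:i], class 2 is xs[i:].
        cost = squared_deviation(xs[:i]) + squared_deviation(xs[i:])
        if best_cost is None or cost < best_cost:
            best_cost, best_index = cost, i

    # Place the break midway between the two neighbouring values.
    return (xs[best_index - 1] + xs[best_index]) / 2


# Made-up positions: values spread along the pod, plus a tight cluster near 95.
positions = [10, 18, 25, 33, 41, 48, 56, 64, 94, 95, 96]
break_value = two_class_jenks_break(positions)
print(break_value)  # 60.0
```

With these made-up values the break lands at 60.0, so the evenly spread value at 64 ends up on the same side as the tight cluster near 95. That is at least consistent with real seeds being lost when one class is spread out rather than clustered.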

maxf130 commented 5 years ago

Naive Jenks filtering doesn't work very well at the pod, plant or genotype level. The current state is in SHA: 35ebd5d

maxf130 commented 5 years ago

I just tried implementing KDE filtering at the genotype level. A first attempt has worked spectacularly well! SHA: 9e4d318
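A minimal sketch of the general idea behind KDE filtering on 1-D data, in plain Python: estimate the density with Gaussian kernels, find the deepest local minimum, and split there. This is not the code in 9e4d318. The helper names, the bandwidth of 5.0, the grid size, and the seed positions are all assumptions chosen for illustration.

```python
import math


def make_gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def kde_split_point(samples, bandwidth, grid_size=200):
    """Return the grid point at the deepest interior local minimum, or None."""
    lo, hi = min(samples), max(samples)
    density = make_gaussian_kde(samples, bandwidth)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    dens = [density(x) for x in grid]

    # Interior grid points that are lower than their left neighbour and
    # no higher than their right neighbour.
    minima = [
        i for i in range(1, grid_size - 1)
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]
    ]
    if not minima:
        return None  # the density has no interior dip, so nothing to split
    return grid[min(minima, key=lambda i: dens[i])]


# Made-up positions: values spread along the pod, plus a tight cluster near 95.
positions = [10, 18, 25, 33, 41, 48, 56, 64, 94, 95, 96]
split = kde_split_point(positions, bandwidth=5.0)

real = [p for p in positions if p < split]
pseudo = [p for p in positions if p >= split]
print(real)    # [10, 18, 25, 33, 41, 48, 56, 64]
print(pseudo)  # [94, 95, 96]
```

On this toy data the split falls in the empty gap between 64 and 94, so the spread-out values stay together. The bandwidth still has to be chosen, and the result on real scans will depend on it.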

maxf130 commented 5 years ago

I submitted and merged pull request #4 that fully implements KDE filtering.