Armand1 / Iznik

testing distance methods for clustering motifs on colour #5

Open Armand1 opened 4 years ago

Armand1 commented 4 years ago

This summarizes results for testing various distance methods, and clustering methods, for the motif colour distributions.

The colour data are: allfreq_palette960620.csv

The palette is completely unsupervised, based on non-books, and slightly enhanced. It is palette060620.csv

The ground truth data are: groundtruth060620

These results are based on 282 tulips that have been assigned to 6 ground truth classes

Armand1 commented 4 years ago

HCA clusters: Within- and among-ground truth class distances.

One way to test a distance is to compare the average distance within ground-truth classes to the average distance among them. The larger the among-class distance relative to the within-class distance, the better the method.

We can invent a "specificity" measure. It works like this.

specificity<-dist_among_classes/dist_within_classes

Specificity >1 means that the mean within-class distance is smaller than the mean among-class distance; and the larger this ratio the better
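
A minimal R sketch of this measure, assuming both means are taken over all unordered pairs of motifs (the thread does not say exactly how the averages were formed). The function name `specificity_ratio` and the toy data are illustrative only.

```r
# Sketch: ratio of mean among-class to mean within-class distance.
# Assumption: both means are taken over all unordered pairs of motifs.
specificity_ratio <- function(d, classes) {
  m <- as.matrix(d)
  same_class <- outer(classes, classes, "==")
  pairs <- upper.tri(m)                      # count each pair once
  dist_within_classes <- mean(m[pairs & same_class])
  dist_among_classes <- mean(m[pairs & !same_class])
  dist_among_classes / dist_within_classes
}

# Toy data: two tight, well separated classes on a line.
x <- c(0, 1, 10, 11)
classes <- c("a", "a", "b", "b")
specificity <- specificity_ratio(dist(x), classes)
print(specificity)
```

Any `dist` object can be passed in, so the same function can be reused across distance metrics as long as the class labels are in the same order as the rows.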

| distance metric | Mean within-class | Mean among-class | Specificity |
|---|---|---|---|
| euclidean | 0.482 | 0.641 | 1.33 |
| KL (symmetric) | 6.25 | 8.63 | 1.38 |
| earthmover | 0.182 | 0.300 | 1.65 |
| weighted pairs (freq 0.2, colour 0.8) | 1.02 | 1.29 | 1.26 |

So earthmover is doing best

Armand1 commented 4 years ago

HCA clusters: Rand index

Here we use each distance metric to construct a tree by HCA. We then cut the tree at k=6, since that is how many colour classes we have, to give us 6 clusters. We then use the adjusted Rand index to measure the correspondence between the original classes and those inferred by the HCA. We cluster using Ward's method (ward.D2).

| distance metric | adj Rand |
|---|---|
| euclidean | 0.29 |
| KL (symmetric) | 0.27 |
| earthmover | 0.38 |
| weighted pairs (freq 0.2, colour 0.8) | 0.34 |

So earthmover is doing best by a small margin
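
The procedure above can be sketched in base R roughly as follows. This is an illustration on made-up data, not the code used for the table: `adjusted_rand` is a hand-written helper (so no extra package is needed), the toy frequencies and class names are invented, and k is set to the number of toy classes rather than 6.

```r
# Adjusted Rand index from the contingency table of two partitions.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  n_pairs <- function(n) n * (n - 1) / 2     # number of pairs among n items
  sum_cells <- sum(n_pairs(tab))
  sum_rows <- sum(n_pairs(rowSums(tab)))
  sum_cols <- sum(n_pairs(colSums(tab)))
  expected <- sum_rows * sum_cols / n_pairs(sum(tab))
  maximum <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Toy colour frequencies: 6 motifs, 3 colours, 3 made-up classes.
freq <- rbind(
  c(0.90, 0.10, 0.00), c(0.85, 0.15, 0.00),
  c(0.10, 0.90, 0.00), c(0.15, 0.85, 0.00),
  c(0.00, 0.10, 0.90), c(0.00, 0.15, 0.85)
)
truth <- rep(c("blue", "red", "green"), each = 2)

# Build the tree with Ward's method, cut it, compare with the classes.
hc <- hclust(dist(freq, method = "euclidean"), method = "ward.D2")
clusters <- cutree(hc, k = length(unique(truth)))
ari <- adjusted_rand(truth, clusters)
print(ari)
```

To try another metric, replace the `dist(...)` call with any object coercible by `as.dist()`.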

Armand1 commented 4 years ago

HCA clusters: dendrograms are informative. Here the nodes have been colored by the most frequent colour.

Euclidean

Cluster very much on the dominant colour. And it's got a junk cluster (cluster 2 at the top)

Euclidean_dendrogram

colour distance earthmover

Earthmover is nicer --- but it really separates out the blues into different clusters.

colourdist_earthmover_dendrogram

colour distance weighted pairs (freq 0.2, colour 0.8)

The nice thing about this method (which heavily weights towards colour) is that it groups all the blues together. It's the only method that does so. It doesn't group the sub-groups of blues very well (it's still grouping dominantly by the precise shade of blue, which cuts across the a priori groups).

colourdist_weightedpairs_dendrogram

Armand1 commented 4 years ago

HCA clusters: colors. We can also look at the distribution of colors among clusters. Here they are for earthmover and weighted pairs

earthmover

colourdist_earthmover_clusters_by_colour

colour distance weighted pairs (freq 0.2, colour 0.8)

colourdist_weightedpairs_clusters_by_colour

Armand1 commented 4 years ago

These are the ground truth colours

Actually, this illustrates why it's fucking hard to separate out the blue groups --- they're really not that distinct.

It seems to me, looking at particular cases, that the ground truth could be improved. But I have to say that the palette seems to be missing some things. See below.

tulip groundtruth

Armand1 commented 4 years ago

GMBC09_0000_t

Here's an example of a tulip that is in the blue_red_green ground truth class. It clearly has red dots. But if you look at its palette --- it doesn't have any red. Is it getting sucked into the black? This also applies to GMBC09_0002_t. So, in this case, you think it's not picking up reds in the entire image due to the internal clustering. But, with the saz leaf included, there's quite a lot of red.

GMBC09_0000_t

GMBC09_0000_t

Armand1 commented 4 years ago

VA58_0003_t

This image is a bit puzzling. It looks very red, but there's not much red in the palette. I checked the image itself -- nothing obviously wrong with it. Here it is using the whole image and not the masked tulip. There are two other tulips on the same plate which aren't in the frequency data at all.

VA58_0003_t VA58_0003_t

Armand1 commented 4 years ago

VA08_0001_t perhaps shouldn't be in the blue-green class: there is not much green. Remove from the blue-green gt. This applies to VA08_0000_t, VA08_0002_t, VA08_0003_t, VA08_0004_t as well. They're all exactly the same.

VA08_0001_t VA58_0003_t

Armand1 commented 4 years ago

Things we have talked about.

1) Some missing motifs. Where are they? Why are they not in the data? How many are missing?
2) At least some motifs are missing because the masks are not working, so the entire image is being analysed instead. Find and fix.
3) We are missing some colours in some motifs. This appears to be due to minority colours. Perhaps increase the K at the second round of clustering to pick them up. (This may have a cost downstream in giving less accurate 18 colour palettes.)
4) We probably want to reduce the number of blues and so on manually by looking at the expert palette and coming to a decision as to which "blues" really are the same.
5) I will look at the "expert" palette and think about what "blues" and so on to combine.
6) I will continue to look here at the "anomalous" ground truth motifs --- where their colours don't seem to be right.

Armand1 commented 4 years ago

problematic motifs --- where the palette does not correspond to gt classes

One thing that strikes me is that, any way you cut it, there is not a big difference between the colours of the "blue_red" and "blue_red_green" classes. I think they should be combined into a single class. That said, there are a number of tulips that really don't have green but in which grays are registering as green. Increasing the palette may solve this.

There are a number of other changes to the ground truth that should be made.

tulip groundtruth_edited

| segment_id | class | problem | solution |
|---|---|---|---|
| VA58_0003_t | red | has far too little red | there is a motif finding problem |
| FMC16_0001_t | blue_red | has green in the palette and in the calyx | move to blue_red_green |
| MET40_0000_t | blue_red | has green in the palette and some in the tulip | increase palette |
| MET40_0001_t | blue_red | has green in the palette and some in the tulip | increase palette |
| LOU03_0007_t | blue_red | has green in the palette and some in the tulip | move to blue_red_green |
| LOU03_0008_t | blue_red | has green in the palette and some in the tulip | move to blue_red_green |
| LOU01_0001_t | blue_red | has green in the palette but none in the tulip | increase palette |
| MK05_0001_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| MK05_0003_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| MK05_0004_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| MK05_0006_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| SHM03_0000_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| SHM03_0001_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| SHM03_0002_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| SHM03_0003_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| SHM03_0004_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| SHM03_0005_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| SHM05_0006_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| VA25_0010_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| VA31_0000_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
| FMC10_0000_t | blue_red | has green in the tulip but some in the palette | increase palette |
| FMC10_0001_t | blue_red | has green in the tulip but some in the palette | increase palette |
| GMBC20_0004_t | red_white_bluegreen | has no real blue in either the tulip or the palette | move to red |
| GMBC09_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| GMBC09_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| GMBC09_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| GMBC10_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| GMBC10_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| GMBC10_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| GMBC10_0003_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| MET29_0000_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| MET29_0001_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| MET29_0002_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| MET29_0003_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| MET33_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| MET33_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| MET33_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| VA64_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| VA64_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| VA64_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| VA64_0003_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
| BEN01_1_0002_t | white_red_bluegreen | has no red in the palette but does in the tulip | increase palette |
| BEN01_2_0002_t | white_red_bluegreen | has no red in the palette but does in the tulip | increase palette |
| VA08_0000_t | blue_green | the calyx is really blue | move to blue |
| VA08_0001_t | blue_green | the calyx is really blue | move to blue |
| VA08_0002_t | blue_green | the calyx is really blue | move to blue |
| VA08_0003_t | blue_green | the calyx is really blue | move to blue |
| VA08_0004_t | blue_green | the calyx is really blue | move to blue |
| GMBC08_0001_t | blue_green | there is a lot of green in calyx; but green is not being picked up | increase palette |
| BM54_0011_t | blue_green | there is very little green in the image; none captured in the palette | move to blue |

Armand1 commented 4 years ago

Missing tulips

In the ground truth we have 296 tulips. Of these only 275 are present in the frequency data. Where are the rest? I am guessing we are missing MANY tulips!

Update. We have a new groundtruth groundtruth090620.csv containing 349 motifs: every motif in here has frequency data now.

Armand1 commented 4 years ago

New ground truth

This is a new ground truth file:

groundtruth080620.csv

It discards a few and has a "blue_green" class as before and "blue_green_red" class instead of separate "blue_red" and "blue_red_green" classes.

Using weighted distance pairs, as before, the Rand is 0.32 and Specificity is 1.20 which really isn't an improvement. The blue_green and blue_red are being distributed between the two blue classes and the red_white and red_white_bluegreen are not differentiated --- though there are a few big subclusters too

| class | cluster_1 | cluster_2 | cluster_4 | cluster_3 | cluster_5 |
|---|---|---|---|---|---|
| blue_green | 46.4 | 21.4 | 32.1 | 0 | 0 |
| blue_green_red | 36.1 | 33.3 | 29.6 | 0.9 | 0 |
| red | 0 | 0 | 3.8 | 0 | 96.2 |
| red_white_bluegreen | 0 | 0 | 4.3 | 10.9 | 84.8 |
| white_red_bluegreen | 2.6 | 2.6 | 5.1 | 89.7 | 0 |
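
A table of this kind (percentage of each ground truth class falling in each cluster, rows summing to 100) can be produced in base R along these lines. The labels below are toy values, and the variable names are illustrative.

```r
# Toy labels: ground truth class and inferred cluster for four motifs.
classes <- c("blue_green", "blue_green", "red", "red")
clusters <- c(1, 2, 2, 2)

# Cross-tabulate, then convert each row to percentages of that class.
tab <- table(class = classes, cluster = clusters)
pct <- round(100 * prop.table(tab, margin = 1), 1)
print(pct)
```

Using `margin = 1` normalises by class; `margin = 2` would instead show the class composition of each cluster.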

colourdist_weightedpairs_clusters_by_colour

colourdist_weightedpairs_dendrogram

Armand1 commented 4 years ago

Simplifying the palette

Clustering the 18 colour palette across all segmented motifs (Jensen-Shannon distance; ward.D2) suggests that the following colours (in red) might usefully be unified.

colours

Armand1 commented 4 years ago

Another way of testing distances and palettes: Image monophyly

This borrows a trick from phylogenetics. It asks, for some dendrogram, if all tulips of a given image are more closely related to each other than they are to any other. If they are, then they are monophyletic. They could also be paraphyletic (something else is found within them) or polyphyletic (split between two sub clusters). In general, we should prefer things to be monophyletic rather than polyphyletic.
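
One way to sketch the monophyly test on an `hclust` tree in base R: list the leaves under every internal node, then ask whether some node contains exactly the motifs of a given image and nothing else. The helper names are hypothetical, and treating single-motif images as monophyletic is an assumption the thread does not settle.

```r
# Leaf indices under every internal node of an hclust tree.
clade_members <- function(hc) {
  clades <- vector("list", nrow(hc$merge))
  for (i in seq_len(nrow(hc$merge))) {
    members <- integer(0)
    for (child in hc$merge[i, ]) {
      if (child < 0) {
        members <- c(members, -child)            # negative: a single leaf
      } else {
        members <- c(members, clades[[child]])   # positive: an earlier merge
      }
    }
    clades[[i]] <- as.integer(sort(members))
  }
  clades
}

# TRUE if some clade holds exactly the motifs of `image` and nothing else.
# Assumption: an image with a single motif counts as monophyletic.
is_monophyletic <- function(clades, image_of_motif, image) {
  idx <- as.integer(which(unname(image_of_motif) == image))
  if (length(idx) == 1) return(TRUE)
  any(vapply(clades, function(m) identical(m, idx), logical(1)))
}

percent_monophyletic <- function(hc, image_of_motif) {
  clades <- clade_members(hc)
  images <- unique(image_of_motif)
  flags <- vapply(images,
                  function(im) is_monophyletic(clades, image_of_motif, im),
                  logical(1))
  100 * mean(flags)
}

# Toy data: five motifs on a line, from three hypothetical images.
x <- c(0, 0.1, 5, 5.1, 10)
hc <- hclust(dist(x), method = "average")
together <- percent_monophyletic(hc, c("A", "A", "B", "B", "C"))
split_up <- percent_monophyletic(hc, c("A", "B", "A", "B", "C"))
print(c(together, split_up))
```

This sketch only separates monophyletic from non-monophyletic images; telling paraphyletic from polyphyletic would need an extra step.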

Reducing the colours tends to make things less monophyletic. This is because they are less distinguishable. So there's a tradeoff.

To get 14 colours I unified four blues into two (-2), two greens into one (-1) and two reds into one (-1). To get 13 colours I unified four blues into one (-3), two greens into one (-1) and two reds into one (-1).
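
Unifying colours amounts to summing the frequency columns that map to the same merged colour. A small base R sketch, with invented column names and a hypothetical `merge_colours` helper (the mapping must name every column of the frequency matrix):

```r
# Sum frequency columns that share a merged colour name.
# `mapping` is a named character vector: old column name -> merged name.
merge_colours <- function(freq, mapping) {
  merged_names <- mapping[colnames(freq)]
  t(rowsum(t(freq), group = merged_names))
}

# Toy data: two motifs, two blues and one red (made-up names and values).
freq <- matrix(c(0.4, 0.3, 0.3,
                 0.2, 0.3, 0.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("m1", "m2"), c("blue1", "blue2", "red")))
mapping <- c(blue1 = "blue", blue2 = "blue", red = "red")

merged <- merge_colours(freq, mapping)
print(merged)
```

Row sums are unchanged by the merge, so the result is still a frequency distribution per motif.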

Here I used ALL the 959 tulips in the data, not just ground truth.

% monophyletic images

| distance/data | 18 colours | 14 colours | 13 colours |
|---|---|---|---|
| pearson | 21 | 18 | 16 |
| euclidean | 24 | 19 | 22 |
| manhattan | 29 | 25 | 24 |
| kullback-leibler | 32 | 22 | 24 |
| jensen-shannon | 33 | 28 | 23 |
| earthmover | 25 | 19 | 19 |
| unweighted pairs (0.2/0.8) | 28 | 25 | 21 |

Unsurprisingly, Jensen-Shannon and Kullback-Leibler, which really are designed for looking at differences between frequency distributions (they're very closely related information measures), do the best.
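
For reference, here is a base R sketch of the two measures on frequency vectors. It is not the code behind the table: the function names are illustrative, base-2 logs are a choice made here (so Jensen-Shannon lies between 0 and 1), and the toy frequencies are invented.

```r
# Kullback-Leibler divergence in bits; terms with p == 0 contribute 0.
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon divergence: average KL of p and q to their midpoint m.
jensen_shannon <- function(p, q) {
  m <- (p + q) / 2
  0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
}

# Pairwise matrix over the rows of a motif x colour frequency matrix.
js_matrix <- function(freq) {
  n <- nrow(freq)
  out <- matrix(0, n, n, dimnames = list(rownames(freq), rownames(freq)))
  for (i in seq_len(n)) for (j in seq_len(n)) {
    out[i, j] <- jensen_shannon(freq[i, ], freq[j, ])
  }
  out
}

# Toy data: two single-colour motifs and one evenly mixed motif.
freq <- rbind(all_blue = c(1, 0), all_red = c(0, 1), mixed = c(0.5, 0.5))
js <- js_matrix(freq)
print(round(js, 3))
```

Plain KL is infinite whenever a colour is absent from one motif but present in the other, whereas the midpoint in Jensen-Shannon keeps every term finite. Pass `as.dist(js)` to `hclust()` to cluster on it.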

Armand1 commented 4 years ago

Discovering the palette

Haven't we missed a fundamental problem here? Experts have a notion of how many colours there are --- we have the expert spectrum. But how many are there? We assume that 18 colours or whatever is right -- we are searching for them in an ad hoc way. Shouldn't we first just find out how many colours there are? Perhaps just cluster on the pixels of individual motifs using (I suggest) jensen-shannon and (maybe) spectral clustering?

I have tried this before --- but not using the "good" (non-book) images. I seem to remember that I got lots of dull colours. Now, using just the non-book images, we should be able to do it and get much better colours.

And then we discover combinations of colours: by discretizing say at the 5% level. That is, by saying, here is a list of colours that each motif has: blue, red, green etc. And we cluster on that combination of colour using gower distance. Since we have already said that two different shades of blue are fundamentally different, we don't have to use a color distance method. Of course, that means that tulips that have blue, red and green would cluster together regardless of proportions. But I think that is what matters: it's how many colours are used that matters, not their precise proportions. Does it matter, for example, whether you make a blue tulip with red stripes or a red tulip with blue stripes? I think not.
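
A rough base R sketch of the discretization idea, on invented frequencies. One substitution to flag: instead of a Gower implementation such as `cluster::daisy(metric = "gower")`, this uses `dist(method = "binary")`, which is the Jaccard distance and is what Gower reduces to when presence/absence columns are treated as asymmetric binary. The `discretise` helper and the motif names are hypothetical.

```r
# Turn colour frequencies into presence/absence at a frequency threshold.
discretise <- function(freq, threshold = 0.05) {
  (freq >= threshold) + 0      # logical -> 0/1, keeping dimnames
}

# Toy data: a blue tulip with red, a red tulip with blue, a nearly pure blue.
freq <- rbind(
  motif_a = c(blue = 0.60, red = 0.30, green = 0.10),
  motif_b = c(blue = 0.30, red = 0.60, green = 0.10),
  motif_c = c(blue = 0.97, red = 0.03, green = 0.00)
)

presence <- discretise(freq, threshold = 0.05)

# Jaccard distance on the colour combinations (stand-in for Gower).
d <- as.matrix(dist(presence, method = "binary"))
print(d)
```

Here motif_a and motif_b end up at distance 0 because they use the same set of colours in different proportions, which is the behaviour argued for above.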

Armand1 commented 4 years ago

Given the current distribution of colours, it looks like a discretization of around 2.5% might be good. The vertical lines are F(frequency) = 0.01, 0.025, and 0.05. Set the discretization threshold too high and you lose rare but important colours, such as the reds in red dots.

Rplot04

gregiee commented 4 years ago

> Shouldn't we first just find out how many colours there are? Perhaps just cluster on the pixels of individual motifs using (I suggest) jensen-shannon and (maybe) spectral clustering?

That would be a new approach, and if, as you mentioned, we shouldn't care about proportions, then I'd think it would work quite well. Can you maybe find the code where you've done this before so I can run some tests?

On the other hand, I have sent you the palette using 20 instead of 12 colours per image, and the new frequency data (corresponding to the new palette, and with fewer missing motifs).

Armand1 commented 4 years ago

These are the correlations among the colours of the new 10.06.20 palette

Rplot10

Armand1 commented 4 years ago

The new palette is much better. Hardly any tulips with minority colours are missing them.

tulip groundtruth080620groundtruth

Armand1 commented 4 years ago

Here I combined two colours A09282 and BFC4C3 --- off white and grey white, which are clearly both white. I am sure that this is better

tulip groundtruth080620groundtruth

Armand1 commented 4 years ago

And here I removed all excess white below 20%. The reds still have a lot of white (this is because they are very small, for the most part and so have a high proportion of white)

tulip groundtruth080620groundtruth

Armand1 commented 4 years ago

Where do we want to cut off? There are 3 reds. To keep them in blue-green-reds and white_red_bluegreens we'd want to set the threshold at about 0.2

| class | colour_code | q05 | q50 | q95 |
|---|---|---|---|---|
| blue_green | colour_3 | 0.01 | 0.03 | 0.04 |
| blue_green_red | colour_17 | 0.02 | 0.04 | 0.1 |
| blue_green_red | colour_3 | 0.02 | 0.04 | 0.1 |
| blue_green_red | colour_7 | 0.02 | 0.05 | 0.1 |
| red | colour_17 | 0.05 | 0.17 | 0.56 |
| red | colour_3 | 0.04 | 0.08 | 0.31 |
| red | colour_7 | 0.11 | 0.43 | 0.74 |
| red_white_bluegreen | colour_17 | 0.09 | 0.41 | 0.59 |
| red_white_bluegreen | colour_3 | 0.04 | 0.11 | 0.4 |
| red_white_bluegreen | colour_7 | 0.1 | 0.48 | 0.66 |
| white_red_bluegreen | colour_17 | 0.04 | 0.1 | 0.16 |
| white_red_bluegreen | colour_3 | 0.02 | 0.05 | 0.07 |
| white_red_bluegreen | colour_7 | 0.04 | 0.06 | 0.19 |

Armand1 commented 4 years ago

Spectral clustering does not help identify the ground truth (on continuous frequencies). This is the result from a typical run: it really is just, as before, grouping on the major colour distributions and splitting the classes

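| class | 1 | 2 | 3 | 6 | 7 | 4 | 5 |
|---|---|---|---|---|---|---|---|
| blue_green | 9.8 | 23 | 18 | 49.2 | 0 | 0 | 0 |
| blue_green_red | 15.2 | 18.8 | 20.5 | 44.6 | 0.9 | 0 | 0 |
| red | 0 | 0 | 0 | 0 | 0 | 15.4 | 84.6 |
| red_white_bluegreen | 0 | 0 | 0 | 0 | 4.3 | 30.4 | 65.2 |
| white_red_bluegreen | 0 | 0 | 2.4 | 0 | 97.6 | 0 | 0 |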