Open Armand1 opened 4 years ago
HCA clusters: Within- and among-ground truth class distances.
One way to test a distance metric is to compare the average distance within ground-truth classes to the average distance among them. The larger the among-class distance relative to the within-class distance, the better the metric.
We can invent a "specificity" measure. It works like this.
specificity<-dist_among_classes/dist_within_classes
Specificity > 1 means that the mean within-class distance is smaller than the mean among-class distance. The larger this ratio, the better the metric separates the classes.
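As a minimal sketch of the ratio above, assuming distances are pooled over all pairs of motifs (the issue does not say whether the means were pooled or averaged per class), the function name `class_specificity` and the toy data are made up for illustration:

```r
# Mean within-class and among-class distance from a pairwise distance
# object and a vector of ground-truth labels (pooled over all pairs).
class_specificity <- function(d, classes) {
  m <- as.matrix(d)
  stopifnot(nrow(m) == length(classes))
  pairs <- upper.tri(m)                    # count each pair once
  same  <- outer(classes, classes, "==")   # TRUE where both items share a class
  dist_within_classes <- mean(m[pairs & same])
  dist_among_classes  <- mean(m[pairs & !same])
  c(within = dist_within_classes,
    among = dist_among_classes,
    specificity = dist_among_classes / dist_within_classes)
}

# Toy data (invented): two well separated classes on a line
x <- matrix(c(0, 1, 10, 11), ncol = 1)
res <- class_specificity(dist(x), c("a", "a", "b", "b"))
print(res)
```

Any `dist` object can be passed in, so the same function could be reused for each of the metrics in the table below.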
distance metric | Mean within-class | Mean among-class | Specificity |
---|---|---|---|
euclidean | 0.482 | 0.641 | 1.33 |
KL (symmetric) | 6.25 | 8.63 | 1.38 |
earthmover | 0.182 | 0.300 | 1.65 |
weighted pairs (freq 0.2, colour 0.8) | 1.02 | 1.29 | 1.26 |
So earthmover is doing best
HCA clusters: Rand index
Here we use each distance metric to construct a tree by HCA. We then cut the tree at k=6, since that is how many colour classes we have, to give us 6 clusters. We then use the adjusted Rand index to measure the correspondence between the original classes and those inferred by the HCA. We cluster using Ward's method (ward.D2).
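The procedure can be sketched in base R as follows. This is an illustration on invented data, not the code used for the table below; `adjusted_rand` is a hand-written helper (Hubert and Arabie's formula) so that no extra package is assumed, and the toy example uses k = 2 rather than the k = 6 used here.

```r
# Adjusted Rand index from two label vectors, base R only.
# Undefined (division by zero) if both partitions are trivial.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  sum_cells <- sum(choose(tab, 2))
  sum_rows  <- sum(choose(rowSums(tab), 2))
  sum_cols  <- sum(choose(colSums(tab), 2))
  expected  <- sum_rows * sum_cols / choose(sum(tab), 2)
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Toy data (invented): two tight, well separated groups
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.1), ncol = 2))
truth <- rep(c("a", "b"), each = 10)

tree     <- hclust(dist(x), method = "ward.D2")  # swap dist(x) for any metric
clusters <- cutree(tree, k = 2)                  # the issue cuts at k = 6
ari      <- adjusted_rand(truth, clusters)
print(ari)
```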
distance metric | adj Rand |
---|---|
euclidean | 0.29 |
KL (symmetric) | 0.27 |
earthmover | 0.38 |
weighted pairs (freq 0.2, colour 0.8) | 0.34 |
So earthmover is doing best by a small margin
HCA clusters: dendrograms are informative. Here the nodes have been colored by the most frequent colour.
Euclidean
Clusters very much on the dominant colour. And it has a junk cluster (cluster 2 at the top).
colour distance earthmover
Earthmover is nicer --- but it really separates out the blues into different clusters
colour distance weighted pairs (freq 0.2, colour 0.8)
The nice thing about this method (which heavily weights towards colour) is that it groups all the blues together. It's the only method that does so. It doesn't group the sub-groups of blues very well (it's still grouping dominantly by the precise shade of blue, which cuts across the a priori groups).
HCA clusters: colors. We can also look at the distribution of colors among clusters. Here they are for earthmover and weighted pairs
earthmover
colour distance weighted pairs (freq 0.2, colour 0.8)
These are the ground truth colors
Actually, this illustrates why it's fucking hard to separate out the blue groups --- they're really not that distinct.
It seems to me, looking at particular cases, that the ground truth could be improved. But I have to say that the palette seems to be missing some things. See below.
GMBC09_0000_t
Here's an example of tulip that is the blue_red_green ground truth class. It clearly has red dots. But if you look at its palette --- it doesn't have any red. Is it getting sucked into the black? This also applies to GMBC09_0002_t. So, in this case you think it's not picking up reds in the entire image due to the internal clustering. But, with the saz leaf included, there's quite a lot of red.
VA58_0003_t: This image is a bit puzzling. It looks very red, but there's not much red in the palette. I checked the image itself and there is nothing obviously wrong with it. Here it is using the whole image and not the masked tulip. There are two other tulips on the same plate which aren't in the frequency data at all.
VA08_0001_t perhaps shouldn't be in the blue-green class: there is not much green. Remove it from the blue-green ground truth. This applies to VA08_0000_t, VA08_0002_t, VA08_0003_t and VA08_0004_t as well; they're all exactly the same.
Things we have talked about.
1) Some motifs are missing. Where are they? Why are they not in the data? How many are missing?
2) At least some motifs are missing because the masks are not working, so the entire image is being analysed instead. Find and fix.
3) We are missing some colours in some motifs. This appears to affect minority colours. Perhaps increase the K at the second round of clustering to pick them up. (This may have a cost downstream in giving less accurate 18 colour palettes.)
4) We probably want to reduce the number of blues and so on manually by looking at the expert palette and coming to a decision as to which "blues" really are the same.
5) I will look at the "expert" palette and think about what "blues" and so on to combine.
6) I will continue to look here at the "anomalous" ground truth motifs, where their colours don't seem to be right.
problematic motifs --- where the palette does not correspond to gt classes
One thing that strikes me is that, any way you cut it, there is not a big difference between the colours of the "blue_red" and "blue_red_green" classes. I think they should be combined into a single class. That said, there are a number of tulips that really don't have green but in which grays are registering as green. Increasing the palette may solve this.
There are a number of other changes to the ground truth that should be made
segment_id | class | problem | solution |
---|---|---|---|
VA58_0003_t | red | has far too little red | there is a motif finding problem |
FMC16_0001_t | blue_red | has green in the palette and in the calyx | move to blue_red_green |
MET40_0000_t | blue_red | has green in the palette and some in the tulip | increase palette |
MET40_0001_t | blue_red | has green in the palette and some in the tulip | increase palette |
LOU03_0007_t | blue_red | has green in the palette and some in the tulip | move to blue_red_green |
LOU03_0008_t | blue_red | has green in the palette and some in the tulip | move to blue_red_green |
LOU01_0001_t | blue_red | has green in the palette but none in the tulip | increase palette |
MK05_0001_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
MK05_0003_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
MK05_0004_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
MK05_0006_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
SHM03_0000_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
SHM03_0001_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
SHM03_0002_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
SHM03_0003_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
SHM03_0004_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
SHM03_0005_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
SHM05_0006_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
VA25_0010_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
VA31_0000_t | blue_red | has green in the palette but very little in the tulip (more grey) | increase palette |
FMC10_0000_t | blue_red | has green in the tulip but some in the palette | increase palette |
FMC10_0001_t | blue_red | has green in the tulip but some in the palette | increase palette |
GMBC20_0004_t | red_white_bluegreen | has no real blue in either the tulip or the palette | move to red |
GMBC09_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
GMBC09_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
GMBC09_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
GMBC10_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
GMBC10_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
GMBC10_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
GMBC10_0003_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
MET29_0000_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
MET29_0001_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
MET29_0002_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
MET29_0003_t.jpg | blue_red_green | has no red in the palette but does in the tulip | increase palette |
MET33_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
MET33_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
MET33_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
VA64_0000_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
VA64_0001_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
VA64_0002_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
VA64_0003_t | blue_red_green | has no red in the palette but does in the tulip | increase palette |
BEN01_1_0002_t | white_red_bluegreen | has no red in the palette but does in the tulip | increase palette |
BEN01_2_0002_t | white_red_bluegreen | has no red in the palette but does in the tulip | increase palette |
VA08_0000_t | blue_green | the calyx is really blue | move to blue |
VA08_0001_t | blue_green | the calyx is really blue | move to blue |
VA08_0002_t | blue_green | the calyx is really blue | move to blue |
VA08_0003_t | blue_green | the calyx is really blue | move to blue |
VA08_0004_t | blue_green | the calyx is really blue | move to blue |
GMBC08_0001_t | blue_green | there is a lot of green in calyx; but green is not being picked up | increase palette |
BM54_0011_t | blue_green | there is very little green in the image; none captured in the palette | move to blue |
Missing tulips
In the ground truth we have 296 tulips. Of these only 275 are present in the frequency data. Where are the rest? I am guessing we are missing MANY tulips!
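A quick way to list which ground truth motifs lack frequency data is a set difference on the segment IDs. This is a sketch only: the two ID vectors below are invented stand-ins for the `segment_id` columns of the ground truth and frequency files, and deriving the plate name by stripping the trailing `_NNNN_t` is an assumption about the ID scheme.

```r
# Stand-in IDs (invented subset); in practice read these from the two files
gt_ids   <- c("VA08_0000_t", "VA08_0001_t", "MET40_0000_t", "BEN01_1_0002_t")
freq_ids <- c("VA08_0000_t", "MET40_0000_t")

# Motifs in the ground truth that have no frequency data
missing_ids <- setdiff(gt_ids, freq_ids)

# Assumed ID scheme: <plate>_<4-digit index>_t, so strip the suffix to get the plate
missing_plate <- sub("_[0-9]{4}_t$", "", missing_ids)
missing_by_plate <- table(missing_plate)

print(missing_ids)
print(missing_by_plate)
```

Counting by plate may help tell whether whole images were dropped or only individual motifs.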
Update. We have a new ground truth file, groundtruth090620.csv, containing 349 motifs: every motif in here has frequency data now.
New ground truth
This is a new ground truth file:
groundtruth080620.csv
It discards a few motifs and has a "blue_green" class as before, and a "blue_green_red" class instead of separate "blue_red" and "blue_red_green" classes.
Using weighted distance pairs, as before, the Rand is 0.32 and Specificity is 1.20 which really isn't an improvement. The blue_green and blue_red are being distributed between the two blue classes and the red_white and red_white_bluegreen are not differentiated --- though there are a few big subclusters too
class | cluster_1 | cluster_2 | cluster_4 | cluster_3 | cluster_5 |
---|---|---|---|---|---|
blue_green | 46.4 | 21.4 | 32.1 | 0 | 0 |
blue_green_red | 36.1 | 33.3 | 29.6 | 0.9 | 0 |
red | 0 | 0 | 3.8 | 0 | 96.2 |
red_white_bluegreen | 0 | 0 | 4.3 | 10.9 | 84.8 |
white_red_bluegreen | 2.6 | 2.6 | 5.1 | 89.7 | 0 |
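A table of this kind (percentage of each class falling in each cluster, rows summing to roughly 100) can be produced with `table` and `prop.table`. The labels below are invented for illustration; only the shape of the computation is the point.

```r
# Invented labels: ground-truth class and inferred cluster for seven motifs
class   <- c("red", "red", "red",
             "blue_green", "blue_green", "blue_green", "blue_green")
cluster <- c(5, 5, 4, 1, 1, 2, 4)

# Row-wise percentages: how each class is distributed across clusters
pct <- round(100 * prop.table(table(class, cluster), margin = 1), 1)
print(pct)
```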
Simplifying the palette
Clustering the 18 colour palette across all segmented motifs (Jensen-Shannon; ward.D2) suggests that the following colours (in red) might usefully be unified.
Another way of testing distances and palettes: Image monophyly
This borrows a trick from phylogenetics. It asks, for some dendrogram, if all tulips of a given image are more closely related to each other than they are to any other. If they are, then they are monophyletic. They could also be paraphyletic (something else is found within them) or polyphyletic (split between two sub clusters). In general, we should prefer things to be monophyletic rather than polyphyletic.
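One way to test this on an `hclust` tree is to list the leaves under every internal node and ask whether some node contains exactly the tulips of one image. The sketch below is an assumption about how the test could be implemented, not the code behind the table further down; `clade_members` and `is_monophyletic` are made-up helper names, the data are invented, and images with a single tulip are returned as `NA` rather than counted as monophyletic.

```r
# Leaves below each merge of an hclust tree (row i of tree$merge).
clade_members <- function(tree) {
  n_merges <- nrow(tree$merge)
  members <- vector("list", n_merges)
  for (i in seq_len(n_merges)) {
    leaves <- integer(0)
    for (child in tree$merge[i, ]) {
      # negative entries are single observations, positive ones earlier merges
      leaves <- c(leaves, if (child < 0) -child else members[[child]])
    }
    members[[i]] <- leaves
  }
  members
}

# TRUE if some internal node contains exactly the observations in idx.
is_monophyletic <- function(idx, members) {
  if (length(idx) < 2) return(NA)   # undefined for a single tulip
  any(vapply(members,
             function(m) length(m) == length(idx) && all(m %in% idx),
             logical(1)))
}

# Toy data (invented): images A and B are tight pairs, image C is split
x     <- c(0, 0.1, 0.3, 5, 5.1, 9)
image <- c("A", "A", "C", "B", "B", "C")
tree  <- hclust(dist(x), method = "ward.D2")

members <- clade_members(tree)
mono <- sapply(split(seq_along(image), image), is_monophyletic, members = members)
pct_monophyletic <- 100 * mean(mono, na.rm = TRUE)
print(mono)
```

This test does not distinguish paraphyletic from polyphyletic groups; both simply come out as `FALSE`.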
Reducing the colours tends to make things less monophyletic. This is because they are less distinguishable. So there's a tradeoff.
To get 14 colours I unified four blues into two (-2), two greens into one (-1) and two reds into one (-1). To get 13 colours I unified four blues into one (-3), two greens into one (-1) and two reds into one (-1).
Here I used ALL the 959 tulips in the data, not just ground truth.
% monophyletic images
distance/data | 18 colours | 14 colours | 13 colours |
---|---|---|---|
pearson | 21 | 18 | 16 |
euclidean | 24 | 19 | 22 |
manhattan | 29 | 25 | 24 |
kullback-leibler | 32 | 22 | 24 |
jensen-shannon | 33 | 28 | 23 |
earthmover | 25 | 19 | 19 |
unweighted pairs (0.2/0.8) | 28 | 25 | 21 |
Unsurprisingly, Jensen-Shannon and Kullback-Leibler, which really are designed for looking at differences between frequency distributions (they're very closely related information measures), do the best with the 18 colour palette.
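For reference, here is a minimal base R sketch of a Jensen-Shannon distance between colour frequency vectors. It is an illustration, not the implementation used for the table: it takes the square root of the divergence (base 2, so values lie in [0, 1]), and whether the results above used the divergence or its square root is not stated. The function names and toy frequencies are made up.

```r
# Kullback-Leibler divergence in bits; terms with p == 0 contribute 0
kl_div <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log2(p[keep] / q[keep]))
}

# Jensen-Shannon divergence: symmetrised KL against the midpoint distribution
js_div <- function(p, q) {
  m <- (p + q) / 2
  (kl_div(p, m) + kl_div(q, m)) / 2
}

# Pairwise Jensen-Shannon distance between the rows of a frequency matrix
js_dist <- function(freq) {
  freq <- freq / rowSums(freq)   # make sure each row sums to 1
  n <- nrow(freq)
  d <- matrix(0, n, n, dimnames = list(rownames(freq), rownames(freq)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # max(0, .) guards against tiny negative values from rounding
      d[i, j] <- d[j, i] <- sqrt(max(0, js_div(freq[i, ], freq[j, ])))
    }
  }
  as.dist(d)
}

# Toy frequencies (invented): m1 and m3 identical, m2 has no colour in common
freq <- rbind(m1 = c(1, 0), m2 = c(0, 1), m3 = c(1, 0))
d <- js_dist(freq)
print(d)
```

The resulting `dist` object can be passed straight to `hclust`.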
Discovering the palette
Haven't we missed a fundamental problem here? Experts have a notion of how many colours there are --- we have the expert spectrum. But how many are there? We assume that 18 colours or whatever is right -- we are searching for them in an ad hoc way. Shouldn't we first just find out how many colours there are? Perhaps just cluster on the pixels of individual motifs using (I suggest) jensen-shannon and (maybe) spectral clustering?
I have tried this before --- but not using the "good" (non-book) images. I seem to remember that I got lots of dull colours. Now, using just the non-book images, we should be able to do it and get much better colours.
And then we discover combinations of colours: by discretizing say at the 5% level. That is, by saying, here is a list of colours that each motif has: blue, red, green etc. And we cluster on that combination of colour using gower distance. Since we have already said that two different shades of blue are fundamentally different, we don't have to use a color distance method. Of course, that means that tulips that have blue, red and green would cluster together regardless of proportions. But I think that is what matters: it's how many colours are used that matters, not their precise proportions. Does it matter, for example, whether you make a blue tulip with red stripes or a red tulip with blue stripes? I think not.
Given the current distribution of colours, it looks like a discretization threshold of around 2.5% might be good. The vertical lines are F(frequency) = 0.01, 0.025, and 0.05. Set the discretization threshold too high and you lose rare but important colours, such as the reds in red dots.
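The discretize-then-cluster idea can be sketched as below, on invented frequencies. For 0/1 presence data treated symmetrically, the Gower distance reduces to the share of colours on which two motifs disagree, so base R's Manhattan distance divided by the number of colours is used here instead of assuming a particular Gower implementation. If shared absences should not count as agreement, `dist(present, method = "binary")` (a Jaccard-type distance) would be an alternative.

```r
# Toy colour frequencies (invented): rows are motifs, columns are palette colours
freq <- matrix(c(0.70, 0.28, 0.02,
                 0.25, 0.73, 0.02,
                 0.60, 0.00, 0.40),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("blue_with_red", "red_with_blue", "blue_with_green"),
                               c("blue", "red", "green")))

threshold <- 0.025                   # the 2.5% level suggested above
present <- (freq >= threshold) * 1   # 1 if the motif uses the colour, else 0

# Gower distance for symmetric 0/1 variables = share of mismatched colours
d <- dist(present, method = "manhattan") / ncol(present)
print(as.matrix(d))
```

In this toy case the blue tulip with red stripes and the red tulip with blue stripes are at distance 0, which is the behaviour argued for above.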
> Shouldn't we first just find out how many colours there are? Perhaps just cluster on the pixels of individual motifs using (I suggest) jensen-shannon and (maybe) spectral clustering?

That would be a new approach, and if, as you mentioned, we shouldn't care about proportions, then I'd think it would work quite well. Can you maybe find the code where you've done this before so I can run some tests?
On the other hand, I have sent you the palette using 20 instead of 12 colours per image, and the new frequency data (corresponding to the new palette, and with fewer missing motifs).
These are the correlations among the new 10.06.20 palette
The new palette is much better. Hardly any tulips with minority colours are missing them.
Here I combined two colours A09282 and BFC4C3 --- off white and grey white, which are clearly both white. I am sure that this is better
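Combining palette colours amounts to summing their frequency columns. A minimal sketch, assuming the frequency table has one column per palette colour named by hex code (the column naming, the helper name `merge_colours`, the third colour and the numbers are all invented for illustration):

```r
# Sum the frequency columns of colours that map to the same unified colour.
# mapping: named character vector, names = original columns, values = new name
merge_colours <- function(freq, mapping) {
  groups <- colnames(freq)
  hit <- groups %in% names(mapping)
  groups[hit] <- unname(mapping[groups[hit]])
  # rowsum() sums rows by group, so transpose, sum, and transpose back
  t(rowsum(t(freq), group = groups))
}

# Toy frequencies (invented) for two motifs
freq <- matrix(c(0.2, 0.3, 0.5,
                 0.1, 0.1, 0.8),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("m1", "m2"), c("A09282", "BFC4C3", "2F4F9F")))

merged <- merge_colours(freq, c(A09282 = "white", BFC4C3 = "white"))
print(merged)
```

Row sums are unchanged by the merge, so the result is still a frequency table.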
And here I removed all excess white below 20%. The reds still have a lot of white (this is because they are very small, for the most part and so have a high proportion of white)
Where do we want to cut off? There are 3 reds. To keep them in blue-green-reds and white_red_bluegreens we'd want to set the threshold at about 0.02, since their 5% quantiles in the table below are 0.02 to 0.04.
class | colour_code | q05 | q50 | q95 |
---|---|---|---|---|
blue_green | colour_3 | 0.01 | 0.03 | 0.04 |
blue_green_red | colour_17 | 0.02 | 0.04 | 0.1 |
blue_green_red | colour_3 | 0.02 | 0.04 | 0.1 |
blue_green_red | colour_7 | 0.02 | 0.05 | 0.1 |
red | colour_17 | 0.05 | 0.17 | 0.56 |
red | colour_3 | 0.04 | 0.08 | 0.31 |
red | colour_7 | 0.11 | 0.43 | 0.74 |
red_white_bluegreen | colour_17 | 0.09 | 0.41 | 0.59 |
red_white_bluegreen | colour_3 | 0.04 | 0.11 | 0.4 |
red_white_bluegreen | colour_7 | 0.1 | 0.48 | 0.66 |
white_red_bluegreen | colour_17 | 0.04 | 0.1 | 0.16 |
white_red_bluegreen | colour_3 | 0.02 | 0.05 | 0.07 |
white_red_bluegreen | colour_7 | 0.04 | 0.06 | 0.19 |
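A quantile table like this can be computed from long-format frequency data with `aggregate`. The sketch assumes a data frame with columns `class`, `colour_code` and `freq` (one row per motif and colour); those column names and the values are invented for illustration.

```r
# Toy long-format data (invented): frequency of one red per motif
freq_long <- data.frame(
  class       = c("red", "red", "red", "blue_green", "blue_green", "blue_green"),
  colour_code = "colour_7",
  freq        = c(0.1, 0.4, 0.7, 0.01, 0.03, 0.05)
)

# 5%, 50% and 95% quantiles of frequency per class and colour
qs <- aggregate(freq ~ class + colour_code, data = freq_long,
                FUN = quantile, probs = c(0.05, 0.5, 0.95))
print(qs)

red_median <- qs$freq[qs$class == "red", "50%"]
```

The quantiles come back as a matrix column (`qs$freq`) with columns named `5%`, `50%` and `95%`.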
Spectral clustering does not help identify the ground truth (on continuous frequencies). This is the result from a typical run: it really is just, as before, grouping on the major colour distributions and splitting the classes
class | 1 | 2 | 3 | 6 | 7 | 4 | 5 |
---|---|---|---|---|---|---|---|
blue_green | 9.8 | 23 | 18 | 49.2 | 0 | 0 | 0 |
blue_green_red | 15.2 | 18.8 | 20.5 | 44.6 | 0.9 | 0 | 0 |
red | 0 | 0 | 0 | 0 | 0 | 15.4 | 84.6 |
red_white_bluegreen | 0 | 0 | 0 | 0 | 4.3 | 30.4 | 65.2 |
white_red_bluegreen | 0 | 0 | 2.4 | 0 | 97.6 | 0 | 0 |
This summarizes results for testing various distance methods, and clustering methods, for the motif colour distributions.
The colour data are: allfreq_palette960620.csv
The palette is completely unsupervised, based on non-books, and slightly enhanced. It is palette060620.csv
The ground truth data are: groundtruth060620
These results are based on 282 tulips that have been assigned to 6 ground truth classes