Open Armand1 opened 5 years ago
That actually makes sense. I think what you want is to first separate motifs based on their shapes, so a red and a blue tulip can be in the same group initially if they have similar shapes, and then do a second level of clustering based purely on color. This might also help untangle the points in the high-dimensional space, because when you plot them the result is usually a large spider web, without any obvious differentiation between groups. Hierarchical clustering would help in this sense. This is what I used to decide how many clusters we should have for images.
For the color info, it is a bad idea to use only one value. We should either use all pixels, after reducing dimensionality through PCA, or stick to histograms (though you say those do not work either). Color quantization would work: we could use 8 or 16 colors instead of 256. One way I know is to cluster the image's pixels with k-means and then reconstruct the image using the mean color of each cluster. I will give it a try. I will also run a quick analysis on motifs, similar to what I did with images. Unfortunately most of my time went into generating the motif features, and I did not have time to play with the data (the best part). This is on my to-do list now.
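A minimal sketch of the k-means quantization idea, in plain Python rather than R and under stated assumptions: pixels are given as a list of (R, G, B) tuples, `quantize_colors` and `sq_dist` are hypothetical names, and the initial centroids are picked deterministically from the sorted distinct colours, which is a crude stand-in for a proper initialisation such as k-means++.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two colour tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_colors(pixels, k=8, iters=20):
    """Cluster RGB pixels with k-means; return (palette, quantized pixels).

    Assumption: `pixels` is a list of (R, G, B) tuples with values 0-255.
    """
    distinct = sorted(set(pixels))
    k = min(k, len(distinct))
    # Crude deterministic initialisation: evenly spaced distinct colours.
    centroids = [
        tuple(float(c) for c in distinct[i * len(distinct) // k])
        for i in range(k)
    ]
    labels = [0] * len(pixels)
    for _ in range(iters):
        # Assignment step: each pixel joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in pixels
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(ch) / len(members) for ch in zip(*members)
                )
    palette = [tuple(round(c) for c in cen) for cen in centroids]
    return palette, [palette[lab] for lab in labels]


# Toy example: three reddish and two bluish pixels reduced to 2 colours.
pixels = [(250, 0, 0), (255, 5, 0), (252, 2, 2), (0, 0, 250), (5, 0, 255)]
palette, quantized = quantize_colors(pixels, k=2)
```

Each pixel is replaced by the mean colour of its cluster, so an image rebuilt from `quantized` uses at most `k` colours.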
For shape, I will check the other issue and edit this accordingly. I cannot name any sophisticated features for shape without checking the literature first, but basics like morphological features should give us an idea.
Don't work on colors --- see my "colors" issue --- I am making good progress with this. (Actually, the great thing is that I am learning how to work with images in R!) But information about the shape descriptors would be great.
Please see my notes on shape descriptors in https://github.com/Armand1/Iznik/issues/2.
What are your views on clustering AFTER tSNE? There is some discussion on the web, both pro and con.
I understand that tSNE can give you purely artifactual clusters if you hammer it enough. But I have to say I am struggling a bit to get interesting solutions on the raw data or PCA transformed.... so I am tempted. So, legit or avoid and use for illustration only?
It seems to me that we want to cluster based on both the similarity of the colours and their relative frequencies. It so happens that there is an R package, colordistance (https://cran.r-project.org/web/packages/colordistance/vignettes/color-metrics.html), that does exactly this. In fact, it works directly from images by binning colours; had I known about it, we might have used that directly (except that we could not have done our palette constraining).
In any event, I take the RGB space and use a distance metric, "weighted.pairs", which allows you to specify how much weight to give to colour similarity versus relative frequency. The relative weight makes quite a lot of difference, but in general a weighting of 0.5 for each works well. A test gives excellent results using bog-standard HCA on a selection of tulips.
What I want you to notice is that the tulip motifs, within each image, are grouped together. The exception is AM03 and AM04 -- which both have blue tulips (albeit a slightly different spectrum of blue) with green bits. So that's just as it should be. This is using the 18 constrained palette. Also note that the "red spot" tulips of AM05 and GMBC02 are very close -- but not exactly in the same cluster since the latter has green as well. This is the first time that we have got results as good as this.
The downside of this approach is that we can only use clustering methods that take a dissimilarity matrix as an input -- so that seems to forbid gaussian mixture models which give you an optimal number of clusters via BIC. I think we're restricted to kmeans and HCA. I think I will try the latter, and try to get bootstrap support or something for clusters.
That looks really promising. We might use this for motif grouping, but to my understanding it cannot be applied at earlier stages, because we do not yet have frequency information for the motifs (to get the frequencies we first need to know what the palettes are like). Please correct me if I am wrong, though. Otherwise, I like it.
On Sun, 8 Sep 2019 at 14:11, Armand1 wrote:
[image: initial plot] https://user-images.githubusercontent.com/8698079/64488687-87d3f400-d242-11e9-892d-d31ca2a2ff73.jpg
That's right --- it's only for motif clustering
Testing:
data: groundtruth tulips
palette: tulips_18_constrained_greyremoved.csv
distance: colourdistance
method: HCA wardD
adjustedRandIndex: 0.16
k=5
performance: poor

Testing:
data: groundtruth tulips
palette: tulips_18_constrained.csv
distance: colourdistance
method: HCA wardD
adjustedRandIndex: 0.24
k=5
performance: better

Testing:
data: groundtruth tulips
palette: tulips_18_constrained.csv
method: GMM with PCA
details: EVV k=6
adjustedRandIndex: 0.18
performance: poor

details: EVV k=5
adjustedRandIndex: 0.15
performance: poor

details: VEV k=12 -- this is the best solution without PCA
adjustedRandIndex: 0.17
performance: poor
So GMM does a really poor job as well against ground truth. For some reason, using the greyremoved data causes the clustering here to collapse entirely.
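For reference, the adjusted Rand index scores reported above compare a clustering against the ground-truth labels, corrected for chance agreement: 1 means identical partitions, and values near 0 mean agreement no better than random. A minimal Python sketch of the standard formula follows; the function name is illustrative, and this is not necessarily the implementation that produced the numbers above.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(truth)
    if n < 2:
        return 1.0  # degenerate case: nothing to compare
    # Pair counts within contingency cells, rows (truth), columns (predicted).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial (e.g. one cluster each)
    return (sum_cells - expected) / (max_index - expected)


# Cluster label names do not matter, only the grouping does.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Scores in the 0.15 to 0.25 range, as above, therefore indicate only weak agreement with the ground truth.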
data: groundtruth tulips
palette: tulips_18_constrained.csv
method: 1-cosine similarity
adjustedRandIndex: 0.24
performance: better
In all cases it's clearly pulling things together based on the majority shade of blue (e.g., FMC16, FMC12, FMC16) while ignoring whether they have red in them or not.
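The dominance of the majority colour is what one would expect from 1 - cosine similarity on colour-proportion vectors. A small Python sketch, using a made-up three-colour palette (blue, red, white) and made-up proportions purely for illustration:

```python
from math import sqrt


def cosine_distance(p, q):
    """1 - cosine similarity between two non-zero colour-proportion vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


# Hypothetical proportions over the palette [blue, red, white].
blue_only = [1.0, 0.0, 0.0]
blue_with_red = [0.9, 0.1, 0.0]
mostly_white = [0.1, 0.0, 0.9]

d_red_spot = cosine_distance(blue_only, blue_with_red)  # very small
d_white = cosine_distance(blue_only, mostly_white)      # large
```

A 10% red component barely moves the vector away from pure blue, so tulips with and without red spots end up close together, while a change in the majority colour produces a large distance.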
data: groundtruth tulips
palette: tulips_18_constrained, simplified blues, greens and reds
method: 1-cosine similarity
adjustedRandIndex: 0.34
performance: better
Now, we have basically two groups: mostly white and mostly blue. Some classes are split
data: groundtruth tulips
palette: tulips_18_constrained, simplified blues, greens and reds; TF-IDF applied to upweight rare colours
method: 1-cosine similarity
adjustedRandIndex: 0.25
performance: no improvement
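For readers unfamiliar with the weighting, here is a rough Python sketch of how TF-IDF can upweight rare colours. It treats each motif as a "document" and each palette colour as a "term". The smoothed IDF formula used here is one common variant and is an assumption; the thread does not say which variant was applied. The function name is hypothetical.

```python
from math import log


def tfidf_weight(proportions):
    """Scale each colour proportion by a smoothed inverse document frequency.

    Assumption: `proportions` is a list of rows, one per motif, each row
    giving the pixel proportion of every palette colour in the same order.
    """
    n_motifs = len(proportions)
    n_colors = len(proportions[0])
    # Document frequency: in how many motifs does each colour occur at all?
    df = [sum(1 for row in proportions if row[c] > 0) for c in range(n_colors)]
    # Smoothed IDF: rare colours get weights above 1, ubiquitous ones get 1.
    idf = [log((1 + n_motifs) / (1 + d)) + 1 for d in df]
    return [[p * w for p, w in zip(row, idf)] for row in proportions]


# Hypothetical palette [blue, red]: red occurs in only one of three motifs.
weighted = tfidf_weight([[0.9, 0.1], [1.0, 0.0], [1.0, 0.0]])
```

The rare red component is scaled up relative to the ubiquitous blue, but it remains small in absolute terms, which may be one reason the weighting did not change the clustering much.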
I decided to go back to basics and look at the palettes:
This is the Expert (Canonical) palette compared to the 18 constrained palette. The colours are simply clustered in CIELAB space (converted from the RGB values).
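As an illustration of the RGB to CIELAB step, here is a Python sketch of the standard conversion. It assumes the RGB values are 8-bit sRGB and uses the D65 reference white; neither is confirmed in this thread. The function names are illustrative.

```python
def srgb_to_lab(r, g, b):
    """Convert 8-bit sRGB to CIELAB (assumes sRGB input, D65 white point)."""
    def linearize(c):
        # Undo the sRGB gamma curve.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    # Linear RGB to CIE XYZ (sRGB primaries, D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    # Normalise by the D65 reference white before applying f.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def lab_distance(rgb1, rgb2):
    """Euclidean distance in CIELAB between two sRGB colours."""
    lab1, lab2 = srgb_to_lab(*rgb1), srgb_to_lab(*rgb2)
    return sum((u - v) ** 2 for u, v in zip(lab1, lab2)) ** 0.5
```

Distances in CIELAB track perceived colour difference more closely than distances in raw RGB, which is the usual motivation for clustering palette colours there.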
So what are we losing? Bright red; some resolution and differentiation in the turquoises/aquamarines; pale purple and pale green.
This is the Expert (Canonical) palette compared to the 30 unconstrained palette
Which distance metric to use?
I suppose that Yuchen simply used Euclidean distance when doing his HCA. But there are many alternatives. Which is best?
One possibility is the Kullback-Leibler (KL) divergence, which measures how one probability distribution differs from another. Here, the probability distribution for each segmented image is the proportion of pixels of each colour. Note that KL divergence is not symmetric, so it is not a distance in the strict sense. It is implemented in the R package philentropy.
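A minimal Python sketch of KL divergence on colour-proportion vectors. Two choices here are assumptions rather than anything specified in this thread: a small `eps` is added so that colours absent from one motif do not produce a division by zero, and the symmetric version simply sums both directions (the Jeffreys divergence), which is one way to get a dissimilarity suitable for HCA.

```python
from math import log


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two colour-proportion vectors, with additive smoothing."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    # Renormalise after smoothing so both vectors sum to 1.
    return sum((a / sp) * log((a / sp) / (b / sq)) for a, b in zip(p, q))


def symmetric_kl(p, q, eps=1e-9):
    """Jeffreys divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q, eps) + kl_divergence(q, p, eps)


# Hypothetical proportions over a three-colour palette.
motif_a = [0.7, 0.2, 0.1]
motif_b = [0.4, 0.4, 0.2]
d_ab = kl_divergence(motif_a, motif_b)
d_ba = kl_divergence(motif_b, motif_a)  # differs from d_ab
```

Because `d_ab` and `d_ba` differ, a symmetrised form such as `symmetric_kl` is needed before building the dissimilarity matrix.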
Another option is the set of distances implemented in the R package colordistance (see https://github.com/Armand1/Iznik/issues/3#issuecomment-529201127), which take the distance between the colours themselves into account.
I think we should try both these distances and see which works best using bog-standard HCA and some arbitrary number of clusters (say 10)
How do we establish which distance metric (or clustering method) is best?
This is a difficult problem. We need to establish (i) distance metric; (ii) clustering method; (iii) number of clusters. But how can we do so without knowing the true number of clusters?
One way might be by establishing a data set with ground-truth. These would be groups of tulips, from the same or different images, that we think are "the same" --- visually. For example, blue tulips with red spots are common. As are blue tulips with red spots and green sepals.
But this is not as easy as it sounds. The blues and reds vary and there are tulips with intermediate colours. So I think we need to find a set --- or several sets --- that we really believe are similar
We want to cluster motifs. My current idea, based on preliminary experiments and on interpretability, is to cluster motifs using separate shape/size and color models. The reasoning is that there is a set of shapes, and that colors are independent of them. So we would get, say, 10 shape clusters for tulips and then 10 color clusters. Combining them gives potentially 10 x 10 = 100 tulip styles, although fewer than that will actually be observed.
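To make the combination step concrete, here is a toy Python sketch. It assumes each motif already has a shape-cluster label and a color-cluster label from the two separate models; the labels and the function name are made up for illustration.

```python
from collections import Counter


def combine_styles(shape_labels, color_labels):
    """Count motifs per (shape cluster, color cluster) pair, i.e. per style."""
    return Counter(zip(shape_labels, color_labels))


# Hypothetical labels for five motifs: 3 shape clusters x 2 color clusters.
shape_labels = [0, 0, 1, 1, 2]
color_labels = [0, 0, 0, 1, 1]
styles = combine_styles(shape_labels, color_labels)

possible = len(set(shape_labels)) * len(set(color_labels))  # 3 * 2 = 6
observed = len(styles)                                      # 4 in this toy case
```

Only the pairs that actually occur become styles, which is why the observed number is smaller than the full product.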
Do you think that this makes sense? I am clustering using Gaussian mixture models. (I have tried putting it through tSNE first, but that does not help).
For the color clusters I am using only the mean values --- throwing in the whole histogram, or the min or max, does not appear to improve performance; in fact, it tends to cause the GMM to collapse. This gives me 9 clusters. However, there are still quite a lot of tulips from the same image that go into different clusters when they really shouldn't. (For example, the tulips within AM03 tend to go into 2 clusters when they shouldn't. The same is true of AM04.) I think that this is further evidence that we should try to group pixel values into discrete "colors" and then cluster the motifs.
For the shape variables I am using those listed in the previous issue. Currently my shape clustering is not performing well (aside from the fact that I need to normalize the size variables by total motif size)