Armand1 opened 5 years ago
It might be possible to avoid this clustering step if there is some standard lookup table that classifies RGB values into standard "basic" colors --- of which there are said to be 11 or so.
Experimenting with clustering shows that the color clusters aren't bad. But it's clear that noise in the color values sometimes causes tulips that are "the same" --- that is, apparently similar and in the same artifact --- to fall into different clusters. This is why, I think, we want to discretize the pixel values first into a handful of colors.
I established some ground truth by (1) identifying all tulips that are mostly blue with some red --- call them blue-reds. Blue-reds make up around 305, or 15%, of the total number of tulips. The best clustering solution (GMM, k=7) puts 77% of them in a single cluster. But about 50% of the tulips in that "blue-red" cluster are not blue-reds; I imagine they're mostly just blues. It seems to be splitting the blue-reds by saturation --- and some of the blue-reds are assigned to different clusters with high probability (so they're not just low-probability stragglers).
I also (2) counted, just by looking at them, the number of different kinds of tulips in about 100 images. That is, whether I thought there were 1, 2 or 3 types. I then compared these numbers to the number of colour tulip clusters found in each image. In the best solution there was a correlation of r=0.74.
So, the color clustering is not doing terribly, but I think it can be improved by discretizing the color values. It's clear to me that there is a lot of variation we'd like to get rid of. Here are some examples of blue-red tulips that, I think, we'd like to have in the same cluster on the basis of color values --- though some have green and others do not.
The difficult step, for me, is getting the RGB pixels values. If you were to output them for me, one csv file per motif, I could take it from there.
So I have managed to do this, or at least start it. Using the imager library in R, I was able to load a sample of the motifs, extract the RGB values for all pixels, and export them as a single data frame. I managed to do this for 50 tulips; after that I ran out of memory.
I then clustered the RGB values. This was a moderate success --- they're not nearly so clumped as one might suppose. I think that I want around 20 color clusters; fewer tends to produce greys. I then assigned each pixel of each of my 50 tulips to a cluster, estimated the median RGB value of every cluster, and gave the cluster that hex color. So, in effect, I am reducing the large number of observed colors to a manageable palette.
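The reduction step might be sketched like this in Python (the work above was done in R/imager, so this is a translation, not the actual pipeline): a minimal Lloyd's k-means in plain NumPy on toy pixel data, with each cluster named by its median colour as a hex code. The `kmeans` and `palette_hex` helpers and the toy blobs are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means on an (n, 3) float array of pixels."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign every pixel to its nearest center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, labels

def palette_hex(pixels, k):
    """Cluster pixels, then name each cluster by its median RGB as a hex code."""
    _, labels = kmeans(pixels.astype(float), k)
    hexes = []
    for j in range(k):
        members = pixels[labels == j]
        if len(members):
            med = np.median(members, axis=0)
            hexes.append("#{:02X}{:02X}{:02X}".format(*(int(v) for v in med)))
    return hexes, labels

# toy data: two well-separated colour blobs (a "blue" and a "red")
rng = np.random.default_rng(1)
blues = rng.integers(0, 40, (200, 3)); blues[:, 2] += 200
reds = rng.integers(0, 40, (200, 3)); reds[:, 0] += 200
hexes, labels = palette_hex(np.vstack([blues, reds]), 2)
```

With k=20 on real pixels, `hexes` would be the manageable palette described above.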
Below is an example of what I got. This isn't perfect. But I think, with saz leaves and carnations included, we will begin to get a workable, justifiable "Iznik palette". That said, my idea that there is one blue, one red, etc. is clearly not workable --- at least not by defining color clusters in some non-statistical way. But maybe this won't matter. For example, if you look at the two blue tulips below, they have different proportions of the several shades of blue --- but they are clearly related to each other relative to the others.
My idea is to group the palette into "sub-palettes" --- particularly frequent combinations of colors --- using topic analysis. I think this might work. We might do so by images.
We might also estimate a color "diversity" metric per motif --- to give us a sense of how complex the dots and stripes and things are.
I think that all of the artifacts, together, have 500-750 million pixels. That's quite a lot.
Progress! I have managed to extract RGB files for all 2000+ tulips (and can do the rest of the motifs). For the sake of tractability, I randomly sampled 10% of the pixels from each image (after having removed all the whites, (255,255,255)). This leaves me with about 4m pixels in a single data frame.
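In Python terms, the white-removal-then-sampling step might look like this (a NumPy sketch on a toy image; `sample_nonwhite` is a hypothetical helper, not code from this thread):

```python
import numpy as np

def sample_nonwhite(img, frac=0.10, seed=0):
    """Drop pure-white (255,255,255) background pixels, then keep a random
    fraction of what remains."""
    px = img.reshape(-1, 3)
    px = px[~np.all(px == 255, axis=1)]  # mask out the white ground
    rng = np.random.default_rng(seed)
    n = max(1, int(len(px) * frac))
    return px[rng.choice(len(px), size=n, replace=False)]

# toy "motif": left half blue on a white ground
img = np.full((10, 20, 3), 255, dtype=np.uint8)
img[:, :10] = (20, 30, 200)
sample = sample_nonwhite(img)
```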
I think I want a single clustering solution for all motifs --- one palette. But I don't want to do the artifacts themselves since the most common color is the background (ground) which is various shades of white-grey-cream and is a nuisance to get rid of, dominating the clustering solution.
I have switched to density-based clustering with noise (DBSCAN), as it seems to give me the most sensible color clusters --- better than Gaussian mixture modeling or k-means. It takes about 15 minutes on my computer to get a solution for 4m pixels.
Sounds great! I have read through what you have done so far and agree with your observations. I thought histograms might help, because they give you size-invariant features of color information, but as I said in another issue, I did not have time to play with them yet. I now have a better idea about what you are trying to achieve with the "palette". The total number of RGB values per motif seems to be the biggest problem so far. What you did makes sense (re: sampling). What I can suggest is including image quantization to make sampling more uniform. The idea is simple. We run k-means on each image, which presumably will group areas with similar colors. Then a weighted random sampling is done on the image based on the within-image clusters. This would help better represent every color group in an image, and provide a better sampling than randomly getting pixels without a priori looking at their colors.
Then we do the clustering for all motifs, just like what you did above.
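A minimal sketch of the weighted-sampling idea: assuming each pixel already carries a within-image cluster label from a prior per-image k-means, sample a capped number of pixels per cluster so small colour groups (red dots, say) are not swamped by large ones. The helper name and numbers are illustrative.

```python
import numpy as np

def stratified_sample(pixels, labels, per_cluster=50, seed=0):
    """Sample up to per_cluster pixels from each within-image colour cluster,
    so every colour group in the image is represented."""
    rng = np.random.default_rng(seed)
    out = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        take = min(per_cluster, len(idx))
        out.append(pixels[rng.choice(idx, size=take, replace=False)])
    return np.vstack(out)

# toy image: 900 blue pixels, 100 red pixels, labels from a prior k-means
pixels = np.vstack([np.tile([10, 10, 200], (900, 1)),
                    np.tile([200, 10, 10], (100, 1))])
labels = np.array([0] * 900 + [1] * 100)
sample = stratified_sample(pixels, labels)
```

Note that the rare red group contributes as many samples as the dominant blue one, which is the point of quantizing before sampling.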
Can you post here the final features you fed into DBSCAN? There are various clustering algorithms in Python that can be run directly off the shelf. I will do my own analysis, but can also run a clustering script on your features. Perhaps we may find a better clusterer.
Image quantization is a good idea. However, it may not be possible with the RGB data set that I have. That is because (perhaps foolishly) I figured my computer would not be able to handle the full dataset, that is, every pixel of every motif, so when getting the RGB values for the pixels I sampled a random 10%. The resulting data frame is ~10m rows x 8 cols; all pixels would be 100m rows --- a lot.
The 10% file, when compressed, is 78MB --- too big for GitHub evidently. I shall send it to you via mail drop and see if that works.
Currently I am having difficulty finding good clustering solutions for all the motifs. I am finding that there's a huge, dark cluster that just sucks up lots of other colors. It appears that it's continuous with the rest, so DBSCAN is having difficulty picking up discontinuities. I may need another strategy.
Meanwhile I shall extract ALL pixels --- 100m --- from all images.
I have tried many clustering approaches. The big problem is that the pixels are very continuously distributed in RGB space. The plot below (a random sample of 500k pixels) shows this.
DBSCAN does not do a good job since it likes discontinuities. Bright pixels tend to get grouped with a morass of dark pixels and you just get a few clusters of mud.
My best idea so far is a two-step solution.
First, cluster using k-means with k=500. This gives you many clusters of similar size. The problem is that many of the clusters have essentially the same values. But that doesn't matter: we're just using k-means to reduce the space.
Then, cluster the clusters (on their mean RGB values) using HCA. Cut the tree at, say, k=15. The result is a reasonable-looking palette: the many large "dark" clusters are reduced in number, while the fewer, smaller "bright" clusters, such as emerald green and red, are preserved. I have yet to test this approach against actual images.
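The two-step solution might be sketched with SciPy like this (toy colour blobs stand in for the real pixel cloud, and the k values are scaled down from the 500/15 used above):

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

def two_step_palette(pixels, k1=50, k2=4, seed=0):
    """Step 1: k-means with a large k to thin the colour space.
    Step 2: hierarchical clustering of the k-means centres, cut at k2."""
    centers, labels = kmeans2(pixels.astype(float), k1, minit='++', seed=seed)
    tree = linkage(centers, method='ward')
    merged = fcluster(tree, t=k2, criterion='maxclust')  # centre -> 1..k2
    return merged[labels]  # final palette cluster id per pixel

rng = np.random.default_rng(2)
# four colour blobs standing in for the real pixel cloud
blobs = [rng.normal(c, 5, (500, 3)) for c in
         [(20, 20, 200), (200, 20, 20), (20, 180, 20), (230, 230, 40)]]
final = two_step_palette(np.vstack(blobs))
```

The first pass only needs to tile the space; the second pass, working on a few hundred centre colours, is cheap enough to rerun with different cuts.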
This is close to your solution of clustering within motifs first. That's more complex and I haven't tried it yet.
Following the preceding, I have made some progress. To recap: I get all 100m pixels for all images. Then I do k-means with k=1000. And then I reduce the mean RGB values of these 1000 colors further with HCA, cutting the tree to get 30 colour clusters.
Importantly, I realized (I think this is right) that k-means does not use the density of points (unlike DBSCAN), only their position in feature space. So all I have to do is cluster every unique combination of RGB values --- that is, every distinct color in the dataset. There are about 1m of them (out of ~16.8m possible). So, having clustered them, I can unambiguously assign one of my final clusters to every one of the 100m pixels in the original dataset. There is then no need for subsampling within motifs of the sort you proposed.
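The unique-colours trick is essentially one NumPy call: cluster only the distinct colours, then broadcast the labels back to every pixel with `return_inverse`. Toy data below; the identity "clustering" stands in for the real k-means + HCA step.

```python
import numpy as np

# five pixels but only two distinct colours
pixels = np.array([[0, 0, 255], [255, 0, 0], [0, 0, 255],
                   [255, 0, 0], [0, 0, 255]])
colours, inverse = np.unique(pixels, axis=0, return_inverse=True)
inverse = inverse.reshape(-1)  # guard against numpy versions that keep dims

# stand-in for clustering the unique colours (identity labelling here;
# in the real pipeline this would be the k-means + HCA cluster id per colour)
colour_labels = np.arange(len(colours))

# every pixel inherits the label of its unique colour -- no subsampling needed
pixel_labels = colour_labels[inverse]
```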
Having done that, we can check whether the inferred palette of each motif makes sense in terms of the original image. Here I plot the histogram palettes for 20 images next to the images themselves. (If I were smarter I would be able to plot each image inside its histogram --- actually, if R's plotting packages were smarter, since I have tried hard to do this; I think it's possible, but very complicated.) Anyway, the results are not too bad! Perhaps a final palette of 30 colors is too many --- but as you reduce them you start to lose what I think is real diversity in colour among the images. Perhaps 20 colors would work better.
Anyway, what do you think? My idea is that these 30-colour-palette distributions would be the basis for clustering the tulips themselves into "color-styles". Then we might find that "blue-red" tulips group together (think emoji). In fact there is a blue-red tulip in this set: most of the blue is dark purple and goes to a dark colour.
You might complain that I have not justified my choice of K except by looking at the result. That's true. But I don't think that looking at silhouette scores or the like is going to be very useful (and, even with some of the juiced-up big-data algorithms I'm using, prohibitively tedious). We just have to accept that we are not discovering "true" clusters, but merely quantizing the colour space in a way that allows us to speak readily about the colors of particular motifs.
If you think that you can do a better job on this, please feel free --- I'll happily abandon my solution. But I think I am getting close to a color-palette solution.
Good progress. Two points that confuse me:
First, I have checked the file you sent and realized that it is a data frame of pixels as data points, not motifs. So what you have essentially been doing all along is pixel clustering, rather than motif clustering? In my mind, I thought each motif was being represented by a fixed number of R, G, B values, roughly accounting for 10% of all its pixels. Obviously one motif might have N pixels while another might have M, so the maximum number of pixels to be sampled should be fixed a priori (ideally to the size of the smallest motif) to avoid a dimensionality mismatch across motifs. The image quantisation would only then make sense.
Second, when you do clustering, do you only use "red", "green", "blue" as your features, or do you also include "x" and "y"? If the latter, the clusters shown in the picture might be due to the spatial locations of pixels dominating the clustering against the color features. Plus, if there was no normalisation, that might also have affected the resulting groups.
Having checked the pdf, I would say I am impressed with how well the histograms match the images. While we may change or keep the clustering method above, the end goal is now clearer in my head and the palette solution just makes sense. I agree that "finding the right K" is not our task, and that we should aim for a K as large as possible, to be able to represent every single color group in the dataset.
I cannot claim that I will do a better job, but I will at least try and come back to you with my own results if they make sense to me first.
That's correct: first pixel clustering to get a "palette", then, having got a reduced palette for each motif, I will cluster motifs. The idea is that, by using a standard palette across all motifs, I will be able to better group motifs that are "blue and red" even if the actual RGB values for the "blue" and "red" bits are quite different. Whether this actually works remains to be seen (I am trying it now on tulips).
I do not use the "x" and "y" -- only the RGB features.
This question of normalization when clustering pixels is interesting. I have still found cases where two motifs --- two carnations --- have very similar areas of "green" and "red" but, even after applying my palette, they still have different colors --- since my palette has several shades of "green" and "red". That's because one carnation is just generally darker than the other. So when I cluster the carnations, I'll bet I'll find dark carnations and lighter carnations. Of course I can use a smaller palette --- but in my experience I just turn all my colors to mud.
Perhaps such groups --- lighter and darker carnations --- are real. More likely they're an artifact of photography / dirt etc. If the latter it seems to me that we first want to "normalize" the intensities in some manner. Is this "histogram equalization"? Your thinking would be appreciated. (I feel I am reinventing bog-standard image processing techniques).
Of course, when clustering motifs I will normalize by the total number of (non-white) pixels, so that I cluster motifs on the proportions of their palette colours.
This is an initial look at ordinating the tulip motifs (about 4k) by tSNE. This is based on the frequencies of the colors in the 30-palette solution.
The points are colored by their most frequent colour. You can see that it does a pretty good job of grouping the motifs by that criterion. There's a reddish cluster, yellowish etc, and a big blue cluster.
But I have also plotted the "blue-reds". They are the large triangles. You can see that they fall in the big blueish cluster. But they don't group together. And, frankly, I don't think that any amount of hammering at the colour clustering will make them do so. They have fundamentally different shades of blue (or dark grey) which dominate their position. This isn't an artifact. But they still have something in common that others don't --- those red dots against a blueish background. How to pull them together --- or even whether we should --- is an interesting question. Perhaps it means introducing a derived variable such as "polychrome".
Just a quick answer regarding normalisation: what I meant was to subtract a column's (say red's) mean from all its values and then divide by their standard deviation, which will effectively bring all features to zero mean and unit variance. This is a well-known trick for getting better clustering/classification, but it is only really useful when we have many features in various ranges. Since in our case we only have three features, all of which have the same range, normalisation may not have a huge impact.
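For reference, the standardisation described is just this (a sketch on toy RGB rows):

```python
import numpy as np

def standardise(X):
    """Zero mean, unit variance per feature (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rgb = np.array([[10., 200., 30.], [20., 220., 50.], [30., 240., 70.]])
z = standardise(rgb)
```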
Histogram equalization might be useful, especially with artificially darker/brighter motifs (i.e. got affected by poor image capturing), since it will adjust image intensities. You may want to try this out, especially with motifs of the same color but different shades as a last effort to see if it will help.
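A bare-bones version of histogram equalisation for one 8-bit channel, via the cumulative distribution function (a NumPy sketch; a real pipeline would likely use a library routine):

```python
import numpy as np

def equalise(channel):
    """Classic histogram equalisation of one 8-bit channel: map each intensity
    through the normalised cumulative histogram (CDF)."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalise to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)         # lookup table
    return lut[channel]

# a "dark" channel squeezed into [0, 60) spreads out over the full range
dark = np.random.default_rng(0).integers(0, 60, (32, 32)).astype(np.uint8)
bright = equalise(dark)
```

Applied per channel (or, better, to the lightness channel of an HSL/Lab conversion), this would counteract uniformly dark or bright photography before clustering.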
So the one useful thing that I have done this week is to invent a measure of color diversity within motifs. This was motivated by the fact that all my clustering solutions of motifs (see above) are dominated by the most frequent colour: if a tulip is basically blue it goes with blue tulips regardless of whether or not it has red spots. This is actually understandable: the red spots are rather small.
Anyway, in an effort to compensate for this, I did the following. For each motif, I estimated the median Euclidean distance in RGB space between all pairs of pixels. The idea is that in a motif that is basically one colour (shades of blue) this metric will be small, but in a multicolor motif it will be larger, since some of the distances --- as between a blue and a red pixel --- will be large.
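The diversity measure might be sketched as follows (median over all pixel pairs, computed on a subsample to keep the pairwise distance matrix small; the two motifs are toy data):

```python
import numpy as np

def colour_diversity(pixels, n=500, seed=0):
    """Median pairwise Euclidean distance in RGB space, on a random
    subsample of at most n pixels to keep the n^2 matrix manageable."""
    rng = np.random.default_rng(seed)
    if len(pixels) > n:
        pixels = pixels[rng.choice(len(pixels), size=n, replace=False)]
    diff = pixels[:, None, :].astype(float) - pixels[None, :, :].astype(float)
    dist = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(pixels), k=1)  # each pair once, no self-distances
    return float(np.median(dist[iu]))

mono = np.tile([10, 10, 200], (300, 1))                           # one blue
poly = np.vstack([mono[:150], np.tile([200, 10, 10], (150, 1))])  # blue + red
```

On these toy motifs the monochrome one scores zero and the blue+red one scores large, which is the separation wanted.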
The results are pretty promising, though I've only looked at the extremes of the distribution. (My computer collapses if I try to print more than 8 or 9 motifs --- the image data gets loaded into a single data frame and I don't know how to reduce it). examine_color_distances.pdf
What do you think about this?
I have spent much of the last two days on the color data. I will lay my notes here before I forget them. Some results will follow throughout the day, depending on my schedule.
First of all, most of the insight I gained is complementary to yours. There is nothing much new. In the end I had a bit of a reinventing-the-wheel feeling, but I do not at all regret the time I have spent so far. I should say how fascinated I am by the complexity of the color data; I did not realise this before getting my hands dirty. What you have been telling me on this page now makes a lot more sense.
I sat down at my computer with the plan of applying some clustering algorithms to the RGB values you sent, but I ended up conducting some messy analysis from scratch. I first needed to see why histograms won't work and why single-level clustering is a bad idea. I also wanted to try some standard data normalisation and cluster-number predefinition in action.
The following notes might feel like jumping from one branch to another, but this is basically what I did pretty much while trying to build a working end-to-end pipeline. I cannot say I have achieved one yet, but work is still in progress.
1 - I ended up noticing how hard it is to cluster RGB values, given that different pixels can be the same distance apart in Euclidean space. A transformation was therefore a must. I used rgb_to_hex and rgb_to_hsl conversions --- respectively for sorting and for clustering pixels --- converting back to RGB just before visualising the palettes.
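In Python both conversions are a few lines over the standard library's `colorsys` (which returns HLS, not HSL, order --- an easy trap); the wrappers here are illustrative, not the actual code used:

```python
import colorsys

def rgb_to_hsl(r, g, b):
    """Scale 8-bit RGB to [0, 1], convert, and reorder colorsys's HLS to HSL."""
    h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
    return h, s, l

def rgb_to_hex(r, g, b):
    """8-bit RGB triple to an uppercase hex string, handy for sorting."""
    return "#{:02X}{:02X}{:02X}".format(r, g, b)

h, s, l = rgb_to_hsl(0, 0, 255)  # pure blue: hue 2/3, full saturation
```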
2 - I wanted to do motif-level clustering, not pixel-level, because I still believe histograms should be able to subdivide the motif space. The problem of not knowing the number of clusters struck here, but I soon realized that no matter how many clusters the division is done into, we need a second-level clustering within each group, because there are always motifs with different patterns (stripes, dots, etc.) assigned to a cluster because their color distribution is similar to the rest. Here, picking a large number would be sufficient to move on to the next level. K-means was OK in my endeavour, and I don't see any necessity for a fancier clusterer, because all we need is an initial, basic grouping.
3 - To make things easier at the beginning, I assumed that each image in the dataset is a singleton --- for instance, that the motifs in AM01 are a cluster. Long story short, I ended up first clustering the histograms into 8-16 distinct colors, most of which represent shades of the same color. At the next level, I used only the center RGB value of each group and, after converting them to HSV space, used hierarchical clustering to group the shades of certain colors into 4-5 new clusters. Had I applied clustering directly to the image to reach those 4-5 clusters, I would have ended up with a palette dominated by the relatively large colors, potentially merging distinct ones into one group, as can be seen below.
4 - A separate note on the definition of a "palette". I always thought it would be a group of distinct RGB values, and, as a rule of thumb, noticed that 3-4 colors for a single motif, 6-8 for an image, and 16-20 for an image cluster would be sufficient for a unique representation. I think you believe 30 clusters would be enough to subdivide the space, but how many colors do you think should be used to represent each individual cluster?
5 - After convincing myself that multi-level clustering is a must, I am now running the clustering on all images. Initial results showed that there are always outliers in each cluster that need extra attention, but, due to the hierarchical manner of separating colors, we end up representing them in their cluster palettes. I think separating these outliers from a cluster is the most important problem now, and I will see what I can do about it next.
I need to stop here, but I will try to keep posting as long as I have time. Please feel free to provide any feedback.
One more thing I forgot to add, regarding color distances within segments. I believe it might work, and also be useful for the last point I mentioned above. It seems to separate mono-color and multi-color motifs that are from the same cluster (can you please confirm?).
Re the colour distances:
Yes, that's exactly right. It's clumsy, but my idea was that if I have a cluster of segments that is basically blue (because they're dominated by blue), then at least I could separate them into "monochrome" and "polychrome" blue segments. So I still have that information. And my "mono-polychrome" index is indeed bi- or tri-modal (basically monochrome, that is, made up of very similar colors in RGB space; a bit polychrome; very polychrome). It kinda works. I suppose the same logic could be applied in HSL space too.
But it's a bit clumsy. All my colour clustering of segments has yielded semi-sensible results. But in each case I find that, within some image (say AM04) which has several blue and green tulips, at least one or two tulips end up in different segment clusters. That's because my reduced palette crunches them to a subtly different shade of blue (royal blue, or midnight blue, or whatever). Sometimes you can see why (the artist applied the paint lightly, so it's a lighter blue) --- but often the visual difference is so subtle that you think "that's not good."
OK --- what you're doing sounds great! But, to be clear, you're currently just aiming at getting a better palette for each of the segments? You're not clustering the segments themselves --- our ultimate aim.
The reason I ask is that I am trying something a bit exotic: "biclustering". Do you know it? Briefly, this is clustering on both rows and columns. It allows you to pull out subsets of observations that have associated features.
For example, imagine we have segments that are blue+yellow and blue+green, and the rest are other colors. In regular clustering (in which all observations are clustered simultaneously) it would be hard to find what blue is associated with --- since it's associated with two other colors. In biclustering it should be possible to pull out a blue+yellow and a blue+green cluster. Something like that. It's much used in gene-expression studies, where any one gene can be associated with different sets of other genes in multiple pathways. (Forgive me if you know all this --- it's new to me.)
I'll continue to work on that with my 30 palette dataset since if and when I can figure it out, any data --- any palette --- can go into it.
OK --- I am going to stop working on clustering until I see what you come up with w.r.t. the colour palette! The biclustering packages are not giving me what I want and are very difficult to work with. (They're giving me clusters of segments that are unified by the absence of colors!) Gah.
So, I now better understand why you are first finding a palette within an image. We can presume, in general, that all "blues" in a given image --- regardless of which segment/motif they are in --- are "the same", even if, pixel by pixel, they are immensely diverse. Thus, if you assume that every image has only 6-8 colors, and assign them to each segment/motif as appropriate, then at least you can guarantee that all segment/motifs of an image will have the same limited set of colors. That will ensure that, when we cluster segment/motifs into larger groups, at least those in the same image will be in the same cluster. Currently not the case!
And THEN, you'll cluster the colors of all the images again and reduce the palette further. So "royal blue" from one image and "midnightblue" from another become just a single "blue". Clever!
You said: "I think you believe 30 clusters would be enough to subdivide the space, but how many colors do you think should be used to represent each individual cluster?"
I am not sure that I can answer this question; I am not sure I understand it. The only information that we have is that the traditional scholars will tell you that the Iznik artisans used a very limited palette: blue, red, etc. --- maybe 7 colors. They will sometimes say that "emerald green" and "turquoise" were added later. But I think that they have a very insensitive, qualitative view of the Iznik palette. I think it's much more diverse than that. I thought 30 colors. But the problem was that my palette was unbalanced --- many shades of blue and a single muddy red (if I was lucky).
To return to your question: some (many?) segment/motif clusters will be monochrome. Remember, we have some essentially monochrome images (all blue). And there are lots of small, whitish, tulips and small, greenish, saz leaves. But a minority of segment/motif clusters will be polychrome.
Does that help?
Thanks for the insights in general. I still don't have anything to show today, but I would like to clarify some things re: your latest posts:
1 - By "segments" you refer to the ultimate clusters of motifs, is this correct? Given the current progress we have made, we are aiming for 30 segments, hence 30 clusters, hence a palette of 30 entities? I preferred to use "entity" instead of "color", because what I see as an entity in a palette is a set of N distinct colors, where I think N should be around 6-8. Forgive me if I sound arrogant and have just given a stupid definition of a palette, but I am trying to think of it merely as a set of features, each of which consists of some colours and uniquely represents a "style".
2 - Quoting from you:
OK --- what you're doing sounds great! But, to be clear, you're currently just aiming at getting a better palette for each of the segments? You're not clustering the segments themselves --- our ultimate aim.
The idea is first to cluster the dataset (the motifs), then for each cluster find a set of colors that represents all its members, and finally further reduce this set of colors to a distinct "entity" (see above for what I mean here). So clustering is still the ultimate aim. But the mono/poly-chrome motifs will constitute a problem, so at this stage we can separate them from their respective clusters (using your diversity metric) and try to find them either a new cluster or append them to an already existing (but different from their original) cluster somehow. (I am just thinking out loud here --- not sure how to achieve it, but it should be possible.)
3 - Biclustering: I knew the concept but nothing more. It could be a bit complicated to apply to color data, but if we fail in the end by other means, we can give it a go.
4 - Quoting again:
So, I now better understand why you are first finding a palette within an image. We can presume, in general, that all "blues" in a given image --- regardless of which segment/motif they are in --- are "the same", even if, pixel by pixel, they are immensely diverse. Thus, if you assume that every image has only 6-8 colors, and assign them to each segment/motif as appropriate, then at least you can guarantee that all segment/motifs of an image will have the same limited set of colors. That will ensure that, when we cluster segment/motifs into larger groups, at least those in the same image will be in the same cluster. Currently not the case!
That was my initial plan (a bottom-up clustering from motifs to images and then to clusters), but the problem is that we have to find a way to directly use those 6-8 colors per image in a clustering setup. I started with a single image and gave you some results from that experiment, because I used it as a proof of concept. Still, the currently considered top-down procedure is highly promising. I will try to get more insight on this, but I am glad we are on the same page.
I fear we are running into a little linguistic confusion. Just to clarify: I use "segment" and "motif" interchangeably --- as I thought you did --- to designate a single, unique, ornamental thing: one tulip, e.g., AM03_000_t. We have about 2000 unique tulips; our task is to classify them in a rational fashion, that is, find clusters. So
By "segments" you refer to the ultimate clusters of motifs, is this correct? Given the current progress we have made, we are aiming for 30 segments, hence 30 clusters, hence a palette of 30 entities?
seems to misunderstand in several ways. I have no idea how many clusters of tulips there are: more than 10, surely; fewer than 200, perhaps. How do we get them? The main information will be colour. So how do we do this?
we are aiming for 30 segments, hence 30 clusters, hence a palette of 30 entities? I preferred to use entity instead of color, because what I see as an entity in a palette is a set of N distinct colors, in which I think N should be around 6-8.
This seems a complicated way of thinking about it. By "palette" I mean a simple, flat vector of unique colors. I think it works like this: having got, for each image, a palette of 6-8 colors, you combine them (find the union) across all images. Maybe you'll find that this union of image palettes has 100 colors or so. That's too many: too many shades of blue, each found in different images. So, hierarchically cluster again: reduce them to, say, 30 colors --- your final, unique palette: a few real shades of blue, green, etc. This palette represents our vision of what the Iznik artists actually used; it captures most of the colour variation among all the motifs. What then?
but the problem is, we have to find a way to directly use those 6-8 colors per image in a clustering setup.
That seems easy to me. But I can only explain it in terms of pixels.
1) In tulip x we find an RGB(0,76,153) pixel --- dark blue. In tulip y we find an RGB(51,51,255) pixel --- another blue. By successive rounds of clustering, within images (twice) and then among images, we assign them both to one colour: RGB(0,0,255), blue. In this manner, all pixels in all images are assigned to one of the 30 colors in our final palette. I imagine that all the tulips shown here will get just two colors: a single dark blue and a red. (Or maybe two blues?)
2) Then we determine the frequency of each colour in each motif. In tulip x, 0.99 of the pixels are blue; in y, 0.05 are. So tulip x is nearly all blue, while y has just a spot of blue. We cluster on those frequencies --- easy to do now, since our data are just a 30x2000 matrix and the colors are nicely defined. Chances are that tulips x and y will end up in different clusters. So exactly like the data given in this image --- just better, via your multi-level clustering.
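Step 2) might look like this on the palette-frequency matrix (a toy 4-motif, 3-colour version of the 30x2000 matrix; Ward linkage is one reasonable choice, not the settled method):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy palette-frequency rows: share of each of 3 palette colours per motif,
# already normalised by each motif's non-white pixel count
freqs = np.array([
    [0.99, 0.01, 0.00],   # tulip x: nearly all blue
    [0.95, 0.05, 0.00],   # blue with a red spot
    [0.05, 0.90, 0.05],   # nearly all red
    [0.10, 0.85, 0.05],   # red with some blue
])
clusters = fcluster(linkage(freqs, method='ward'), t=2, criterion='maxclust')
```

On this toy matrix the two blue-dominated motifs land in one cluster and the two red-dominated ones in the other.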
As an added bonus, I'll bet that we can dispense with my "colour distance" measure. The blue-redspot tulips will end up together as a subcluster of the blue-tulip cluster, since we've simplified all those shades of blue. Nice!
Perhaps you see, now, why I wanted to work with pixels: I can understand how to use them to get color-frequency information for each of our motifs. I don't know how to do that using histograms! It seems to me that this information is lost. But I concede that I may be wrong since I don't really understand histograms.
Of course, if you can't use histograms, you can always apply your multi-level clustering method to the pixels. The key is your brilliant insight: bottom-up, multi-level, rather than top-down multi-level --- which is what I attempted, and which gives results that are not terrible but, frankly, not good enough.
One question I have is: how do you determine how many colors to have per image? For some 6-8 will be good, but others are effectively monochrome (all blue). And for such images you don't want 6-8 shades of blue.
One solution is something like this: for any image, keep clustering, in successive rounds, until no two colors are more similar to each other than some distance in HSV space. What distance? That I am not sure of: something like the 10th percentile of the total possible space. But you'd have to try some rule. I am sure you have thought of this.
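That stopping rule can be implemented directly as single-linkage clustering cut at a distance threshold: afterwards, any two colours in different groups are guaranteed to be more than the threshold apart. Sketched here in RGB for brevity (the text suggests HSV); the threshold and colours are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_similar(colours, min_dist=30.0):
    """Merge colours until no two groups contain members closer than
    min_dist; return one representative (mean) colour per group."""
    tree = linkage(colours.astype(float), method='single')
    groups = fcluster(tree, t=min_dist, criterion='distance')
    return np.array([colours[groups == g].mean(0) for g in np.unique(groups)])

colours = np.array([[0, 0, 250], [0, 0, 240], [0, 0, 245],  # near-identical blues
                    [200, 0, 0]])                            # one distinct red
palette = merge_similar(colours)
```

A monochrome image then collapses naturally to one or two entries rather than 6-8 forced shades.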
I fear we are running into a little linguistic confusion. Just to clarify: I use "segment" and "motif" interchangeably --- as I thought you did --- to designate a single, unique, ornamental thing: one tulip, e.g., AM03_000_t. We have about 2000 unique tulips; our task is to classify them in a rational fashion, that is, find clusters. So
Ah my bad, I don't know what I was thinking. Of course, segment = motif (In brain parcellation (my PhD research), we refer to parcels as segments, which constitute clusters of voxels. That must be a little trick played by my mind.)
For the rest of your explanation, thank you! It makes more sense now! Apologies for not coming back to you until now, it was just one of those busy weeks. I hope to work on the clustering on the weekend and get some results.
One question I have is: how do you determine how many colors to have per image? For some, 6-8 will be good, but others are effectively monochrome (all blue), and for such images you don't want 6-8 shades of blue.
Can't say for now; it was just my gut feeling. What we can hope is that the monochrome images, all of which will have different shades of the same color, will be merged into a larger group when we do the bottom-up clustering. I will think about it more as I go through the clustering pipeline. What you suggest can also be adapted, but again I cannot say much about it before seeing it in action.
I asked Melanie Gibson, our Iznik expert, how many colors, in her opinion, we should aim for. She looked at the range of colors that characterizes images attributed to different dates. Summarizing, she suggests:
black-cobalt blue; bright-cobalt blue; lilac blue; turquoise; sage green, emerald green, a turquoise green; purple black; orange red, several kinds of red
She points out that each of these comes in several tones. But I think these are primarily intra-image variation. I attach her full analysis: Iznik colours.pdf
That's actually extremely valuable information. However, it is also a very hard target to achieve through unsupervised clustering --- and if we know the ultimate palette at least roughly, as in this case, why bother clustering all the data at all? We can simply define color descriptors per image/motif through some sort of basic clustering and then find the closest color(s) to each descriptor by referring to the Iznik colors table.
Can you or Melanie define the RGB ranges of the above colors? We can also use it (somehow) to validate the final palette obtained through our clustering approach.
Having said that, I have made some progress with the bottom-up technique. The method in general seems to work, does not rely heavily on the definition of the number of clusters (at any stage), and gives nice hierarchical palette representations (at least to me). I started at the motif level and first represented each motif with 8 colors, which is more than enough to cover all distinct colors in a motif; indeed it is usually too many (but this is irrelevant, because different shades are merged as we progress). Second, we reduce the initial 8-color representation to 4 and append each 4-color vector to a list, which constitutes our image-based color map. These are saved in first_58_original_colors in three ways: original, rgb-sorted, and hsv-sorted. I prefer to look at rgb-sorted, because it gives me an idea of how often a color is repeated with different shades in a map. The size of the map differs greatly from image to image, mostly due to the varying number of motifs per image. For instance, BM05 is represented with only 8 colors while BEN05 has more than 300, mostly shades of blue and gray; this is again not a problem, because most of these shades are merged into single clusters, thanks to the power of hierarchical clustering.
Color map of BEN05 (with 300 colors)
Reduced color map (here only 4 colors shown, but for the clustering I am using 12, to ensure all colors are included)
The last stage is to combine all reduced color maps into our final Iznik palette. To show you again how hierarchical clustering works at this last stage, I am sending palettes from 8 to 32 colors in first_58_final_clustering_palettes.
Looking at the 20-color palette, for instance, we have captured different shades of blue, a few (bluish) greens, some reds/oranges, some gray tones, a light purple, and some browns. It is important to note that these are based on only 58 images (all tulips), due to an error I had during processing. I will fix it and run it for all images, but in the meantime it would be great to get your feedback.
I do not think that Melanie's information eliminates the need for an unsupervised estimation of the colour palette. Bear in mind, her view is conditioned by a tradition of qualitative scholarship which, while certainly insightful, may be incomplete. To put it another way: I suspect that, where they have discovered important colour variation, those variants really exist (e.g., between emerald and sage green); however, they may well have missed important colour differences that we can discover quantitatively (e.g., shades of blue?). So we should view her analysis as a guide rather than a "target".
I can't immediately get RGB values since she did not tie particular images or motifs to her colors. But I can ask her to do so (she is, however, traveling and that may take a while).
Looking at the palettes you sent, my inclination is to prefer the high-N palettes (>27). Only there, for example, does a bright green pop out (Melanie's "emerald green"?). But what surprises me is that we then also get a proliferation of dark colors that are visually indistinguishable (marked with circles below). Yet I thought that this is exactly what HCA would eliminate.
Is this just telling me that human vision is a poor guide to distance in RGB space?
I do not think that Melanie's information eliminates the need for an unsupervised estimation of the colour palette. Bear in mind, her view is conditioned by a tradition of qualitative scholarship which, while certainly insightful, may be incomplete. To put it another way: I suspect that, where they have discovered important colour variation, those variants really exist (e.g., between emerald and sage green); however, they may well have missed important colour differences that we can discover quantitatively (e.g., shades of blue?). So we should view her analysis as a guide rather than a "target".
I see your point. That makes more sense now. Let's, as you said, keep it as a guide on the side.
Looking at the palettes you sent, my inclination is to prefer the high-N palettes (>27). Only there, for example, does a bright green pop out (Melanie's "emerald green"?). But what surprises me is that we then also get a proliferation of dark colors that are visually indistinguishable (marked with circles below). Yet I thought that this is exactly what HCA would eliminate.
I am afraid that, without over-engineering, it is hard to propagate the distinct colors appearing in early stages to the upper levels of the hierarchy. For instance, green is mostly merged with blues (because they are close in the 3D space), and hence does not survive as a "clear" green. Similarly, the circled dark colors are possibly combinations of some early-stage shades of blue. Compared to clustering the entire space directly into N colors (single-level clustering), the multi-level HCA definitely gives a more accurate partition and preserves most of the distinct colors seen in various motifs, yet it is not powerful enough for a scholar-level palette definition. Also, there is the problem with human vision that you pointed out. Here is what the above palette (n=32) looks like when the colors are not sorted.
Still, the dark colors seem similar when looked at carefully, but in general it gives the feeling of a more distinct separation.
Well, frankly, in that case I think let's go for a 32-or-so colour palette that still has some bright colors. Then, once we've got the frequencies of the colors in the individual motifs, we can cluster them on those frequencies and see what we've got. I'm actually quite optimistic that we'll find large numbers of motifs, from different images, clustering together neatly. And bear in mind, even if there are two clusters of "dark-cobalt-blue tulips", where the dark cobalt has subtly different values, they should group together at a higher level in an HCA of motifs. The whole point is that the language of colour variation is impoverished relative to the numbers.
I totally agree. This will lead us somewhere and based on my research in the last days, I have now two promising solutions to try out over the week.
1- The main problem is that how we perceive colors is not reflected by how colors lie in RGB space. I always thought conversions to HSL or HSV would be enough, but it turns out there is an even better color space called CIELAB. Quoting from Wikipedia: "It expresses color as three values: L* for the lightness from black (0) to white (100), a* from green (−) to red (+), and b* from blue (−) to yellow (+). CIELAB was designed so that the same amount of numerical change in these values corresponds to roughly the same amount of visually perceived change." Using the CIELAB space we can get a single value, deltaE, to represent the distance between two colors. See the wiki for more. I believe moving everything to LAB and using deltaE would give us the "bright" colors we were hoping to get.
2- We can make the HCA semi-supervised by incorporating domain knowledge. If we can define ranges of colors based on Melanie's list, we can use this information to impose constraints on the relationships between colors prior to applying hierarchical clustering. Say we tell the algorithm (through a binary connectivity or affinity matrix representing their pairwise relationships) to combine two colors into one only if they are "mergeable" (i.e. their entry in the affinity matrix is 1). In this case we will avoid contaminating a cluster with other colors and can potentially achieve separate clusters for blue and green, instead of a single cluster of greenish blues.
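As an illustration of point 1, here is a sketch of a LAB/deltaE comparison using scikit-image; the two blue RGB triples are the example pixels quoted earlier in the thread, and the red is invented:

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_cie76

# The two blue pixels quoted earlier in the thread, plus an invented red,
# as 1x1 RGB "images" with channel values scaled to 0-1.
dark_blue  = np.array([[[0, 76, 153]]]) / 255.0
other_blue = np.array([[[51, 51, 255]]]) / 255.0
red        = np.array([[[200, 30, 40]]]) / 255.0

lab_dark, lab_other, lab_red = (rgb2lab(c) for c in (dark_blue, other_blue, red))

# deltaE_cie76 is plain Euclidean distance in LAB; unlike RGB distance,
# it roughly tracks perceived colour difference.
print(deltaE_cie76(lab_dark, lab_other)[0, 0])  # two blues: smaller
print(deltaE_cie76(lab_dark, lab_red)[0, 0])    # blue vs red: larger
```

Clustering in LAB with deltaE as the metric should therefore keep perceptually distinct colours (like the bright green) from being absorbed by numerically nearby blues.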
That sounds great. I will be fascinated to see how you implement semi-supervised HCA, which is quite new to me.
Melanie has identified images with particular colors she thinks are important. In each image I identified those colors from her description, and got a single rgb value for each. There are 18 colors. Note, she does not pay attention to "ground" (background) whitish colors. Here are the colors. Note some colors (e.g., blue_1: bright cobalt blue) appear in many images; others (e.g., green_5: pale green) appear in very few.
Further details are here canonical_colour_values.pdf
You can see my version of Melanie's raw data in the Iznik GitHub code repository in a datafile called "canonical_colour_values_Iznick.csv"
And the summaries of those colors (median rgb values and their hex codes) in "summary_canonical_colours.csv"
Looks great! I have also obtained a new set of palettes at different resolutions using the LAB space. The colors are now much brighter and we can identify new ones, such as different shades of greens and yellows. I am sending you the results acquired with the same setup as before, so you can compare RGB- vs LAB-based palettes. Also see the 18- and 32-color ones below. Please note that these are not constrained by the above canonical color values and are purely for your judgement on the different color spaces.
Very nice! A vast improvement. I wonder if we will need the canonical colors at all. Perhaps they are better reserved as an independent check (in some way)?
Sure. We can identify both constrained and unconstrained palettes and see how much difference exists.
@Armand1 , hope you are doing well. I have finally had some spare time to look into the "constrained" palette and converted your expert palette into numbers for reference purposes.
As discussed above, I implemented the constrained clustering algorithm by creating a graph that connects colors whose nearest neighbor, amongst the 18 colors provided by the expert (and converted by you into RGB values), is the same.
I have also changed the criterion in hierarchical clustering for combining two colors into one (a technical detail which in the end gave closer results to the expert palette).
Please see the results in different representations for better perception (un-sorted, and sorted based on R, G, and B channels). I have stopped clustering at n_clusters = 18 to be aligned with the number of colors in the expert palette.
Original output and expert palette:
Sorted:
I am rather surprised how close we can get to the expert palette even without imposing constraints.
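For what it's worth, the constrained scheme described above (two colors are mergeable only if they share the same nearest expert color) might be sketched like this; the two expert anchors and the four sample colors are invented for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import cKDTree
from sklearn.cluster import AgglomerativeClustering

# Two invented expert anchor colours (RGB scaled to 0-1).
expert = np.array([[0.1, 0.2, 0.8],   # "blue"
                   [0.1, 0.7, 0.2]])  # "green"

# Invented colours to cluster: two blues and two greens.
colors = np.array([[0.10, 0.25, 0.85],
                   [0.15, 0.30, 0.80],
                   [0.10, 0.60, 0.35],
                   [0.12, 0.65, 0.30]])

# Two colours are "mergeable" only if their nearest expert colour is the same.
nearest = cKDTree(expert).query(colors)[1]
connectivity = csr_matrix((nearest[:, None] == nearest[None, :]).astype(int))

# The connectivity matrix forbids merges across expert groups.
agg = AgglomerativeClustering(n_clusters=2, connectivity=connectivity).fit(colors)
print(agg.labels_)  # blues share one label, greens the other
```

With the constraint in place, greens can never be contaminated by nearby blues, whatever the raw distances say.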
Good to hear from you! Fascinating! There's a philosophical question that we face now: do we want to use the expert-constrained palette or some other?
On the one hand, I like the idea of using an expert-constrained palette.
1) It may make our approach palatable to other experts (I think, assuming they buy our expert's assessment!). 2) It stands a better chance of discovering traditional relationships.
On the other, using it
1) diminishes the purity of our "let the data speak" rhetoric.
2) forbids us from using the expert palette as independent verification for our "let the data speak" approach
3) means we won't discover subtle relationships that our expert does not know.
What do you think? Is there room for using two palettes? One: an expert constrained palette of 16 colors and Two: an unconstrained palette of, say, 30 colours. I suggest this with reluctance. First, you have to generate two sets of data with two palettes. I suspect that's not so difficult for you. But it means that I have to do all downstream analysis twice... And I don't yet know how hard that will be. Still, it's a thought...
From a technical point of view, it is just a matter of changing a parameter to obtain a constrained or unconstrained palette; on the other hand, as you pointed out, this decision might shape how we would like to "sell" this work.
If this was a technical manuscript, I would suggest having the unconstrained one as our main palette but discuss its similarity to the expert-constrained and the expert palette when writing down the findings/results. However, I am super ignorant regarding how to shape a paper around this as I have never been involved in such a high level study. I would just follow your lead and whatever you think makes more sense, let's go with that.
I have yet to clean the clustering code and run it for the other motifs. Once that is done, it will be as easy as clicking a button for me to generate a palette with as many colors as desired, constrained or otherwise.
Well, if it is so easy, then let's do three things.
1) expert constrained 16 colour 2) unconstrained 16 colour 3) unconstrained 30 colour
That would enable us to explore what they tell us. Comparison of the first two lets us assess our method relative to experts, and even incorporate experts. The last enables us to potentially discover new things that experts don't know. We don't have to do everything in all three ways. But once we've got them, we've got them and can choose. How does that sound?
Sounds like a good plan to me!
Dear Armand,
I have finally finished cleaning the code and re-run the experiment from scratch for each of the three motifs and for a combination of all. The palettes cover the three cases mentioned above. I did not know which 2 colors to remove from the expert palette, which has 18 colors for the time being, so I have kept the lower resolution at 18 rather than 16. If you tell me which colors can be omitted, I can rerun the experiments in a minute.
There are both sorted and unsorted versions, in case you find the former easier for comparison. Let me know how you feel about this or if you notice anything weird in the palettes.
Dear Salim,
Thank you very much for this.
So, I thought that there were 16 expert colors, but evidently I was wrong — if there are 18 let’s just stick with that.
Looking at the palettes, I have to say I find it very hard to judge which to use. I do believe that we should, for simplicity, use a single palette for all motifs.
I would also suggest that we get the data — if it's easy — for 3 palettes: 18 all constrained, 18 all unconstrained, 30 unconstrained.
I think I will be inclined to focus analysis on one, and then try to show that alternate palettes don’t yield substantially different results.
So, the question is: now what? I think you want to give me, for each motif, the frequencies of each color in each palette. Do I understand that correctly? I think that basically comes down to 3 csv files — one for each palette, motifs in rows, colours in columns, and frequencies as values.
Does that make sense?
best and thanks
A
final_palettes.zip: https://github.com/Armand1/Iznik/files/3466935/final_palettes.zip
Dear Armand,
I have only now had a chance to look at this due to being remote most of the last week. If you could clarify the following point, I can generate the requested CSVs for you.
When computing frequencies, the simplest solution would be to match each pixel in a motif to a color in one of the palettes (e.g. by nearest neighbor) and increment the matched color's counter by one. However, one flaw is that even the nearest-neighbour match of a pixel might not be well represented in the palette: say, a yellow might be matched with a shade of red if there is no yellow (as we perceive it) in the palette.
What do you say about this? Am I missing something?
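The nearest-neighbour counting just described, together with the flagged flaw, might look like this in Python; the palette, the pixels, and the distance cap are all hypothetical:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical 3-colour palette (RGB 0-255): red, blue, whitish ground.
palette = np.array([[200, 30, 40], [20, 60, 180], [240, 240, 230]])
tree = cKDTree(palette)

# Pixels of one motif: mostly blue, a red spot, and one yellow outlier
# that has no good counterpart in the palette.
pixels = np.array([[25, 65, 175]] * 8 + [[195, 35, 45]] + [[230, 220, 30]])

dist, idx = tree.query(pixels)  # nearest palette colour per pixel
freq = np.bincount(idx, minlength=len(palette)) / len(pixels)
print(freq)  # 0.8 blue, 0.2 "red": the yellow is forced onto red

# The flaw raised above: every pixel gets *some* match. One mitigation is
# to flag matches beyond a (hypothetical) distance cap as unrepresented.
unmatched = dist > 120
print(unmatched.sum())  # the one yellow outlier
```

The distance cap is one possible patch; the alternative discussed next in the thread is to avoid post-hoc matching entirely by tracking each pixel through the clustering hierarchy.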
hi Salim,
I think that's the wrong way to do it.
What I imagined is that you re-generate the palettes, keeping track of every pixel's identity in the hierarchy as you go until you reach the final palette. This is how I described it above.
https://github.com/Armand1/Iznik/issues/1#issuecomment-505809355
Is that clearer?
FROM ABOVE
In tulip x we find an RGB(0,76,153) pixel -- dark blue. In tulip y we find an RGB (51,51,255) pixel -- another blue. By successive rounds of clustering, within images (twice), and then among images, we assign them both to one colour: RGB(0,0,255)--blue. In this manner, all pixels in all images, are assigned to one of 30 colors in our final palette. I imagine that all the tulips shown here will get just two colors: a single dark blue and a red. (Or maybe two blues?)
Then we determine the frequency of each colour in each motif. In tulip x, 0.99 of the pixels are blue; in y, 0.05 are. So tulip x is nearly all blue, while y has just a spot of blue. We cluster on those frequencies --- easy to do now, since our data are just a 30x2000 matrix and the colors are nicely defined. Chances are that tulips x and y will end up in different clusters. So, exactly like the data given in this image --- just better, via your multi-level clustering.
Perhaps you see, now, why I wanted to work with pixels: I can understand how to use them to get color-frequency information for each of our motifs. I don't know how to do that using histograms! It seems to me that this information is lost. But I concede that I may be wrong since I don't really understand histograms.
I must have missed/forgotten this, and therefore took a different approach to the clustering. The thing is, I initiate clustering directly at the motif level and work with histograms, rather than considering each pixel individually (my first stage is also based on k-means). Therefore it won't be as straightforward as it sounds to make the required changes and link each pixel to a color from the palette. That being said, it is still possible. I will need to analyse the algorithm and see what needs to be done. Sorry about the misunderstanding.
Just wanted to give a heads up. I implemented the missing bit over the weekend and let the algorithm run overnight, but it was killed due to lack of memory. I will free up some memory and re-run it tonight, when the machine is not being used by any other memory-consuming jobs in the background.
Dear Armand,
Please see the following zip, which, hopefully, includes everything you asked for: 4 CSV files for the palettes (including the expert one) and 3 for the frequencies of colors in each motif (4078 in total). Here are two examples of what individual pixels (~10K from a motif) are assigned to after successive rounds of clustering:
I am afraid I cannot send you the pixel results because it kept crashing whenever I wanted to keep it in memory (it has more than 22M entries).
In the CSV files you will see that there are entries from 0 to 17 (or 0 to 29 for the 30-color palette), which represent the number of times (frequency) pixels are assigned to each palette color, and their respective percentages (divided by the total number of pixels in a motif). For instance, freq_i refers to the ith color in the palette.
The motif IDs (e.g. AM02_2) should match the individual motif files I had sent before, but there might be some missing due to some checks I did while processing them. I know this is not ideal, but this was what I could do in the limited time I had before leaving for holiday.
Here is a look at the three palettes for a few selected tulips
18 constrained: 18 unconstrained 30 unconstrained
So they all look great --- and very similar. Interestingly, the 30-colour palette does not add a whole lot of blues. The average tulip has about the same number of colors --- 6 or 7 --- under all the palettes. So that's good.
There is one curiosity: in all cases, no red is assayed in AM5_003_t (bottom). But it's clearly visible in the tulip itself.
Dear Armand,
I had a chance to look at them now. Looks really good. Yes, re: 18 vs 30, you mostly get different shades in the latter.
Re: AM5_003_t. I don't have a clear answer, but my suspicion is that the red might have disappeared at one of the clustering levels. It also contains a slightly different shade of red (compared to the one right above it, for instance), and if you check its bins closely you might match it with the color of the 1st and 5th bins --- or my perception is playing games with me now...
The color histograms and means that we have are not ideal. They lose the information about the colors of individual pixels. This means that I cannot say that an individual pixel is "red" or "blue" and, by extension, I cannot say that an individual motif is 20% "red" and 80% "blue"
This matters, I think. The Iznik artists used a very limited palette: red, blue, green, turquoise, white, black and perhaps one or two other colors. But any one of these colors can have a variety of values depending on the quality of the image (how it was photographed) or perhaps even genuine variation in the pigments used (the precise shade of red). So I think we want to group all the pixels with a certain range of RGB values into a particular color and call it "red" or whatever the case may be.
My idea is that we should do this:
(1) get, for all images, the RGB values of all pixels.
(2) for tractability, randomly sample a few million or so. There are 256^3 = 16,777,216 possible colors in the RGB system, but we'll have far fewer than that.
(3) put them all into some huge k-means cluster analysis (or a Gaussian mixture model, perhaps after t-SNE). We should find that they form a limited number of clusters, no more than 10 I would guess. Name each cluster by its central color (e.g., "red").
(4) Classify each pixel in each image by its color-cluster.
(5) Summarize the colors of each motif as fractions of pixels in each color-class. e.g., a given tulip is 0.8 blue and 0.2 red.
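Steps (2)-(5) above could be sketched as follows; real image pixels are replaced here by simulated samples drawn around two invented pigment colors, a blue and a red:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Steps (1)-(2) stand-in: instead of pixels from real images, simulate a
# random sample drawn around two invented pigments.
blue = rng.normal([30, 60, 180], 15, size=(8000, 3))
red = rng.normal([190, 40, 50], 15, size=(2000, 3))
sample = np.clip(np.vstack([blue, red]), 0, 255)

# Step (3): k-means over the sampled pixels defines the colour classes.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample)

# Steps (4)-(5): classify one motif's pixels and report the fraction of
# pixels in each colour class (here a 0.9 blue / 0.1 red tulip).
motif = np.vstack([
    np.clip(rng.normal([30, 60, 180], 15, size=(900, 3)), 0, 255),
    np.clip(rng.normal([190, 40, 50], 15, size=(100, 3)), 0, 255),
])
fractions = np.bincount(km.predict(motif), minlength=2) / len(motif)
print(np.sort(fractions))
```

Fitting on a random sample and then classifying every pixel with `predict` keeps memory manageable even when the full pixel set runs into the tens of millions.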
The virtue of this scheme is that it reflects the discrete palette of the Iznik artists. Also, we can talk intelligibly about the colors of tulips from the numbers. Currently my summary numbers (e.g., mean RGB values) tell me that all motifs are basically some muddy color (shades of brown, basically). I exaggerate slightly --- but not by much: here are the mean RGB colors of my tulips