Armand1 opened this issue 5 years ago
It's clear to me that there is a fundamental problem in the images. They are taken under wildly different lighting conditions.
Here are two plates, and a sample of 1000 pixels from their segments. They are actually very similar, but look at the distribution of colour values:
So, it's clear to me that, before we find any palettes, we need to standardize the image brightness --- luminance --- in some fashion. There are a pretty large number of dark images. This is why we can't get really bright colours in our palette, and why the colour clustering is proving such a headache.
So either these dark images need to be removed from the dataset, or they need to be standardized to increase their brightness, or else all the images need to be standardized to some common luminance. However it is done, the palettes then need to be found again.
Here I have extracted all the pixels from the total image (the "-a") files, sampled 1000 at random (it's a huge file), calculated the sumRGB of each pixel and extracted the top 10% for each image: the lightest pixels. These will tend to be the background, as can be seen by the fact that they're mostly beige or whitish. I have ordered them by mean sumRGB. I think that the background, under optimal lighting conditions, should be the same. But this shows the range of variation, which is considerable. Can this be used to standardize the RGB values?
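The sampling-and-top-10% step could look something like this. A minimal sketch: the function name and the toy pixel array are illustrative, not the project's actual files or code.

```python
import numpy as np

def lightest_pixels(pixels, top_frac=0.10, n_sample=1000, seed=0):
    """Sample n_sample pixels, rank by sumRGB, return the lightest top_frac."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pixels), size=min(n_sample, len(pixels)), replace=False)
    sample = pixels[idx]
    sum_rgb = sample.sum(axis=1)                 # R + G + B per pixel
    cutoff = np.quantile(sum_rgb, 1 - top_frac)  # sumRGB threshold for the top 10%
    return sample[sum_rgb >= cutoff]

# toy usage: 5000 random "pixels" standing in for one image's -a file
pixels = np.random.default_rng(1).integers(0, 256, size=(5000, 3))
bg = lightest_pixels(pixels)   # mostly-background candidates
```

The mean sumRGB of `bg` for each plate is then the per-image statistic that gets ordered and compared across plates.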
This shows a violin plot of the full range of values for each of the images (well, 1000 randomly sampled pixels from each). Here the values are L, for luminance, which is highly correlated with sumRGB (as above).
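The luminance/sumRGB correlation is easy to sanity-check. A sketch: the plot's L is presumably CIE L*, but here I use Rec. 709 relative luminance as a simple proxy (both are monotone in brightness), on illustrative random pixels rather than the real plates.

```python
import numpy as np

rng = np.random.default_rng(0)
px = rng.integers(0, 256, size=(1000, 3)).astype(float)  # stand-in pixel sample

sum_rgb = px.sum(axis=1)
# Rec. 709 relative luminance weights (one common definition of luminance)
lum = 0.2126 * px[:, 0] + 0.7152 * px[:, 1] + 0.0722 * px[:, 2]

r = np.corrcoef(sum_rgb, lum)[0, 1]   # strong positive correlation
```

On real photographs, where the channels themselves are correlated, r is even closer to 1.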
Here is one of the darkest and one of the lightest plates. These are not the absolute extremes, since some images have loads of blue and really are dark. But this shows the problem vividly, since these plates are clearly related: they have exactly the same carnations, for example.
I adjusted the distribution of the dark plate in the following way. I reasoned that the lightest 10% of pixels represent the background, which should be the same colour. So I obtained the mean sumRGB of those pixels for each plate and calculated a correction factor. I multiplied each pixel of the darkest plate by this correction factor, in effect, brightening the whole thing.
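That adjustment can be sketched as follows. A minimal sketch under stated assumptions: the function names are hypothetical, the pixel arrays are toy data, and the correction is the simple background-ratio multiplication described above.

```python
import numpy as np

def correction_factor(dark_pixels, ref_pixels, top_frac=0.10):
    """Ratio of mean background (top-10% sumRGB) brightness: reference / dark."""
    def bg_mean(px):
        s = px.sum(axis=1)
        return s[s >= np.quantile(s, 1 - top_frac)].mean()
    return bg_mean(ref_pixels) / bg_mean(dark_pixels)

def brighten(pixels, factor):
    # multiply every channel, then clip back into the valid 8-bit range
    return np.clip(pixels * factor, 0, 255).astype(np.uint8)

dark = np.random.default_rng(2).integers(0, 128, size=(1000, 3))   # dark plate
ref  = np.random.default_rng(3).integers(64, 256, size=(1000, 3))  # lighter plate
f = correction_factor(dark, ref)
adjusted = brighten(dark, f)
```

Note the clipping: pixels that would exceed 255 saturate, which is one reason the match after scaling can't be perfect.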
Original colours vs. new colours:
That seems to be an improvement, though it certainly isn't a great match. I tried this using luminance in CIE Lab space as well, but that involved conversion back and forth; this approach is better since it preserves the distribution in RGB space.
It's unclear whether it's better to calculate the correction factor from just the brightest pixels or from an overall mean.
Another way of improving matters might be to additionally filter for just the brightest pixels overall. Here is the result for the brightest 66% after adjustment. This means that we ignore the darkest pixels in the darkest images.
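The extra filter is a one-liner on top of the adjustment. A sketch, with an illustrative array and the 66% threshold from above; the function name is hypothetical.

```python
import numpy as np

def keep_brightest(pixels, keep_frac=0.66):
    """Keep the brightest keep_frac of pixels, ranked by sumRGB."""
    s = pixels.sum(axis=1)
    cutoff = np.quantile(s, 1 - keep_frac)   # discard the darkest 34%
    return pixels[s >= cutoff]

px = np.random.default_rng(4).integers(0, 256, size=(3000, 3))  # toy adjusted pixels
kept = keep_brightest(px)
```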
In any event, you can see that we have brought the colours of these two images considerably into line. After doing so, we should THEN do Salim's colour clustering.
This is a preliminary clustering of colours with the adjustment and filters applied: lookatnormalization_3_small.pdf
It is now clear to me that the main problem is the photographs that we took from books! There were a bunch of them. This is a plot of the normalization factor (based on the lightest 10% of pixels) for each image, relative to the lightest image. The books (Gulbenkian catalogue, Koc catalogue, Atsoy etc.) are dark. I think the best thing to do is eliminate these from the clustering of the segments. They can still be used for the ML.
This will reduce our sample size by about half I think.
This is a 30-colour palette from all images.
This is a 30-colour palette from just the images taken from the web, that is, not from books.
Not very different. I think we should still do some pre-palette-finding processing: remove dark and light pixels and normalize brightness.
I have been looking at Yuchen's clustering results for tulips. These are in his data file "nonbook-cluster-results.csv".
I have established several folders of ground truth tulips: classifying them into "blue_green", "blue_red_green", "red" and so on. They can be found here
https://www.dropbox.com/sh/tnwa929plsdayi6/AADn9y8p5j3uzFWaR6if6FmEa?dl=0
Note that Dropbox is currently updating, so you can't see them all yet. But the full list can be seen here:
tulip_colour_groundtruth.csv.zip
Now, I just looked at the colours in each of my ground-truth groups (no clustering here), just to see that they make sense. They should, more or less: in "blue_greens" you should see shades of blue overwhelmingly, some green, and not much else.
But the whole thing is wrong. These are the "red" tulips --- they don't have any red!
For example, MET28_0003_t.jpg
Here is the full analysis
Now, I may have screwed up but, if so, I can't see it.
@Yuchen I suggest that you go back and look at the motifs to see if you also find that the inferred colours don't match. You can use the ground truth too.
Had a quick look; seems like there's a mismatch between the palette CSV and the code used for the frequency data. Will look into it tomorrow.
When you provide me with a new data set/s please label the files with the date so that we have version control.
I also have a sense that we do not have full control of the metadata and segmented images. For some reason I find a few hundred more "non book" segmented motifs than you did. I am going to generate new versions which, I hope, will be perfect. Will keep you posted.
I had a look today and I couldn't see any obvious problem... Need to do more experiments to see what's going on; it might take a bit. Please do update me on whatever you've found; it might be useful.
Worst-case scenario, I will have to rewrite the clustering part.
I doubt that --- Salim's palettes may not have been perfect but they did make sense relative to the images themselves. Do verify my result for yourself.
I went through the frequency calculation function and it seems fine - no obvious steps that could mess up the palette order. I have two suspects: one is that it has something to do with me breaking the whole process down into several batches (because it takes up too much memory); two is that it has something to do with using all motifs to generate the palette while I am only analysing the tulips in the end, so the colours won't match 100%. Ground truth is very helpful; I'll run some experiments and get back to you, hopefully soon.
Please do let me know if you can think of anything.
Seems I've found the issue: when I run Salim's code for assigning palette labels/mappings for all the motifs, it seems like some of them are messed up. See examples AM03 (1st one) and MET28 (quite in the middle):
It seems like the mapping is quite off for MET28. I will have a closer look at this.
This is AM01
This is MET28
They're pretty similar. Both have lots of blue, green and red. It's unclear which is messed up. (It's also rather unclear which of the palettes refers to which.)
The above is AM03. The first is the final 18-colour palette, the third is Salim's clustering for the original photo, and the second is transformed from the original one using the 18-colour palette. It seems that for this one, the original palette is already a bit off.
Same situation with the palette for MET28: although the original palette seems a bit more okay, the mapping (from the third palette to the second) is very off.
Thanks for the input and I will have a look.
@gregiee So the palette that you showed looks reasonable. I think we should go with it.
But I now need (1) the frequency data; (2) the RGB palette; (3) to know whether that is expert-biased or not. Why don't I know this? I need you to tighten up.
The color histograms and means that we have are not ideal. They lose the information about the colors of individual pixels. This means that I cannot say that an individual pixel is "red" or "blue" and, by extension, I cannot say that an individual motif is 20% "red" and 80% "blue"
This matters, I think. The Iznik artists used a very limited palette: red, blue, green, turquoise, white, black and perhaps one or two other colors. But any one of these colors can have a variety of values depending on the quality of the image (how it was photographed) or perhaps even genuine variation in the pigments used (the precise shade of red). So I think we want to group all the pixels with a certain range of RGB values into a particular color and call it "red" or whatever the case may be.
My idea is that we should do this:
(1) get, for all images, the RGB values of all pixels.
(2) for tractability, randomly sample a few million or so. There are 256^3 = 16,777,216 possible colors in the 8-bit RGB system, but we'll have far fewer than that.
(3) put them all into some huge K-means cluster analysis (or a Gaussian mixture model, perhaps after t-SNE). We should find that they form a limited number of clusters, no more than 10 I would guess. Name the clusters by the central color (e.g., "red").
(4) Classify each pixel in each image by its color-cluster.
(5) Summarize the colors of each motif as fractions of pixels in each color-class. e.g., a given tulip is 0.8 blue and 0.2 red.
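Steps (1)-(5) above can be sketched end-to-end like this. A minimal sketch using scikit-learn's KMeans; the cluster count, sample sizes and random pixel data are illustrative stand-ins, not the project's settings or images.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# (1)-(2): stand-in for the pooled, subsampled pixels from all images
all_pixels = rng.integers(0, 256, size=(20000, 3)).astype(float)

# (3): cluster the pooled pixels into a small discrete palette
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(all_pixels)
palette = km.cluster_centers_          # one RGB centre per colour-class

# (4): classify each pixel of one motif by its nearest palette colour
motif = rng.integers(0, 256, size=(500, 3)).astype(float)
labels = km.predict(motif)

# (5): summarise the motif as fractions of pixels per colour-class,
#      e.g. "0.8 blue, 0.2 red" once the centres are named
fractions = np.bincount(labels, minlength=8) / len(labels)
```

The naming step is manual: inspect each row of `palette` and call it "red", "blue", etc. The per-motif `fractions` vector is then directly interpretable, unlike a mean RGB.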
The virtue of this scheme is that it reflects the discrete palette of the Iznik artists. Also we can talk intelligibly about the colors of tulips from the numbers. Currently my summary numbers (e.g., mean RGB values) tell me that all motifs are basically some muddy color (shades of brown basically). I exaggerate slightly --- but not by much: here are the mean RGB colors of my tulips