davanstrien / IIIF-ML-experiments


Sampling potential images to put in the annotation queue #12

Open davanstrien opened 3 years ago

davanstrien commented 3 years ago

I have thought a little bit about how we might do this. Some rough notes below:

The problem

Our data is likely to be skewed towards particular institutions that have gone all-in on IIIF/Europeana. If we do simple random sampling we will end up with many more items in the sample from those institutions. This would often be the desired outcome, i.e. if the Europeana data was our 'population' we'd probably want to generate a representative sample of that population. Since we are interested in knowing whether particular features

What we want

Possible solutions

Institution Count
A 500
B 200
C 50
D 20
Total 770

Say we want a sample size of 200 in this case.
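For contrast, here is a minimal sketch of what simple random sampling would give in expectation for the counts in the table above. The variable names are only illustrative; nothing here comes from the project code.

```python
# Institution counts copied from the table above.
counts = {"A": 500, "B": 200, "C": 50, "D": 20}
sample_size = 200

total = sum(counts.values())  # 770

# Under simple random sampling, each institution's expected share of the
# sample is proportional to its share of the whole collection.
expected = {name: sample_size * n / total for name, n in counts.items()}

for name, value in expected.items():
    print(f"{name}: {value:.1f}")
# A: 129.9
# B: 51.9
# C: 13.0
# D: 5.2
```

So roughly 130 of the 200 items would come from A and only about 5 from D, which is the skew described under "The problem".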

Divide the desired sample size by the number of institution classes to get the 'ideal' number of items to take from each class.

200/4 = 50

For the classes where the total number of items is <= this 'ideal', take all of the available examples. In this example, that is C (50) and D (20).

Add up the number of items taken in this initial pass: 50 + 20 = 70.

Subtract this from the desired sample size:

200-70 = 130.

Take this number and divide it by the number of classes still left to sample from, in this case 2:

130/2 = 65

Again, where the total number of items in a class is <= this new 'ideal', take all of the available examples for that class. In this example, this doesn't apply, since both A (500) and B (200) have more than 65 items. If any class is used up at this step, repeat the calculation to get the desired sample size for each remaining class; in this case, it remains the same.

Take 65 each from A and B.

Now we have 20 + 50 + 65 + 65 = 200.
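The steps above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: `allocate_quotas` is a hypothetical helper name, the counts are plain integers per institution, and any remainder from an uneven final split is handed to the largest remaining classes, which the notes above do not specify.

```python
def allocate_quotas(counts, sample_size):
    """Return how many items to take per class, capping small classes."""
    quotas = {}
    remaining = dict(counts)  # classes not yet assigned a quota
    budget = sample_size      # items still to be allocated

    while remaining:
        ideal = budget / len(remaining)
        # Classes too small to supply the 'ideal' share are taken whole.
        small = {name: n for name, n in remaining.items() if n <= ideal}
        if not small:
            break
        for name, n in small.items():
            quotas[name] = n
            budget -= n
            del remaining[name]

    if remaining:
        # Split what is left evenly; any remainder goes to the largest
        # classes first (an assumption, not stated in the notes).
        base, extra = divmod(budget, len(remaining))
        largest_first = sorted(remaining, key=remaining.get, reverse=True)
        for i, name in enumerate(largest_first):
            quotas[name] = base + (1 if i < extra else 0)

    return quotas


quotas = allocate_quotas({"A": 500, "B": 200, "C": 50, "D": 20}, 200)
print(quotas)  # {'C': 50, 'D': 20, 'A': 65, 'B': 65}
```

The actual items could then be drawn per institution with something like `random.sample(items_for_institution, quotas[name])`, where `items_for_institution` is a hypothetical list of item identifiers for that institution.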

I am no statistician (as is very obvious here). This may be simultaneously more crude and more complicated than we need. I will try to read a stats book now...