I have thought a little bit about how we might do this. Some rough notes below:
The problem
Our data is likely to be skewed towards particular institutions that have gone all-in on IIIF/Europeana. If we do normal random sampling, we will end up with many more items in the sample from those institutions. This would often be the desired outcome, i.e. if the Europeana data was our 'population' we'd probably want to generate a representative sample of that population. Since we are interested in knowing whether particular features
What we want
To oversample the smaller institutions (probably up to some threshold amount) so that our sample includes more examples from different institutions compared to a random sample.
Ideally, we want to do this with parameters we can plug in rather than hardcoded values, so that if the Europeana data changes we can reuse the same sampling approach.
We may want a target sample size, but we could also aim for a target fraction of the total data population.
Possible solutions
Count the number of items per institution in the data, so we get something like:
| Institution | Count |
|-------------|-------|
| A           | 500   |
| B           | 200   |
| C           | 50    |
| D           | 20    |
| Total       | 770   |
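A minimal sketch of how these counts might be computed. It assumes each item is a dict with an `institution` field; that field name and the `count_by_institution` helper are hypothetical, not taken from the Europeana data.

```python
from collections import Counter


def count_by_institution(records, key="institution"):
    """Count how many items each institution contributes.

    `key` is an assumed field name; change it to match the real data.
    """
    return Counter(record[key] for record in records)


# Toy records standing in for the real data.
records = [
    {"id": 1, "institution": "A"},
    {"id": 2, "institution": "A"},
    {"id": 3, "institution": "B"},
]
counts = count_by_institution(records)
```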
Say we want a sample size of 200 in this case
Divide the desired sample size by the number of institution classes to get the 'ideal' number of items to take from each class.
200/4 = 50
For the classes whose total count is <= this 'ideal', take all of the available examples. In this example, C and D.
Add up the number of items taken in this initial pass: 50 + 20 = 70.
Subtract this from the desired sample size:
200 - 70 = 130.
Take this number and divide it by the remaining number of classes left to sample from in this case 2:
130/2 = 65
Again, for any class whose total count is <= this new 'ideal', take all of the available examples for that class. In this example, this doesn't apply. If any class is taken in full at this step, repeat the calculation to get the new 'ideal' for each remaining class; in this case, it stays the same.
Take 65 each from A and B.
Now we have 20 + 50 + 65 + 65 = 200.
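The steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: `allocate_sample` is a made-up name, the input is a dict of per-institution counts, and any remainder from an uneven division is handed to the largest remaining classes first (the notes above don't say how to handle that case).

```python
def allocate_sample(counts, target_size):
    """Decide how many items to draw from each institution.

    Classes whose count is at or below the current 'ideal' share are
    taken in full; the leftover budget is re-split among the rest.
    """
    allocation = {}
    remaining = dict(counts)
    # We can never draw more items than exist in total.
    budget = min(target_size, sum(counts.values()))

    while remaining:
        ideal = budget / len(remaining)
        small = {k: n for k, n in remaining.items() if n <= ideal}
        if not small:
            break  # every remaining class can supply its full share
        for name, n in small.items():
            allocation[name] = n
            budget -= n
            del remaining[name]

    if remaining:
        # Split the leftover budget evenly; any remainder goes to the
        # largest classes first (an assumption, see the lead-in).
        base, extra = divmod(budget, len(remaining))
        by_size = sorted(remaining, key=remaining.get, reverse=True)
        for i, name in enumerate(by_size):
            allocation[name] = base + (1 if i < extra else 0)

    return allocation


counts = {"A": 500, "B": 200, "C": 50, "D": 20}
allocation = allocate_sample(counts, 200)
```

With the counts from the table and a target of 200, this gives 65 each for A and B, 50 for C and 20 for D. To work from a target fraction instead, the target size could be computed first, e.g. `round(fraction * sum(counts.values()))`. The actual items would then be drawn per institution, for instance with `random.sample`.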
I am no statistician (as is very obvious here). This may be simultaneously more crude and more complicated than we need. I will try and read a stats book now...