Get representative subset for a metadata category

In the EMP paper, we created a subset of 2000 samples with even representation across empo_3 categories (subset_2k) to make the dataset less biased toward certain empo categories and the trading cards more meaningful. It would be nice if Redbiom could do this too. One could imagine:

choose a context, which will define the sample set
choose a metadata category, e.g. qiita_empo_3
choose the number of samples in the subset
result: a list of samples evenly distributed across empo_3 categories; if some categories run out of samples, the remaining categories will be used to fill until the total requested is reached
this set of samples would then be used, for example, to see which samples a given sequence is found in, and then compare that sample distribution (and its metadata) to the whole subset.
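
For illustration, here is a rough sketch of the selection step in plain Python. It does not use redbiom's API: the function name `representative_subset` and its input (a dict mapping sample ID to the category value, e.g. the qiita_empo_3 value for each sample in the chosen context) are hypothetical. It draws one sample per category in rounds, so categories that run out simply drop out and the remaining ones fill the rest.

```python
import random
from collections import defaultdict


def representative_subset(sample_to_category, n, seed=None):
    """Pick up to n sample IDs spread as evenly as possible across categories.

    sample_to_category is a hypothetical input: {sample_id: category_value}.
    """
    rng = random.Random(seed)

    # Group sample IDs by category value (sorted for reproducibility with a seed).
    by_category = defaultdict(list)
    for sample_id, category in sorted(sample_to_category.items()):
        by_category[category].append(sample_id)

    # One shuffled pool of sample IDs per category.
    pools = [by_category[category] for category in sorted(by_category)]
    for pool in pools:
        rng.shuffle(pool)

    chosen = []
    while len(chosen) < n and pools:
        # Randomize the visiting order so a partial last round does not
        # always favor the same categories.
        rng.shuffle(pools)
        for pool in pools:
            if len(chosen) == n:
                break
            chosen.append(pool.pop())
        # Drop exhausted categories; the others keep filling the subset.
        pools = [pool for pool in pools if pool]
    return chosen
```

If fewer than n samples exist in total, the sketch returns all of them rather than raising an error; whether that is the desired behavior is an open design choice.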