biocore / redbiom

Sample search by metadata and features

Get representative subset for a metadata category #67

Closed cuttlefishh closed 1 year ago

cuttlefishh commented 6 years ago

In the EMP paper, we created a subset of 2000 samples with even representation across empo_3 categories (subset_2k) to make the dataset less biased toward certain empo categories and the trading cards more meaningful. It would be nice if Redbiom could do this too. One could imagine:

  1. choose a context, which will define the sample set
  2. choose a metadata category, e.g. qiita_empo_3
  3. choose the number of samples in the subset
  4. result: a list of samples evenly distributed across empo_3 categories; if some categories run out of samples, the remaining categories will be used to fill until the total requested is reached
  5. this set of samples would then be used, for example, to see which samples a given sequence is found in, and then compare that sample distribution (and its metadata) to the whole subset.
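
Steps 1 to 4 above could be sketched roughly as follows. This is a minimal illustration, not redbiom functionality: the function name `even_subset` and its parameters are hypothetical, and the input is assumed to be a plain mapping from sample ID to category value (e.g. qiita_empo_3) that has already been fetched for the chosen context. Samples are drawn round-robin across categories, so categories that run out simply drop out and the rest keep filling until the requested total is reached.

```python
import random
from collections import defaultdict


def even_subset(sample_to_category, n, seed=0):
    """Pick up to n sample IDs spread evenly across category values.

    sample_to_category: dict of sample ID -> category value (assumed input,
    obtained separately for the chosen context).
    """
    rng = random.Random(seed)

    # Group sample IDs by their category value.
    by_category = defaultdict(list)
    for sample_id, category in sorted(sample_to_category.items()):
        by_category[category].append(sample_id)

    # Shuffle each group so the draw within a category is random.
    pools = []
    for category in sorted(by_category):
        pool = by_category[category]
        rng.shuffle(pool)
        pools.append(pool)

    # Round-robin: take one sample per non-empty category per pass.
    # Exhausted categories are skipped, so the others fill the remainder.
    chosen = []
    while len(chosen) < n and any(pools):
        for pool in pools:
            if pool and len(chosen) < n:
                chosen.append(pool.pop())
    return chosen
```

For example, `even_subset(mapping, 2000)` would return a list of at most 2000 sample IDs; if fewer samples exist in total, all of them are returned. Per-category counts can differ by one when the total is not evenly divisible.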

wasade commented 1 year ago

I'm going to close this as out of scope for redbiom. Redbiom is specifically tasked with storing, searching, and fetching sample data and metadata. I'm hesitant to expand the scope of the project to include subsampling regimes, and would rather encourage a downstream tool to implement that.