biocore / redbiom

Sample search by metadata and features

Discrepancy between metadata search results & piped fetch results #125

Open nvpatin opened 1 year ago

nvpatin commented 1 year ago

I am trying to download a set of samples based on metadata information. When I search with my parameters, I find a certain number of samples; but when I pipe those results into 'redbiom fetch' (with a particular context) it downloads a different number of samples. I think there is a similar problem when I pipe the search results into 'redbiom summarize contexts'; it shows a list of contexts, some of which are associated with my samples but some of which are not, and I have to guess which one I have to use for fetching. So I have two questions: 1) How can I see the contexts associated only with my searched samples? and 2) How can I only fetch the samples associated with my metadata search? See below for the problems associated with question 2.

Looking for marine water samples within the EMP

% redbiom search metadata "where qiita_study_id == 13114 and empo_4 == 'Water (saline)'" | wc -l
39

Defining a context based on previous search results (it took several attempts to find one that worked)

% echo $CTX
Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b

Fetching samples based on metadata and context

% redbiom search metadata "where qiita_study_id == 13114 and empo_4 == 'Water (saline)'" | redbiom fetch samples --context $CTX --output EMP_marine_samples.biom
38 sample ambiguities observed. Writing ambiguity mappings to: EMP_marine_samples.biom.ambiguities

Data summary shows many more samples than metadata search originally found

% biom summarize-table -i EMP_marine_samples.biom | head
Num samples: 97
Num observations: 16,547
Total count: 1,354,853
Table density (fraction of non-zero values): 0.030

Counts/sample summary:
Min: 4,111.000
Max: 38,769.000
Median: 12,268.000
Mean: 13,967.557

nvpatin commented 1 year ago

Update: I see that the list of samples found in the metadata search and the list of samples in the downloaded biom table do match, but the biom table seems to have sub-set the samples. For example, "13114.palenik.42.s001" in the sample list corresponds to the sample IDs "13114.palenik.42.s001.134469" and "13114.palenik.42.s001.134523" in the biom table. The sample IDs in the metadata table match the list of sample IDs in the biom table, but all the metadata values are identical within each sample "grouping", e.g. "13114.palenik.42.s001.134469" and "13114.palenik.42.s001.134523" have exactly the same metadata.

Is there documentation about how and why that sub-sampling was done? I guess I can combine sample replicates (if that's what they are).
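To illustrate the grouping described above, here is a minimal sketch that collapses tagged biom sample IDs back to their base sample names and counts how many tagged IDs fall under each one. It assumes the tag is always the last dot-separated field and is purely numeric, which is inferred only from the two IDs quoted above. `strip_run_tag` is a hypothetical helper name, not part of redbiom or biom.

```shell
# Hypothetical helper: drop the final dot-separated numeric field from each ID.
# Assumption: the trailing tag is always numeric, as in the two IDs quoted above.
strip_run_tag() {
  sed -E 's/\.[0-9]+$//'
}

# The two tagged IDs from the comment above; in practice this would be the
# full list of sample IDs taken from the fetched table.
printf '%s\n' \
  "13114.palenik.42.s001.134469" \
  "13114.palenik.42.s001.134523" \
  | strip_run_tag | sort | uniq -c | awk '{ print $2 "\t" $1 }'
```

Each output line is a base sample name followed by the number of tagged IDs that map to it, so any count above 1 marks a sample that appears more than once in the table.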

antgonza commented 1 year ago

@nvpatin, thank you for the question and update. I think @justinshaffer might be able to answer your question.

wasade commented 1 year ago

Hi @nvpatin, sorry for a brief delay, I was OOO the last few days.

For (1), that is an excellent idea. It is not currently exposed to the user, but it would be a great addition. I would be happy to suggest a way to do this via a bash script or Python as a stopgap.

For (2), the issue is that the same physical sample has been sequenced multiple times. The command shown is correct, but each individual sequencing run is differentiated. These "ambiguities" are expressed in the resulting ambiguity map. You can get around this by specifying --resolve-ambiguities with the call to fetch. For redbiom fetch samples, I usually do --resolve-ambiguities merge which combines the sample data from multiple runs together.
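As a sketch of the suggestion above, the original pipeline can be rerun with `--resolve-ambiguities merge` added to the fetch call. The query and context are the ones from the first post. The output filename `EMP_marine_samples.merged.biom` is only an illustrative choice, and the `command -v` guard is there so the snippet does nothing when redbiom is not installed.

```shell
# Query and context copied from the original post.
QUERY="where qiita_study_id == 13114 and empo_4 == 'Water (saline)'"
CTX="Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b"
# Illustrative output name (an assumption, not from the thread).
OUT="EMP_marine_samples.merged.biom"

if command -v redbiom >/dev/null 2>&1; then
  # Merge data from multiple sequencing runs of the same physical sample.
  redbiom search metadata "$QUERY" \
    | redbiom fetch samples --context "$CTX" --resolve-ambiguities merge --output "$OUT"
else
  echo "redbiom not found on PATH; skipping fetch" >&2
fi
```

With the runs merged, the number of samples reported by `biom summarize-table -i "$OUT"` should be much closer to the count from the metadata search, although samples absent from the chosen context would still be missing.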

If you haven't seen it, there is a longer tutorial on using redbiom on the QIIME 2 forum.

nvpatin commented 1 year ago

Thank you @wasade, that's very helpful! I will check back for future functionality that provides contexts associated with samples in the metadata search results.