Illogical results from redbiom summarize metadata-category

rahel31 commented 5 years ago

Hi,

I would like to use redbiom to check in which types of samples a taxon of interest has been observed the most. I used the command below with ctx = Pick_closed-reference_OTUs-Greengenes-Illumina-16S-V4-150nt-bd7d4d

redbiom search taxon g__Pseudomonas --context $ctx | redbiom search features --context $ctx | redbiom summarize samples --category sample_type >> Pseudomonas.txt

Thereafter I wanted to compare the numbers I got to the overall number of samples per sample_type:

redbiom summarize metadata-category --category sample_type --counter >> All.txt

To my surprise, a few categories, such as "vaginal" and "Bulk water", had lower counts in the overall counter than in the counter for a specific taxon.

| | Pseudomonas | Carnobacteriaceae | Micrococcaceae | All_samples |
|---|---|---|---|---|
| vaginal | 651 | 240 | 396 | 297 |

Could this be because the second command does not take the context into account? Though in that case the numbers should be too big, not too small... Any idea what could be going on here, and how to get the real "all samples" number so I could normalize by it?

Thank you in advance.

wasade commented 5 years ago

Hi @rahel31, I'm terribly sorry for such a delay here -- I didn't see this issue come through in email so it unfortunately slipped.

As you note, summarize metadata-category does not take the context into account; it summarizes only the sample metadata. One possible source of the discrepancy is that samples could have been sequenced multiple times. This is reflected within a context: sample metadata holds a one-to-many relationship with both bioinformatic processing and technical handling.

I'm investigating this issue right now though, and will follow up shortly.

wasade commented 5 years ago

Hi @rahel31, it does appear to be technical replicates driving the difference. See the output below.

First, recreating your observation:

$ ctx=Pick_closed-reference_OTUs-Greengenes-Illumina-16S-V4-150nt-bd7d4d
$ redbiom search taxon g__Pseudomonas --context $ctx | redbiom search features --context $ctx | redbiom summarize samples --category sample_type >> Pseudomonas.txt
$ grep vaginal Pseudomonas.txt
vaginal 651
vaginal mucus   35

Next, I'm pulling out just the sample IDs from your query:

$ redbiom search taxon g__Pseudomonas --context $ctx | redbiom search features --context $ctx > Pseudomonas_ids.txt
$ head -n 5 Pseudomonas_ids.txt
66115_10317.000102704
44738_2260.SA13009.ROB12mic.13
56280_11757.G441802031
48016_10317.000038384
45865_11116.A01B100.1198018
$ wc -l Pseudomonas_ids.txt
82837 Pseudomonas_ids.txt

The structure of these IDs is <qiita_artifact_id>_<qiita_sample_id>. More information about this structure can be found here. In brief, it lets us support a one-to-many relationship between a physical sample and its processing (e.g., replicates, multi-omic, etc).
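To make the ID structure concrete, here is a minimal sketch that splits one of these IDs at the first underscore using plain shell parameter expansion. `split_id` is a hypothetical helper written for this illustration, not something redbiom provides, and it assumes the artifact ID itself never contains an underscore.

```shell
# split_id is a hypothetical helper, not part of redbiom.
# It splits <qiita_artifact_id>_<qiita_sample_id> at the FIRST underscore,
# assuming the artifact ID itself contains no underscore.
split_id() {
    id=$1
    artifact=${id%%_*}   # everything before the first underscore
    sample=${id#*_}      # everything after the first underscore
    printf '%s\t%s\n' "$artifact" "$sample"
}

# Two IDs taken from the listing above.
split_id 66115_10317.000102704
split_id 44738_2260.SA13009.ROB12mic.13
```

Splitting at the first underscore, rather than taking only the second field, keeps the whole sample ID intact even if a sample ID were ever to contain an underscore of its own.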

If we strip off the artifact ID, and reassess the counts, we observe a count that is lower than the "All.txt" observation:

$ cut -d "_" -f 2 Pseudomonas_ids.txt | sort |  uniq | wc -l
78679
$ cut -d "_" -f 2 Pseudomonas_ids.txt | sort |  uniq | redbiom summarize samples --category sample_type >> Pseudomonas_unique.txt
$ grep vaginal Pseudomonas_unique.txt
vaginal 217
vaginal mucus   35
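The gap between 82837 IDs and 78679 unique sample IDs comes from sample IDs that appear under more than one artifact. A rough sketch for listing those sample IDs is below; `count_replicated` is a hypothetical helper (not a redbiom command), and the three IDs fed to it are partly made up purely to demonstrate a repeat.

```shell
# count_replicated is a hypothetical helper, not part of redbiom.
# It reads IDs of the form <artifact>_<sample> on stdin and prints
# "<occurrences><TAB><sample_id>" for every sample ID seen more than once.
count_replicated() {
    sed 's/^[^_]*_//' |          # drop the artifact ID and first underscore
        sort |
        uniq -c |                # count occurrences of each sample ID
        awk '$1 > 1 { print $1 "\t" $2 }'
}

# Illustrative input only: artifact 70001 is invented so that
# sample 10317.000102704 appears twice.
printf '%s\n' \
    66115_10317.000102704 \
    70001_10317.000102704 \
    48016_10317.000038384 | count_replicated
```

Against the real data this could presumably be run as `count_replicated < Pseudomonas_ids.txt`, which would show which physical samples contribute more than one entry to the per-category counts.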

Hope that helps! And again, I'm terribly sorry about the delay in response.

rahel31 commented 5 years ago

Hi, thanks for your reply! I had already given up on this line of thought, but I think this is a great and much-needed tool in the field, and I will certainly try to use it in the future. Regarding this issue, I'm happy you found a workaround. If I understand correctly, the Qiita artifact ID is always unique, but the sample ID would be the same for technical replicates? And when you run summarize samples --category sample_type, it will count only the samples, not their replicates? Thanks!

wasade commented 5 years ago

...and I did it again, this notification slipped in email. And thank you for the kind words.

That is correct, the Qiita artifact ID is always unique. A sample ID is unique to a physical specimen, but is not unique to a preparation -- for example, a specimen may have technical replicates, or also go through 16S sequencing, metagenomics, metabolomics, etc.

Counting is based only on the sample information, not on the preparation. I believe that is the best behavior in this situation, so technical replicates or multiple processings of a sample are not counted separately.
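For the normalization asked about at the top of the thread, one possible sketch is to divide the de-duplicated per-category counts by the overall counts. `fraction_per_category` is a hypothetical helper, and the sketch assumes both files hold `<category><TAB><count>` lines; the exact layout of All.txt is not shown in this thread, so that part is an assumption. The demo uses the two "vaginal" numbers quoted above (217 unique samples with the taxon, 297 overall).

```shell
# fraction_per_category is a hypothetical helper, not part of redbiom.
# ASSUMPTION: both files contain "<category><TAB><count>" lines.
# Prints: category, taxon count, overall count, fraction.
fraction_per_category() {
    taxon_counts=$1
    all_counts=$2
    awk -F '\t' '
        NR == FNR { total[$1] = $2; next }   # first file: overall counts
        ($1 in total) && total[$1] > 0 {
            printf "%s\t%d\t%d\t%.3f\n", $1, $2, total[$1], $2 / total[$1]
        }
    ' "$all_counts" "$taxon_counts"
}

# Demo with the "vaginal" counts quoted in this thread.
tmp=$(mktemp -d)
printf 'vaginal\t217\n' > "$tmp/taxon.txt"
printf 'vaginal\t297\n' > "$tmp/all.txt"
fraction_per_category "$tmp/taxon.txt" "$tmp/all.txt"
rm -r "$tmp"
```

Keep in mind that the overall counts come from the sample metadata alone and are not restricted to the context, so the denominator may include samples that were never processed in that context.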

biocore / redbiom

Illogical results from redbiom summarize metadata-category #85