Meta-analysis with EukDetect and some questions

sarpiens commented 1 year ago

Hello thanks for developing this interesting tool!

I'm preparing a meta-analysis, and I wanted to explore the eukaryotic component of the microbiome with EukDetect. However, after reviewing the documentation and the forum I have some questions:

a) readlen parameter: I'm not sure if this is an internal filter (if it is, I haven't seen it indicated in the tool diagram) or if we should filter out readings smaller than 75 bp in the QC step before using the tool?

b) RPKS and EukFrac:

PAIRED vs SINGLE Fastqs: In some cases we may encounter PAIRED and SINGLE Fastqs files, I wonder if in this cases the RPKS and EukFrac metrics obtained by the program are comparable between PAIRED and SINGLE Fastqs? If they are not comparable, which column should be used (I guess Total_marker_coverage based on issue #27)?
Within sample different read lengths: From what I have seeing in the forum, in this case the RPKS and EukFrac metrics should not be used (https://github.com/allind/EukDetect/issues/27). I wanted to confirm that this is still the case.
Different samples with different read lengths: In some cases we may encounter datasets that have been sequenced with different technologies, for instance Illumina MiSeq (300bp) and Illumina Hiseqs (100 or 150 bp), and present different read lenghts. I wonder if in this cases the RPKS and EukFrac metrics obtained by the program are comparable too? If they are not comparable, which column should be used (I guess Total_marker_coverage based on issue #27)?
In the documentation, is stated that "EukFrac is only relative to other eukaryotes and must be considered alongside a proxy for the absolute abundance of taxa, which is RPKS. Directly comparing changes in EukFrac alone between samples is strongly discourage". Could you please provide an example of how this should be done?

c) Abundance Heatmaps and PCAs:

What metrics would you recommend to use in the case we wanted to generate the typical heatmaps of abundance? EukFrac, RPKS or Total_marker_coverage?
What metrics would you recommend to use in the case we wanted to generate the typical ordination plots (based on CLR or Bray-Curtis)? EukFrac, RPKS or Total_marker_coverage?

Thanks in advance for any help that could be provided! Best, Sam

allind commented 1 year ago

Hi Sam, thanks for reaching out.

I think that there are some interesting questions concerning the eukaryotic fraction of the microbiome, and am excited to see what you report, but ultimately a lot of the questions that you’re asking about abundance are getting at a few fundamental problems with surveying the eukaryome using shotgun sequencing. In almost all human microbiome datasets I’ve seen, eukaryotes are very low in abundance. When you are looking at something that comprises so very little of a dataset, the relative abundance of these organisms (and also the chance of observing them at all) is driven much more by depth of sequencing and by chance variation than it is by real biological differences. This is not the case when you have a very high absolute abundance of an organism, as in the case of most meta-analyses of bacterial members of the microbiome, but it is absolutely the case for lower-abundance eukaryotes. I think many interesting questions can be asked of the eukaryome that could be included in your meta-analysis that do not require including relative abundance estimations. (To be clear - think of this work as being something where only focusing on studying the rarest, least abundant members of a community.)

To answer your questions:

a) This isn’t really an internal filter. If reads are shorter than 75 bp, eukdetect as a python package will exit with an error code. If you run the snakemake pipeline by itself, this will not throw an error and you can run it. The purpose of this is to make sure the user knows the potential risks associated with aligning very short reads - it increases the possibility of mis-assigning bacterial reads as eukaryotic. It sounds like you’ll be using the pipeline a lot, so I strongly recommend using the snakemake pipeline directly as you have more control that way.

b)

paired vs. single: I actually think that these could be directly comparable. While eukdetect does use the paired read mode in bowtie2 when possible, reads being counted treat the pairs differently. I can’t see why they wouldn’t be comparable.
within-sample different read lengths: If you’re working with a sample that has been trimmed, and untrimmed reads are not available because you’re doing a meta-analysis, this is still true. It sounds like a workaround for this would be helpful, so I’ll think about this.
different samples with different read lengths: You’re right about this - total marker coverage is comparable between the same species but the others would not be.
the documentation talking about absolute abundance is referring to the issue I talked about at the beginning of this response - specifically, relative abundance is relatively meaningless when absolute abundance is low. If you want to select samples that have high absolute abundances and compare between them, I think that could be done, but low abundance samples are fraught- potentially a solution could involve a statistical model that accounts for a mean-variance relationship.

c) Metrics for abundance should be RPKS. If you’d like to create ordination plots based on absolute abundance, again RPKS should be used.

Please reach out with any further questions, happy to discuss.

sarpiens commented 1 year ago

Thank you very much for your quick response and feedback Abigail!

Just to make sure I have understood everything correctly:

a) readlen parameter: I understand that it would be best to pre-filter reads below 75 bp, correct?

b) total marker coverage: Then I understand that this metric would be my column of interest and could be used to at least focus the study on presence-absence, with the EukDetect criteria of removing all taxa that have fewer than 4 reads that align to fewer than 2 marker genes. Although I am concerned about the "low abundance samples are fraught". Is there a read count threshold after quality control that you would recommend to filter these low abundance samples?

Thanks in advance. Best, Sam

sarpiens commented 1 year ago

Hi again with regard to the 75 bp, prefilter I have seen in a previous issue(#27) that "It's important you don't allow any alignments shorter than ~60 bp or so because shorter than this can misalign bacterial sequences to eukaryotic genes, so set the read length in the config file as 75 bp.". Then I wonder if I would be okey use a minimum filter lenght of 60 bp previous to using Eukdetect and then set read length in the config file as 75bp, since I have some older datasets that I would loose with a 75 bp minimum lenght pre-filter QC step?

allind commented 1 year ago

Hi, yes filtering out reads below 60 bp would be a good idea.

If what you want to study is presence/absence, you really only need to rely on whether something is reported by EukDetect at all. When I say low abundance samples are fraught, what I mean is that they're fraught for things like relative abundance comparisons between samples.

sarpiens commented 1 year ago

Thanks a lot for all your feedback Abigail!

Out of curiosity, how many reads do you consider that a sample should have in order to explore good enough these Eukaryote taxa in WGS metagenomic data (250 thousand, 500 thousand, 1 million, 5 million, or more)?

allind commented 1 year ago

That's a great question that's difficult to answer, because it depends on a lot of factors including the overall microbial diversity in the sample, the sample preparation method, and the abundance of the given eukaryote, among other things. This is generally true for all rarer taxa, not just eukaryotes. Some work on getting rare taxa in metagenome-assembled genomes from samples with varying levels of diversity has been done - check out Fig 3 and Fig 2G here: https://www.biorxiv.org/content/10.1101/2022.03.30.486478v2.full

sarpiens commented 1 year ago

Again Thanks a lot for all your feedback Abigail! Best, Sam

allind / EukDetect

Meta-analysis with EukDetect and some questions #38