allind / EukDetect

MIT License

Meta-analysis with EukDetect and some questions #38

Open sarpiens opened 1 year ago

sarpiens commented 1 year ago

Hello, thanks for developing this interesting tool!

I'm preparing a meta-analysis, and I wanted to explore the eukaryotic component of the microbiome with EukDetect. However, after reviewing the documentation and the forum I have some questions:

a) readlen parameter: I'm not sure whether this is an internal filter (if it is, I haven't seen it indicated in the tool diagram), or whether we should filter out reads shorter than 75 bp in the QC step before using the tool.

b) RPKS and EukFrac:

c) Abundance Heatmaps and PCAs:

Thanks in advance for any help that could be provided! Best, Sam

allind commented 1 year ago

Hi Sam, thanks for reaching out.

I think that there are some interesting questions concerning the eukaryotic fraction of the microbiome, and am excited to see what you report, but ultimately a lot of the questions that you’re asking about abundance are getting at a few fundamental problems with surveying the eukaryome using shotgun sequencing. In almost all human microbiome datasets I’ve seen, eukaryotes are very low in abundance. When you are looking at something that comprises so very little of a dataset, the relative abundance of these organisms (and also the chance of observing them at all) is driven much more by depth of sequencing and by chance variation than it is by real biological differences. This is not the case when you have a very high absolute abundance of an organism, as in the case of most meta-analyses of bacterial members of the microbiome, but it is absolutely the case for lower-abundance eukaryotes. I think many interesting questions can be asked of the eukaryome that could be included in your meta-analysis that do not require including relative abundance estimations. (To be clear: think of this work as focusing only on the rarest, least abundant members of a community.)

To answer your questions:

a) This isn’t really an internal filter. If reads are shorter than 75 bp, EukDetect run as a Python package will exit with an error code. If you run the Snakemake pipeline by itself, it will not throw an error and will run to completion. The purpose of the check is to make sure the user knows the potential risks associated with aligning very short reads: it increases the possibility of mis-assigning bacterial reads as eukaryotic. It sounds like you’ll be using the pipeline a lot, so I strongly recommend using the Snakemake pipeline directly as you have more control that way.

b)

c) Metrics for abundance should be RPKS. If you’d like to create ordination plots based on absolute abundance, again RPKS should be used.

Please reach out with any further questions, happy to discuss.

sarpiens commented 1 year ago

Thank you very much for your quick response and feedback Abigail!

Just to make sure I have understood everything correctly:

a) readlen parameter: I understand that it would be best to pre-filter reads below 75 bp, correct?

b) total marker coverage: Then I understand that this metric would be my column of interest and could be used to at least focus the study on presence-absence, with the EukDetect criteria of removing all taxa that have fewer than 4 reads that align to fewer than 2 marker genes. However, I am concerned about the point that "low abundance samples are fraught". Is there a read count threshold after quality control that you would recommend for filtering out these low abundance samples?

Thanks in advance. Best, Sam

sarpiens commented 1 year ago

Hi again. With regard to the 75 bp pre-filter, I have seen in a previous issue (#27) that "It's important you don't allow any alignments shorter than ~60 bp or so because shorter than this can misalign bacterial sequences to eukaryotic genes, so set the read length in the config file as 75 bp.". So I wonder whether it would be okay to use a minimum length filter of 60 bp before running EukDetect and then set the read length in the config file to 75 bp, since I have some older datasets that I would lose with a 75 bp minimum length filter in the QC step?

allind commented 1 year ago

Hi, yes filtering out reads below 60 bp would be a good idea.

If what you want to study is presence/absence, you really only need to rely on whether something is reported by EukDetect at all. When I say low abundance samples are fraught, what I mean is that they're fraught for things like relative abundance comparisons between samples.
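A presence/absence analysis along these lines could start from something like the sketch below. It assumes you have already collected, for each sample, the set of taxa that EukDetect reported after its own filtering; the `detected` dictionary, the sample IDs, and the function name `presence_absence` are all invented for illustration and are not EukDetect output or API.

```python
from typing import Dict, List, Set, Tuple


def presence_absence(
    detected: Dict[str, Set[str]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return sorted taxa and, per sample, a 0/1 row in that taxon order."""
    taxa = sorted(set().union(*detected.values()))
    matrix = {
        sample: [int(taxon in found) for taxon in taxa]
        for sample, found in detected.items()
    }
    return taxa, matrix


# Hypothetical per-sample calls (made-up data, for illustration only).
detected = {
    "S1": {"Blastocystis sp.", "Saccharomyces cerevisiae"},
    "S2": {"Saccharomyces cerevisiae"},
    "S3": set(),  # nothing reported in this sample
}

taxa, matrix = presence_absence(detected)

# Prevalence: fraction of samples in which each taxon was reported.
prevalence = {
    taxon: sum(row[i] for row in matrix.values()) / len(matrix)
    for i, taxon in enumerate(taxa)
}
print(taxa)
print(matrix)
print(prevalence)
```

Samples with no reported eukaryotes are kept as all-zero rows, so they still count in the prevalence denominator.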

sarpiens commented 1 year ago

Thanks a lot for all your feedback, Abigail!

Out of curiosity, how many reads do you think a sample should have in order to explore these eukaryotic taxa well enough in WGS metagenomic data (250 thousand, 500 thousand, 1 million, 5 million, or more)?

allind commented 1 year ago

That's a great question that's difficult to answer, because it depends on a lot of factors including the overall microbial diversity in the sample, the sample preparation method, and the abundance of the given eukaryote, among other things. This is generally true for all rarer taxa, not just eukaryotes. Some work on getting rare taxa in metagenome-assembled genomes from samples with varying levels of diversity has been done - check out Fig 3 and Fig 2G here: https://www.biorxiv.org/content/10.1101/2022.03.30.486478v2.full
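As a rough illustration of why depth matters so much for rare taxa, here is a toy calculation. It is a deliberately simplified model, not how EukDetect works: the number of reads hitting a taxon's marker genes is treated as Poisson-distributed, the fraction of reads that come from those marker genes (`MARKER_READ_FRACTION`) is a made-up value, and only the 4-read threshold mentioned earlier in this thread is modeled (the requirement on the number of marker genes is ignored).

```python
import math


def prob_at_least(k: int, expected_reads: float) -> float:
    """P(X >= k) for X ~ Poisson(expected_reads)."""
    below = sum(
        math.exp(-expected_reads) * expected_reads ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


MIN_READS = 4                 # read threshold discussed above
MARKER_READ_FRACTION = 1e-6   # hypothetical: 1 in a million reads hits a marker

depths = [250_000, 500_000, 1_000_000, 5_000_000]
detection = {
    depth: prob_at_least(MIN_READS, depth * MARKER_READ_FRACTION)
    for depth in depths
}

for depth, p in detection.items():
    print(f"{depth:>9,d} reads: P(>= {MIN_READS} marker reads) = {p:.3f}")
```

Under these assumptions the chance of passing the read threshold rises steeply with depth, which is the sense in which detection of a rare organism reflects sequencing effort as much as biology. The real answer still depends on the factors listed above, so the numbers should not be read as a recommended cutoff.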

sarpiens commented 1 year ago

Again, thanks a lot for all your feedback, Abigail! Best, Sam