About the 16S non-overlap workflow

mstagliamonte commented 4 years ago

Hi,

I read your article about MetaDEGalaxy and found it very interesting. I was looking at the non-overlap paired-end 16S workflow, and I would like some more details about it. Specifically, from what I understand, the clustering step is done using forward reads only. How about the taxonomy step? Is the paired-end information used in any step?

Thank you for your kind attention, Max Tagliamonte

mthang commented 4 years ago

Hi Max, Thank you for your interest in MetaDEGalaxy! You have asked a very good question about the 16S non-overlap workflow (#3 not #2). First of all, workflow 1 is designed to assess the user's paired-end input library before running either workflow 2 (overlap) or workflow 3 (non-overlap).

Use workflow 2 if the result from workflow 1 shows a high percentage of overlapping read pairs in the paired-end library. Use workflow 3 if the result from workflow 1 shows a low percentage of overlapping read pairs.
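As a rough illustration of this decision rule, here is a minimal Python sketch. The function name `choose_workflow` and the 0.5 cutoff are assumptions made for the example only; the thread does not state what counts as a "high" or "low" percentage, so the threshold should be set from your own workflow 1 report.

```python
def choose_workflow(merged_pairs: int, total_pairs: int, threshold: float = 0.5) -> int:
    """Return 2 (overlap workflow) or 3 (non-overlap workflow).

    merged_pairs / total_pairs is the fraction of read pairs that overlapped
    in the workflow 1 assessment. The default threshold is a placeholder,
    not a value taken from MetaDEGalaxy.
    """
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    fraction = merged_pairs / total_pairs
    return 2 if fraction >= threshold else 3


if __name__ == "__main__":
    print(choose_workflow(merged_pairs=9000, total_pairs=10000))  # mostly overlapping
    print(choose_workflow(merged_pairs=300, total_pairs=10000))   # mostly non-overlapping
```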

Back to your questions: workflow 3 (non-overlap) is designed for paired-end libraries that workflow 1 shows to be non-overlapping.

When workflow 3 is used, which is the case you mentioned in your message, only the forward reads are used for the rest of the workflow. In other words, the reverse reads are not considered.

However, a workaround is to pool the forward and reverse reads together and rerun workflow 3. That way, both forward and reverse reads are used in workflow 3.
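The pooling workaround could be sketched in Python roughly as follows. This is an assumption-laden example, not part of MetaDEGalaxy: `pool_fastq` is a hypothetical helper, it assumes plain four-line FASTQ records, and it appends `/1` or `/2` to each read name so that mates with identical names stay distinguishable after pooling.

```python
from typing import Iterable, Iterator


def pool_fastq(forward: Iterable[str], reverse: Iterable[str]) -> Iterator[str]:
    """Yield the lines of both FASTQ inputs as one pooled stream.

    Each header line (every 4th line, starting at the first) gets /1 or /2
    appended to the read name; sequence, '+' and quality lines are unchanged.
    """
    for tag, lines in (("/1", forward), ("/2", reverse)):
        for i, line in enumerate(lines):
            line = line.rstrip("\n")
            if i % 4 == 0:  # header line of a four-line FASTQ record
                name, _, rest = line.partition(" ")
                line = name + tag + (" " + rest if rest else "")
            yield line


if __name__ == "__main__":
    fwd = ["@read1 1:N:0", "ACGT", "+", "IIII"]
    rev = ["@read1 2:N:0", "TTGG", "+", "IIII"]
    for out_line in pool_fastq(fwd, rev):
        print(out_line)
```

In practice the two iterables would be open file handles for the forward and reverse FASTQ files, and the output would be written to a single file that is then given to workflow 3 as a single-end input.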

Hope I have answered your questions.

Best regards, Mike

mstagliamonte commented 4 years ago

Hi, Mike,

Thank you for clarifying the workflow. Unfortunately I am working with a library of short reads (hence non-overlapping), and I am looking at options to retain the paired-end information, so as to squeeze a better signal out of them. I do not know how common a problem this is (I was not involved in the library planning), but if I find something useful I will post more on this thread, if that might interest you.

Thank you for your kind attention, I will keep an eye on any updates for your pipeline. Best, Max

mthang commented 4 years ago

Hi Max, You are welcome! May I know the length of your short reads? I assume that your library is paired-end and non-overlapping, so are you working on 16S or shotgun metagenomics?

Perhaps you can consider performing de novo assembly on your non-overlapping library. Then you can follow my non-overlapping workflow and treat your newly assembled contigs as single-end reads for further analysis.

We did not include the shotgun assembly in the MetaDEGalaxy paper, but we have implemented it in the command-line version.

best, Mike

mstagliamonte commented 4 years ago

Hi, Mike,

Thank you for the follow-up, I appreciate it. I am working with the 16S V3-V4 region, but the read length was 150x2; after quality filtering that became 125 for the forward reads and 124 for the reverse, and that is the data I received. I already did the clustering for forward and reverse reads separately, and ended up with ~100 OTUs at most, judging by the rarefaction curves. This is much lower than I was expecting for gut microbiome data; my guess is that there just isn't enough diversity in such short reads. I am now working on using the paired-end info for taxonomy classification at least, to see what happens. I'm open to advice ;-)

Best, Max

mthang commented 4 years ago

Hi Max, You are on the right track in terms of data processing. There is one thing I need to confirm with you: did you join the paired-end reads to create longer contigs after trimming and filtering?

The joining process is required before clustering.

Best Mike

mthang commented 4 years ago

Hi Max, Apologies for bombarding you with another message! I have noticed there might be an issue with your paired-end read length for a standard 16S sequencing protocol. 150x2 can't cover the entire V3-V4 region. The length of the V3-V4 region of 16S is about 550 bp.

A good design for sequencing the V3-V4 region should be 300x2 instead of 150x2. Then the V3-V4 contigs can be built by overlapping the two ends (forward + reverse).

Based on the length of your paired-end reads, de novo assembly is the only solution, because overlapping (aka joining) the forward and reverse reads is not applicable to your dataset.
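The arithmetic behind this can be sketched in a few lines of Python. The helper `expected_overlap` is hypothetical, and the 550 bp amplicon length is simply the approximate figure quoted in this thread; the read lengths are the ones Max reported after filtering (125 and 124) and the suggested 300x2 design.

```python
def expected_overlap(fwd_len: int, rev_len: int, amplicon_len: int) -> int:
    """Return the number of bases covered by both reads of a pair.

    A positive value is the overlap available for joining; a negative value
    is the size of the gap in the middle of the amplicon that neither read covers.
    """
    return fwd_len + rev_len - amplicon_len


if __name__ == "__main__":
    # Reads as received after quality filtering: 125 + 124 = 249 bases in total.
    print(expected_overlap(125, 124, 550))  # -301, i.e. a 301 bp gap, so joining is impossible
    # Suggested design: 300x2 gives 600 bases in total.
    print(expected_overlap(300, 300, 550))  # 50, i.e. 50 bp of overlap available for joining
```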

Best, Mike

mstagliamonte commented 4 years ago

Hi, Mike,

No apologies needed, thank you for following this up. I am well aware of the fault in the design of the experiment; unfortunately I was not involved in it, and was just given the data to analyze.

I am not sure what you are suggesting regarding the de novo assembly. Maybe there is a misunderstanding: since this is a 16S project only, I do not actually have whole-genome shotgun reads.

Thanks again for your time and kind attention, Max

mthang commented 4 years ago

Hi Max, Not a problem! Although it is not a shotgun sequencing design, you can still perform de novo assembly on your existing dataset to "rescue" the data instead of throwing it away.

I am closing this issue on github.

Best, Mike

QFAB-Bioinformatics / jcu.microgvl.ansible.playbook

About the 16S non-overlap workflow #2