CSB5 / GERMS_16S_pipeline

Pipeline for Illumina shotgun sequencing of 16S rRNA amplicon sequences

Trim GG before clustering #1

Open andreas-wilm opened 8 years ago

andreas-wilm commented 8 years ago

The classification database (99% OTUs) should have been primer-trimmed before clustering; instead we used the preclustered database. Pointed out by Christophe LAY.

lch14forever commented 8 years ago

Do you need help on this?

andreas-wilm commented 8 years ago

Sure :) Are you sure you have the time?

The issue is this: for classification we map against a version of Greengenes that is pre-clustered at 99% identity. To make things comparable, though, the clustering should have happened after primer trimming. So we would need to primer-trim Greengenes (discarding the sequences that don't match the primers?) and then cluster at 99%.

Andreas
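To make the trimming step concrete, here is a minimal sketch in Python. It assumes exact matching of (possibly degenerate) primers and simply drops any sequence in which either primer is not found. The function names are hypothetical, the primers passed in the usage note are toy placeholders rather than the pipeline's actual primers, and a real run would more likely use a dedicated trimming tool that tolerates mismatches.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate bases.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse-complement a sequence, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex fragment (exact match only)."""
    return "".join(IUPAC[base] for base in primer.upper())


def trim_to_amplicon(seq, fwd, rev):
    """Return the region between the forward primer and the reverse
    complement of the reverse primer, or None if either is missing."""
    pattern = re.compile(
        primer_regex(fwd) + "(.+?)" + primer_regex(revcomp(rev))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None


def trim_database(records, fwd, rev):
    """Trim a dict of {id: sequence}; sequences without both primers are dropped."""
    trimmed = {}
    for seq_id, seq in records.items():
        amplicon = trim_to_amplicon(seq, fwd, rev)
        if amplicon is not None:
            trimmed[seq_id] = amplicon
    return trimmed
```

Calling `trim_database(records, "ACGM", "TTYG")` on a dict of reference sequences would keep only the entries containing both toy primers; the trimmed sequences would then go into the 99% clustering step.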

lch14forever commented 8 years ago

In fact, I have already done the trimming and clustering for SILVA. If you want to stick with GG, I can help you with this.

andreas-wilm commented 8 years ago

Cool! Yes please. With vsearch or uclust? Input would be

/mnt/genomeDB/misc/greengenes.secondgenome.com/downloads/13_5/gg_13_5.fasta

I'm unsure how exactly to assign a taxonomy to each cluster, though. The existing OTU clustering came with the assignment. Any idea?

Andreas
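One possible answer to the taxonomy question, offered only as a sketch and not as something settled in this thread: give each new cluster the deepest lineage that all of its member sequences agree on (a lowest-common-ancestor style consensus). The function names below are hypothetical, and the code assumes Greengenes-style lineage strings such as `k__Bacteria; p__Firmicutes; ...; s__reuteri`, where an unassigned rank is written as a bare prefix like `s__`.

```python
def parse_lineage(taxonomy):
    """Split a Greengenes-style lineage string into a list of ranks."""
    return [rank.strip() for rank in taxonomy.split(";")]


def consensus_lineage(lineages):
    """Return the longest leading run of ranks shared by every member.

    The walk stops at the first rank where members disagree, or where the
    rank is unassigned (a bare prefix such as 's__').
    """
    parsed = [parse_lineage(t) for t in lineages]
    consensus = []
    for ranks in zip(*parsed):
        first = ranks[0]
        if len(set(ranks)) != 1 or first.endswith("__"):
            break
        consensus.append(first)
    return "; ".join(consensus)


def assign_cluster_taxonomy(clusters, taxonomy):
    """Map each cluster id to the consensus lineage of its members.

    clusters: {cluster_id: [sequence_id, ...]}
    taxonomy: {sequence_id: lineage string}
    """
    return {
        cluster_id: consensus_lineage([taxonomy[m] for m in members])
        for cluster_id, members in clusters.items()
    }
```

With this scheme a cluster whose members differ only at species level ends up with a genus-level label, so ambiguity is kept rather than forced into a single species.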

lch14forever commented 8 years ago

That is tricky then... Previously I kept only sequences with a species-level assignment.

Chenhao.


paolaflorez commented 8 years ago

Hey guys, thanks for chasing this. Chenhao, do I understand correctly that you retained only the sequences in gg_13_5.fasta that have a species-level assignment? The 99_otu_taxonomy.txt file contains 203,452 entries, and only 16,869 of these can be assigned to one species. In total we have 639 unique species in there. If your suggestion is to keep only those 16,869, cutting out so many entries seems drastic.

Command to count the entries that have species-level designations:

grep -c 's__[A-Za-z0-9]' 99_otu_taxonomy.txt

Cheers, Paola
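For anyone who wants the same counts from a script, here is a rough Python equivalent of the grep above. It assumes each line holds an ID, a tab, and a Greengenes-style lineage. How the 639 figure was derived is not stated in the thread, so counting distinct genus/species pairs here is an assumption, and `species_stats` is a hypothetical helper name.

```python
import re

# Same pattern as the grep: 's__' followed by at least one name character.
SPECIES = re.compile(r"s__[A-Za-z0-9]")


def species_stats(lines):
    """Return (total entries, entries with a species, distinct genus/species pairs)."""
    total = 0
    with_species = 0
    unique = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        if SPECIES.search(line):
            with_species += 1
            # Drop the leading ID column, then split the lineage into ranks.
            ranks = [r.strip() for r in line.split("\t", 1)[-1].split(";")]
            unique.add(tuple(r for r in ranks if r.startswith(("g__", "s__"))))
    return total, with_species, len(unique)


# Usage, assuming the taxonomy file is in the working directory:
# with open("99_otu_taxonomy.txt") as handle:
#     print(species_stats(handle))
```

With the numbers quoted above, 16,869 of 203,452 entries is roughly 8%, which is the share that would survive a species-only filter.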

andreas-wilm commented 8 years ago

Let's not do that. This will just introduce a bias. I'm happy to live with some ambiguity instead.