Open andreas-wilm opened 8 years ago
Do you need help on this?
Sure :) Are you sure you have the time?
The issue is this: for classification we map against a version of Greengenes that was pre-clustered at 99% identity. To make things comparable, though, the clustering should have happened after primer trimming. So we would need to primer-trim Greengenes (discarding the sequences that don't match the primers?) and then cluster at 99%.
Andreas
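A minimal Python sketch of the primer-trimming step described above, assuming exact (mismatch-free) primer matches and sequences held as single strings. The function names are hypothetical, and the primers in the usage note are toy values, not the pipeline's actual primers. References in which either primer is not found come back as None and would be discarded.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate bases.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a sequence, keeping IUPAC codes valid."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def trim_to_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the two primers, or None if either is missing.

    Assumption: primers must match exactly; a real run would probably
    tolerate a few mismatches.
    """
    seq = seq.upper().replace("U", "T")
    pattern = (
        primer_regex(fwd_primer)
        + "(.+?)"
        + primer_regex(reverse_complement(rev_primer))
    )
    match = re.search(pattern, seq)
    return match.group(1) if match else None
```

With the toy primers "ACGYAC" and "TTRCAG", trim_to_amplicon("TTTACGTACGGGCCCAAACTGCAATT", "ACGYAC", "TTRCAG") returns "GGGCCCAAA". The surviving trimmed sequences would then go into the 99% clustering step.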
I have done the trimming and clustering for SILVA in fact. If you want to stick to GG, I can help you on this.
Cool! Yes please. With vsearch or uclust? Input would be /mnt/genomeDB/misc/greengenes.secondgenome.com/downloads/13_5/gg_13_5.fasta
I'm unsure how exactly to assign a taxonomy to each cluster though. The existing OTU clustering came with the assignment. Any idea?
Andreas
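One possible answer to the taxonomy question, sketched in Python: give each cluster the deepest lineage on which all of its members agree (a lowest-common-ancestor style consensus). This is an assumption for illustration, not something the thread settled on. It also assumes the clustering was run with vsearch and its --uc output, and that per-sequence lineages are available in the Greengenes "k__...; p__...; ...; s__..." form. All function names are hypothetical.

```python
from collections import defaultdict


def clusters_from_uc(uc_lines):
    """Map each centroid label to its member labels from UC-format lines.

    In UC output, 'S' records are centroids and 'H' records are hits;
    column 9 is the query label and column 10 the target (centroid) label.
    """
    clusters = defaultdict(list)
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue
        record, query, target = fields[0], fields[8], fields[9]
        if record == "S":
            clusters[query].append(query)
        elif record == "H":
            clusters[target].append(query)
    return dict(clusters)


def parse_lineage(lineage):
    """Split a lineage into ranks, stopping at the first unnamed rank (e.g. 's__')."""
    ranks = []
    for rank in lineage.split(";"):
        rank = rank.strip()
        if len(rank) <= 3:  # a bare prefix such as "s__" carries no name
            break
        ranks.append(rank)
    return ranks


def consensus_lineage(lineages):
    """Return the longest shared lineage prefix across all given lineages."""
    parsed = [parse_lineage(lineage) for lineage in lineages]
    consensus = []
    for ranks in zip(*parsed):
        if len(set(ranks)) != 1:
            break
        consensus.append(ranks[0])
    return "; ".join(consensus)


def assign_cluster_taxonomy(clusters, taxonomy):
    """Assign every centroid the consensus lineage of its members."""
    return {
        centroid: consensus_lineage([taxonomy[member] for member in members])
        for centroid, members in clusters.items()
    }
```

A cluster whose members disagree at species level would end up with a genus-level label, which keeps the ambiguity visible instead of picking one member's species arbitrarily.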
That is tricky then... Previously I kept only sequences with a species-level assignment.
Chenhao.
Hey guys, thanks for chasing this. Chenhao, do I understand correctly that you retained only the sequences in gg_13_5.fasta with a species-level assignment? The 99_otu_taxonomy.txt file contains 203,452 entries, of which only 16,869 are assigned to a species, covering 639 unique species in total. If your suggestion is to keep only those 16,869, cutting out so many entries seems drastic. The command I used to count entries with a species-level designation:
grep -c 's__[A-Za-z0-9]' 99_otu_taxonomy.txt
Cheers, Paola
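The grep above counts entries but not distinct species. A rough Python sketch that reports both follows, assuming the taxonomy file is tab-separated (OTU id, then lineage) and that a species is identified by its genus plus species name. The thread does not say how the unique-species figure was derived, so that keying is an assumption, and the function name is hypothetical. It treats any non-empty s__ value as a species assignment, which is slightly looser than the grep pattern.

```python
def species_summary(taxonomy_lines):
    """Return (entries with a species name, number of distinct genus/species pairs)."""
    entries_with_species = 0
    species = set()
    for line in taxonomy_lines:
        line = line.strip()
        if not line:
            continue
        _otu_id, lineage = line.split("\t", 1)
        # Build {"k": "Bacteria", ..., "s": "epidermidis"} from the lineage.
        ranks = dict(
            rank.strip().split("__", 1)
            for rank in lineage.split(";")
            if "__" in rank
        )
        if ranks.get("s"):
            entries_with_species += 1
            species.add((ranks.get("g", ""), ranks["s"]))
    return entries_with_species, len(species)
```

It can be run on the real file with something like `with open("99_otu_taxonomy.txt") as handle: print(species_summary(handle))`.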
Let's not do that. This will just introduce a bias. I'm happy to live with some ambiguity instead.
The classification database (99% OTU) should have been trimmed before clustering instead of using the preclustered database. Pointed out by Christophe LAY