biocore / greengenes2

Processing support for Greengenes2
11 stars 2 forks source link

Kraken2 database building using greengenes2 #11

Closed BrinthaVP closed 4 months ago

BrinthaVP commented 4 months ago

Hi,

I have extracted the V3V4 region of greengene2 sequences and used it to build the Kraken2 database. But when this database is used, none of the reads in the fastq files are classified by Kraken2. (but almost 98%-99% of the reads are assigned when build with the old Greengene database)

I also tried using the full-length sequences as well and facing same issue.

Best, Brintha

wasade commented 4 months ago

Hi @BrinthaVP,

Can you please provide the exact commands used to construct the database, including the steps taken to extract the V3V4, as well as the commands used to map short reads to the database? And could a few thousand of the sequences to map also be shared to assist with debugging?

I have not evaluated Kraken2 in this context so the above will be helpful to understand what may be going on. For 16S, we rely on Deblur in Qiita, where Deblur was designed specifically for amplicons. Greengenes2 is oriented at the moment for 515F-806R V4 fragments as that's the set we placed; non-v4 fragments though can be mapped using the non-v4-16s action though.

The short read metagenomic workflow we use in Qiita is Woltka which uses the SHOGUN parameter set with bowtie2. However, I would anticipate that is less sensitive for 16S.

Best, Daniel

BrinthaVP commented 4 months ago

Thank you @wasade.

Below are the commands used for building the database.

Step 1: Extraction of V3V4 region (using qiime) f_primer=CCTACGGGNBGCASCAG r_primer=GGACTACNVGGGTWTCTAAT qiime feature-classifier extract-reads --i-sequences 2022.10.backbone.full-length.fna.qza --p-f-primer $f_primer --p-r-primer $r_primer --p-min-length 100 --p-max-length 550 --o-reads greengenev3v4.qza --p-n-jobs 100

Step 2: Adding the fasta to Kraken library kraken2-build --db ../GREENgenes2/ -add-to-library greengenev3v4/d572057d-569f-4520-8271-8cb19a0cac5b/data/dna-sequences.fasta

Step 3: Building taxonomy (using the script from Kraken2) build_gg_taxonomy.pl seqid2taxid.txt

The seqid2taxid.txt file is nothing but the 2022.10.taxonomy.id.txt file

Step 4: Building database kraken2-build --build --db GREENgenes2/ --threads 100

All the above steps are successful.

Step 5: kraken2 --gzip-compressed --db ../GREENgenes2/ --threads 100 --paired S3_R1_P.fastq.gz S3_R2_P.fastq.gz --report S3.report > S3.kraken; Running the above command results in the output given below:

Loading database information... done. 299213 sequences (133.76 Mbp) processed in 4.801s (3739.3 Kseq/m, 1671.64 Mbp/m). 0 sequences classified (0.00%) 299213 sequences unclassified (100.00%)

Also attached a fastq file with 1000 reads from this sample. output.fastq.gz

Though it is oriented for V4, should not the same setup work even if the V3V4 regions are extracted from full length sequences.

wasade commented 4 months ago

Thanks! What happens if the extracted reads are mapped?On May 1, 2024, at 21:40, BrinthaVP @.***> wrote: Thank you @wasade. Below are the commands used for building the database. Step 1: Extraction of V3V4 region (using qiime) f_primer=CCTACGGGNBGCASCAG r_primer=GGACTACNVGGGTWTCTAAT qiime feature-classifier extract-reads --i-sequences 2022.10.backbone.full-length.fna.qza --p-f-primer $f_primer --p-r-primer $r_primer --p-min-length 100 --p-max-length 550 --o-reads greengenev3v4.qza --p-n-jobs 100 Step 2: Adding the fasta to Kraken library kraken2-build --db ../GREENgenes2/ -add-to-library greengenev3v4/d572057d-569f-4520-8271-8cb19a0cac5b/data/dna-sequences.fasta Step 3: Building taxonomy (using the script from Kraken2) build_gg_taxonomy.pl seqid2taxid.txt The seqid2taxid.txt file is nothing but the 2022.10.taxonomy.id.txt file Step 4: Building database kraken2-build --build --db GREENgenes2/ --threads 100 All the above steps are successful. Step 5: kraken2 --gzip-compressed --db ../GREENgenes2/ --threads 100 --paired S3_R1_P.fastq.gz S3_R2_P.fastq.gz --report S3.report > S3.kraken; Running the above command results in the output given below: Loading database information... done. 299213 sequences (133.76 Mbp) processed in 4.801s (3739.3 Kseq/m, 1671.64 Mbp/m). 0 sequences classified (0.00%) 299213 sequences unclassified (100.00%) Also attached a fastq file with 1000 reads from this sample. output.fastq.gz Though it is oriented for V4, should not the same setup work even if the V3V4 regions are extracted from full length sequences.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

BrinthaVP commented 4 months ago

Hi @wasade,

On giving some random reads as inputs in blast, most of them got assigned to some species.

So I retried to build the greengene2 database using KrakenUniq instead of Kraken2 using the below command. krakenuniq-build --db GREENgenes2/ --jellyfish-bin ~/kraken-uniq/jellyfish-install/bin/jellyfish --kmer-len 31 --threads 10

And then ran the classification command as below: krakenuniq --db ../GREENgenes2/ --thread 10 --paired S3_R1_P.fastq.gz S3_R2_P.fastq.gz --report S3.report > S3.kraken;

98.55% of the sequences are classified. So, it seems something is not working in the case of Kraken2 with respect to classification. I will look into it.

Thanks !!