benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Leptospira species targeting secY gene (instead of 16S) using AmpSeq method? #1993

Open marwa38 opened 3 months ago

marwa38 commented 3 months ago

Hi team

Could you please let me know whether this FASTA file (screenshot below) will work in the dada2 pipeline in place of SILVA 138? Do you think it is compatible with assignTaxonomy()? The file was created by our lab for the known Leptospira species (currently 69).

We are not doing 16S microbiota profiling; we are targeting the secY gene. Some details are below.

Target organism: Leptospira spp. (currently 69 known species)
Amplicon sequencing method: AmpSeq

[Two screenshots attached]

We were advised to run dada2 by a former colleague who used it with this same gene and sequencing method.

Thanks in advance,
Marwa

benjjneb commented 3 months ago

No, that does not look like a format that works with assignTaxonomy. The reference format for assignTaxonomy is described here: https://benjjneb.github.io/dada2/training.html#formatting-custom-databases

khadijamajd commented 3 months ago

Hello team, thank you so much for your reply. Do you think this format will now work with assignTaxonomy() as described by DADA2? Is there anything that needs to be changed?

Thanks in advance

[Image: RefSEq]

benjjneb commented 3 months ago

The separators between taxonomic levels need to be semicolons, not underscores.
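
As a rough illustration of that fix, the Python sketch below swaps underscores for semicolons in a single header line. The function name and the example header are hypothetical (the screenshot contents are not available here), and the approach assumes no taxon name itself contains an underscore.

```python
def underscores_to_semicolons(header: str) -> str:
    """Swap underscore level separators for semicolons in one FASTA header line.

    Assumption: underscores appear only as separators between taxonomic
    levels, never inside a taxon name.
    """
    if not header.startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    return header.replace("_", ";")


# Hypothetical header; the ranks shown are illustrative only.
example = ">Bacteria_Spirochaetota_Spirochaetia_Leptospirales_Leptospiraceae_Leptospira"
print(underscores_to_semicolons(example))
```

If any taxon names do contain underscores, a plain replace like this would split them, so the headers would need to be rebuilt from the individual rank values instead.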

khadijamajd commented 3 months ago

Hello team,

Thank you for your prompt response. I think this should be okay now.

[Image: RefSEq]

Do I also have to use semicolons instead of the space between the ID and the genus and species names in the assignSpecies format, or is the space fine? Kindly assist.

Many thanks

[Image: AssignSpecies]

benjjneb commented 3 months ago

The addSpecies format uses spaces, not semicolons.

You should probably clean up the double >> symbol at the start of your fasta ID lines to >. And I don't remember if the line-ending semicolon is required for the assignTaxonomy format, but given that's in the official description of the format I'd probably add it in.
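
A minimal sketch of those two clean-ups in Python, in case it helps: one helper reports the problems mentioned above for a header line, and another rewrites the line with a single leading > and a trailing semicolon. Both function names are hypothetical, and the checks cover only the points raised in this thread, not the full format description.

```python
def check_taxonomy_header(line: str) -> list:
    """Return a list of problems found in one assignTaxonomy-style header line."""
    problems = []
    if not line.startswith(">"):
        problems.append("missing leading '>'")
    elif line.startswith(">>"):
        problems.append("more than one leading '>'")
    body = line.strip().lstrip(">").strip()
    if ";" not in body:
        problems.append("no semicolon separators between levels")
    if not body.endswith(";"):
        problems.append("no trailing semicolon")
    return problems


def clean_taxonomy_header(line: str) -> str:
    """Rewrite a header with exactly one leading '>' and one trailing ';'."""
    body = line.strip().lstrip(">").strip().rstrip(";").strip()
    return ">" + body + ";"


# Hypothetical header with both issues discussed above.
raw = ">>Bacteria;Spirochaetota;Leptospira"
print(check_taxonomy_header(raw))
print(clean_taxonomy_header(raw))
```

Running the check again on the cleaned header should return an empty list.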

khadijamajd commented 3 months ago

Thank you so much, this is very helpful :)

marwa38 commented 3 months ago

Hi again,

Can't we add a level 7 (i.e. species) in the assignTaxonomy format, keeping the semicolon after it, instead of using assignSpecies? Like this:

[Image]

What is meant by the ID in the assignSpecies() format? Is that the accession number? Is the ID optional or mandatory? See below (we just added the genus and species names after "Leptospira", treating that as the ID):

[Image]

Thanks again. Marwa

benjjneb commented 3 months ago

Can't we add Level 7 in the assignTaxonomy? i.e. species and leave semicolons afterwards too? instead of using assignSpecies?

Yes.

What is meant by the ID in the assignSpecies() format? is that the accession number? is the ID optional or mandatory to be added? like below (we just added Genus and species name after adding Leptospira considering that is the ID?)

Yes, it is usually something like an accession number. It is "mandatory" in the sense that it has to be included in the formatted ID line, but there isn't a requirement that it is real. So your workaround of just putting "Leptospira" in the ID position is fine.
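
For illustration, here is a small Python sketch of the space-separated species header discussed above (ID, then genus, then species), using the genus name as a placeholder ID when no accession number is available. The function names are hypothetical and the species shown is only an example.

```python
def species_header(genus: str, species: str, seq_id: str = "") -> str:
    """Build a space-separated '>ID Genus species' header line.

    If no ID is given, the genus name is reused as a placeholder ID,
    mirroring the workaround described in this thread.
    """
    seq_id = seq_id or genus
    return f">{seq_id} {genus} {species}"


def parse_species_header(header: str) -> tuple:
    """Split a '>ID Genus species' header back into its three fields."""
    parts = header.lstrip(">").split(maxsplit=2)
    if len(parts) != 3:
        raise ValueError("expected three space-separated fields: ID, genus, species")
    return tuple(parts)


print(species_header("Leptospira", "interrogans"))
print(parse_species_header(">Leptospira Leptospira interrogans"))
```

With a real accession number, it would simply be passed as `seq_id` in place of the placeholder.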

khadijamajd commented 2 months ago

Hi @benjjneb, I am following up on your previous advice regarding DADA2. I formatted the FASTA file containing my amplicon sequences according to the assignTaxonomy() recommendations, then used the nf-core/ampliseq pipeline to identify Leptospira at the genus level. However, I encountered an issue with repeated taxa IDs assigned to different Leptospira species (please see attached). I'd now like to use the DADA2 pipeline (https://benjjneb.github.io/dada2/tutorial.html) specifically for the denoising, merging, and chimera control steps to improve data quality before subsequent analyses. Unfortunately, I lack experience with the sequence preparation steps the pipeline requires, including demultiplexing, adapter trimming, and removing non-biological adapter sequences from the reads.

Could you please advise me on the necessary tools and steps to prepare my current FASTA file for processing with the DADA2 pipeline and achieve the desired denoising, merging, and chimera control steps?

Thank you in advance.
Khadija

[Attachment: ASV_taxa_species.csv]

[Image: ASV_tax_Lepto]

benjjneb commented 2 months ago

However, I encountered an issue with repeated Taxa IDs assigned to different Leptospira species, please see attached.

I'm not sure what this means.

I'd now like to use the DADA2 pipeline (https://benjjneb.github.io/dada2/tutorial.html) specifically for denoising, merging, and chimera control steps to improve the data quality before subsequent analyses. Unfortunately, I lack experience with the necessary sequence preparation steps required by the pipeline. Including, demultiplexing, adapter trimming and removing non-biological adapter sequences from the reads.

Could you please advise me on the necessary tools and steps to prepare my current FASTA file for processing with the DADA2 pipeline and achieve the desired denoising, merging, and chimera control steps?

The DADA2 tutorial that you linked is the place to start for understanding how to use DADA2 on your sequencing data.

DADA2 is not intended for use with fasta data, but rather with the fastq data (that also has quality scores) that you get from amplicon sequencing measurements.
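
As a quick way to tell the two apart, the Python sketch below looks at the first non-blank line of a file: FASTA records start with > and FASTQ records start with @ (and carry a quality line per read). The function name is hypothetical, and this is only a first-character heuristic, not a full format validator.

```python
def guess_format(lines) -> str:
    """Guess 'fasta' or 'fastq' from the first non-blank line of an iterable of lines."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith("@"):
            return "fastq"
        if stripped.startswith(">"):
            return "fasta"
        return "unknown"
    return "empty"


# Tiny in-memory examples; real input would be an open file handle.
fasta_lines = [">seq1\n", "ACGTACGT\n"]
fastq_lines = ["@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n"]
print(guess_format(fasta_lines))
print(guess_format(fastq_lines))
```

An open text file handle can be passed directly as `lines`; gzip-compressed files would need to be opened in text mode first, for example with `gzip.open(path, "rt")`.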