benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
469 stars 142 forks source link

Potential contributed reference database #622

Closed rschaeffer closed 5 years ago

rschaeffer commented 5 years ago

Hello Benjamin,

I've been attempting to use DADA2 for processing rbcL sequences for identifying floral resources used by bees. Everything so far has worked swimmingly. I'm at the stage however where I'd like to assign taxonomic info and recognize this may be a little bit of a challenge. A reference database exists (https://figshare.com/collections/rbcL_reference_library/3466311/1), populated with rbcL sequences for ~30K plant species, however it's not properly formatted for use with DADA2. I'm aware that coding is available in the package documentation to potentially address this, but had a couple questions/concerns. First, if I were to attempt to use such a custom db with the assignTaxonomy function, I'd need to populate it with information for all taxonomic levels correct? Or could I use taxLevels and just set it to Genus and Species, since that is all that is provided in the existing db sequence/accession info? Thanks for any tips/help that you can provide on this.

benjjneb commented 5 years ago

if I were to attempt to use such a custom db with the assignTaxonomy function, I'd need to populate it with information for all taxonomic levels correct? Or could I use taxLevels and just set it to Genus and Species, since that is all that is provided in the existing db sequence/accession info?

As long as you are consistent, you can use whatever taxonomic levels you want in the reference database. By consistent, I mean that in the ID line of the dada2-formatted fasta that defines the taxonomy associated to each reference sequence, the ordering is the same across all entries:

> Level1;Level2;Level3
ACGTA....

That is, you can't have some entries defined starting at Kingdom, and others starting at Genus, they all need to start at the same level.

Also, see a bit of documentation on formatting custom databases from our website.

rschaeffer commented 5 years ago

Thanks Benjamin. I was able to reformat the database and could successfully use it. However, I have unfortunately run into another issue. Perhaps this has more to do with the particular marker (rbcL) and/or taxonomy classifier? Nearly all sequences assign to either the first or second taxon in the custom database (Pinus spp), but I know those aren't correct, as when I BLAST them individually, I get taxa that I would expect given the nature of the samples and where they were collected (Solanum, Trifolium, etc). Any thoughts? Thanks again for the help.

benjjneb commented 5 years ago

Could you provide a "minimal example" that I could run on my end to get an idea of what is going on?

For example, your formatted database (or a smaller verison of it w/ say just the first X entries) and a few sequences to classify that you know aren't Pinus spp?

rschaeffer commented 5 years ago

Yes of course, and thank you for your help! The two files are attached. Thanks again!

Cheers, Robert

On Sat, Dec 8, 2018 at 9:14 AM Benjamin Callahan notifications@github.com wrote:

Could you provide a "minimal example" that I could run on my end to get an idea of what is going on?

For example, your formatted database (or a smaller verison of it w/ say just the first X entries) and a few sequences to classify that you know aren't Pinus spp?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/benjjneb/dada2/issues/622#issuecomment-445470268, or mute the thread https://github.com/notifications/unsubscribe-auth/AH22zFNsdEe5kbhfe1RFpQvFt-yuO64Fks5u2-VKgaJpZM4YzOie .

benjjneb commented 5 years ago

The files might be too big to attache to GitHub comments. You can email them to me at benjamin DOT j DOT callahan AT gmail DOT com, or share some link to them if they are also too big for gmail.

rschaeffer commented 5 years ago

Now they're attached. Thanks again for taking a look.

rbcL.zip

benjjneb commented 5 years ago

OK, I took a look at your files. I see that the sequences come from a variety of plants. However, I think I will need a full version of your formatted database to understand what is going on. The 5-seq version you gave me has only Pinus species in it, so any sequence classified against the database will be classified as Pinus because there are no other groups or outgroups.

Can you share the full rbcL training fasta you created?

rschaeffer commented 5 years ago

Thanks for taking a peek, as well for catching my silly error. I realize now that in my original coding to reformat the sequence headers, I ended up creating an output file that only changed them for the head of the original (hence there only being a handful of Pinus sequences). Makes sense now why all of my sequences were classified as Pinus! Now that I've fixed that though, I've run into another formatting error that I'm not sure how to address.

> taxa<-assignTaxonomy(seqs,"dada_rbcL.fasta", multithread=TRUE)
Error in C_assign_taxonomy2(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int,  : 
  Invalid map from references to genus.
In addition: Warning message:
In matrix(unlist(strsplit(genus.unq, ";")), ncol = td, byrow = TRUE) :
  data length [266365] is not a sub-multiple or multiple of the number of rows [33296]

The only thing that I can think of is that there are multiple entries (with different sequences) for some taxa. I've attached the original, as well as the amended file. In the amended file, I also edited sequences with leading/trailing Ns, as well as other non-AGTC characters. Any thoughts? Thanks again!

rbcL.zip

benjjneb commented 5 years ago

Typically these sort of errors are caused by a misformatting somewhere. The first one I found in this file is in line 1521:

>Level1;Level2;Level3;Level4;Level5;"Ficus;variegata_Blume;_1825

The quote and the extra "level" are breaking the code (i.e. the _1825 bit) because they violate the expectation of the same number of levels in each id line.

Here is some code to identify all the ids w/ a different number of levels:

sq <- getSequences("dada_rbcL.fasta")
ids <- names(sq)
lens <- sapply(strsplit(ids, ";"), length)
table(lens)
which(lens != 7)

I think if you rectify those ids that it should work.

TseWue commented 5 years ago

Dear Benjamin,

I also created my own database and tried to apply it to the dada2 pipeline (great work by the way!). I considered all the formatting and altough the database could be read from the assignTaxonomy command, I got the following error message:

Error in C_assign_taxonomy(seqs, rc(seqs), refs, ref.to.genus, tax.mat.int, : Expecting a string vector: [type=NULL; required=STRSXP].

After trying alternatives from within R and googling around for further solutions I found the following page http://adv-r.had.co.nz/C-interface.html where I learnt that STRSXP actually originates from C data types. Does that mean I have to recreate C-objects within R or do you know a way to work around that? I mainly am a pythonista, so sorry, I'm no expert on R, let alone C programming language...

Thank you for your advice! Best, Tse

benjjneb commented 5 years ago

You definitely don't need to recreate any C-objects to use the package, or a custom database.

Most likely there is some formatting issue that is leading to this error. Do you have an non-ACGT characters in your DNA strings? Any blank lines, or malformated ID lines?

If I were you I'd start by trying to create a minimal example, perhaps by using the first 10 lines (or whatever) of your reference fasta file to see if it still causes this error, and expanding until it does, then use that to identify what lines might be causing this problem.

rschaeffer commented 5 years ago

Ben - That did the trick! It was just that one misformatted ref sequence repeated a handful of times throughout. Thanks again for the tips.

TseWue commented 5 years ago

I haven't found my bug yet,.. But a general question, do all the sequences need to have the exact same length in bp?

benjjneb commented 5 years ago

I haven't found my bug yet,.. But a general question, do all the sequences need to have the exact same length in bp?

No

TseWue commented 5 years ago

I just tried to run it again with the in the tutorial suggested silva database but I get the same error message.

benjjneb commented 5 years ago

@TseWue Can you open a new issue, that suggests this isn't related to reference db formats.