Any equivalent of exclude_seqs_by_blast?

benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution

http://benjjneb.github.io/dada2/

GNU Lesser General Public License v3.0

471 stars 143 forks source link

Any equivalent of exclude_seqs_by_blast? #360

Closed nizumakun closed 6 years ago

nizumakun commented 7 years ago

Hello I am really new to programming languages and using Dada2 is just my experience number two next to QIIME. Maybe my questions would sound little novice, but I would appreciate any kind of help. So I used Dada2 to run an analsysis on environmental samples where 16s gene was being detected. I followed the workflow as in the tutorial and used the proposed "./rdp_train_set_16.fa.gz" as a reference to compare against. Later on I realized that "./rdp_train_set_16.fa.gz" doesn't include a genius that I was specifically looking for. I tried to add few sequences manually to the same document but the output wasn't any different. When I tried my customized reference database with the target genius only the result came out even worse after assigning taxonomy with many N/As. My main questions would be: 1/ Is there a dada2 tool similar to the qiime script: exclude_seqs_by_blast? so I can keep only sequences for my target organism 2/ Is there a minimum number of sequences to put in a customized database in order for it to work properly? especially in this context where I want to pull specific sequences out of a whole microbial pool.

I know Im asking something that could be obvious yet specific. But I appreciate any help on this matter.

Thanks

benjjneb commented 7 years ago

As I understand it, you have a 16S dataset of a community, and you are failing to find some particular genus in the results of assignTaxonomy with the RDP trainset?

Can you clarify what your goal here is? You only care about sequences from that one genus? Which is? Do you have a suitable set of reference sequences for that genus? Is there a further purpose once those sequences are identified?

Depending on what you are trying to do there are probably a couple ways to do it. Also, you may want to just try the Silva database as well, in case it has your genus of interest.

nizumakun commented 7 years ago

Hello Dr Callahan

Thanks for helping. Actually yes, I have a 16s dataset of a community. In the result of assignTaxonomy I failed to find the genius Frankia (that is, the only genius I would care about) which I was supposed to pull out of the whole community dataset, try to hypothesize which are the most abundant Frankia sequences per sample, then extract all unique sequences per sample into a fasta file for further alignments and phylogenetic tree construction. Would there be a way to extract these specific sequences? Also, when I included few reference Frankia sequences into the RDP trainset, reads were still not assigned to the Frankia genius (Even though when I manually blasted some reads they turned out to be actual Frankia, thus making sure they are actually there). So I wasn't so sure if there was a minimum number of required reference sequences in order for the program to take them into consideration. E.g: Is a set of 100 reference sequences enough or do I need more. Especially that I am going to re run the analysis using a genius specific gene, for which I would need to manually prepare a set of reference sequences to compare to.

Thanks a lot

benjjneb commented 7 years ago

I'd recommend trying the Silva database, which contains named Frankia sequences, instead of the RDP database, which does not. That will probably solve your problem right there.

Alternatively, since your goal here is not to do a classification in the absence of prior information, but to identify Frankia sequences when you know they exist, you can just compare all of your sequences to a set of known Frankia sequences, and pick out the ones that are close enough.

You can do that by reading in the sequences from a fasta file (e.g. fsqs <- getSequences("my.frankia.fa")) and then finding the minimum aligned hamming distance between each sequences and that set of sequences (e.g. ham.min <- min(nwhamming(sq, fsqs, vec=TRUE))). Then choose all sequences with ham.min less than sum cutoff that you choose.

nizumakun commented 7 years ago

Thanks so much! It worked much better with silva database amended with extra sequences I have added. Still with deep look into taxa I have found many assigned to a another genius while they shouldn't (biologically irrelevant let's say, E.g: they show a blast similarity of 98% to frankia and 98% to an x genius but the taxonomy assignment refer them to x instead of frankia, knowing that x shouldn't be found in such environment). At that point I think it's a just a matter of how rich is the database with my target sequence, I'm I right? On another note, I have also re run using a genius specific gene, so here my output should be Frankia only sequences. In term of diversity data looked normal but abundance wise things were questionable. Do I take the numbers from SeqtabNoc (the way ASV table was called in the paper) as abundance of each sequence? If not, is there anyway to convert that table into a biom format table so I can easily sort unique sequences for each sample together with their corresponding abundance?

Sorry if Im asking a lot, but I really appreciate your help. As you already helped me a lot solving the first issue.

Thanks

benjjneb commented 7 years ago

These general purpose databases are intended for general purpose use. You can do better with databases tailored to your specific environment, but it is more work of course. In your case, if you excluded the nearby sequences from species that you know don't appear in your environment, the algorithm would give you the expected results. This is one reason to consider the targeted approach I described earlier (alignment to Frankia sequences).

Yes, the numbers in the sequence-table are the abundances.

You can write to biom using the biomformat package, see #325 for an example. However, R is a very flexible and powerful environment for doing things like sorting/subsetting/reordering, and you can probably use base R commands on your sequence table to accomplish what you want.

nizumakun commented 7 years ago

Thank you very much! really appreciate your help