benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
459 stars 142 forks source link

Customized Reference dataset not giving results for dada2 #1411

Closed urbagal closed 2 months ago

urbagal commented 2 years ago

I work at the Mycotics branch at the CDC and I am collaborating with the input.txt Waterborne Diseases Branch and other branches of my division to create a customized large dataset for 16S and ITS sequences.

I have used 3 publicly available databases (SILVA, RDP, and UNITE) and created a large dataset with sequences with taxonomic nomenclature at least up to the Family level. Currently, my database has 1.3 million reads with good taxonomic nomenclature. I started with qiime2 and tested my DB using mock datasets. It gave me 80% correct results and 20% wrong or only till the phyla level.

For comparison purposes, I selected dada2. I have formatted the reference database as shown on your website. Examples are shown below. I also have a version with no sample id and just the nomenclature and tested it too.

S000366203|kBacteria;pProteobacteria;cAlphaproteobacteria;oRhizobiales;fMethylocystaceae;gMethylosinus;s__uncultured_Methylocystaceae_bacterium_GW60-13-61; AGAGTTTGATCCTGGCTCAGAACGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAACGGGCGCAGCAATGCGTCAGTGGCAGACGGGTGAGTAACGCGTGGGAACGTGCCTTTCGGTTCGGAATAACTCAGGGAAACTTGAGCTAATACCGGATACGCCCTTCGGGGGAAAGATTTATTGCCGAAAGATCGGCCCGCGTCCGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCGGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGCAAGCCTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCGCCAGGGACGA S000388801|kBacteria;pProteobacteria;cAlphaproteobacteria;oRhizobiales;fMethylocystaceae;gMethylosinus;s__methanotroph_M8; CCAAGCACGTCGGGTATTGTTTAAAGAATATCAGCCAGCTGGGATTAGACACGGCCCAGACTCCTGCGGAAGCAGCAGTGGGAATATTGACAATGGGCGCAAGCTTGATCCAGCCATGCCGCGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTCGCCAGGGACGATAATGACGGTACCTGGATAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGTTCGGAATCACTGGGCGTAAAGCGCACGTAGGCGGATTGTTAAGTCAGGGGTGAAATCCCGAGGCTCAACCTCGGAACTGCCTTTGATACTGGCGATCTAGAGTCCGGGAGAGGTGAGTGGAACTGCGAGTGTAGAGGTGAAATTCGTAGATATTCGCAAGAACACCAGTGCCGAAGGCGGCTCACTGGCCCGGAACTGACGCTGAGGTG

For testing purposes, I used the representative sequence obtained from qiime2 (attached ) and tested it using the following commands:-

refdb<-"Archaea_Bac_Fungi_all_Orig_1349660_dada2_v1.fasta"

input<-"input.txt"

q<-readLines(input)

can use minBoot=80

taxa1<-assignTaxonomy(q,refdb,multithread=TRUE,tryRC=TRUE,verbose=TRUE)

print (unname(taxa1))

With my custom DB (Archaea, Bac, and Fungi sequences), I did not get any output. But, When I used the rdp_train_set_16.fa as a reference, I got the output.

Do you happen to have any suggestions to get the custom DB working? I have not edited any reads. They are exactly as from SILVA/RDP/UNITE DB, only the headers have been formatted.

benjjneb commented 2 years ago

My first guess is that it is the S######| part of the fasta header that is causing the problem. assignTaxonomy is parsing everything before the first semicolon as being the Kingdom designation. Since those numbers are all different, nothing has the same Kingdom, leading to failure to confidently assign a Kindgom (an NA), which results in all lower levels also being given NA designations.

urbagal commented 2 years ago

I removed the sample id (S#####|) section. Now it is able to identify they all belong to Bacteria. But it is still not giving me assignments for the lower nomenclatures. I also removed the prefixes (k, p, c .. s) to check if that is creating any issue. But its not making any difference in the assignment.

benjjneb commented 2 years ago

Another consideration is whether the the id lines from the different databases have consistent taxonomic nomenclatures. I'm not sure that RDP and Silva, let alone UNITE, use the same set of taxonomic names. If they don't, you'll get NA assignments where they disagree even if it's by minor spelling differences. You could test that by assigning against just the RDP part of your custom db, in which case you should get assignments.

If that test passes, it's almost certainly remaining issues with the formatting of the id lines. I'd take a close look, and you could try posting the exact id text for like 10 randomly selected entries as well.

urbagal commented 2 years ago

Hello Ben,

In the beginning when I created a large, curated database, for Archaea and Bacteria, I used SILVA and RDP database. But, I did not just add the sequences with taxonomy. I compared the nomenclature at the Family Level and selected unique from SILVA, unique from RDP and for the common families found in both DBs, I took it from SILVA. For Fungi, I just curated (removed reads with bad nomenclature). Based on your above comment, I thought there may be duplicates with different taxonomy and this might be creating a problem. Hence, I used the rescript script from qiime2 to remove the duplicate sequences from the curated fungal database. I used the uniq method and re-formatted the fasta file as per dada2 requirement and tested it with a fungal mock dataset. This worked perfectly. For Bacteria, I tried the same, but it did not work properly. Hence, I used the 'lca' method from rescript to deduplicated the dataset, formatted it and then tested it with a mock dataset and it gave me the taxonomy for all the 6 levels.

But, when I combined the above Fungal, Bacterial and Archaea sequences (deduplicated) and tested with the Bacterial and UNITE Mock dataset used above, it gave me wrong taxonomic results. So, I went back and dereplicated the whole dataset using the ‘lca’ method and that helped me to get results for the Bacterial mock dataset. Unfortunately, the fungal mock dataset which worked for the curated Fungal db mentioned above did not give me correct results for the combined Archaea, Bacterial and Fungal DB. Why should this happen, I cannot understand. Any insight?

Thanks,

Ujwal

benjjneb commented 2 years ago

It's hard to parse out exactly what might be going on, but I suspect it derives from combining different databases that are not necessarily using consistent taxonomic naming schemes. It seems to me from your comment that the difficulties are coming from references you are making that combine databases, and things work when references are made from a single database.

urbagal commented 2 years ago

Actually, I have seen that all the samples have proper nomenclature and each of them have nomenclature atleast till Family level. I also formatted the nomenclature to include K;c;o ;f;g__;s___ prefixes making it easy to separate at individual level if required. I even deduplicated them to prevent any errors.

But it was surprising to see that fungal mock sequences which gave me correct results when I tried using just the curated UNITE DB, gave me Archaea as hits when I combined Archaea, Bac and Fungi sequences.

Would it be fair to say, try to create separate DB for each of the kingdoms rather than combining them and will this be useful for my analysis in future.

Thank you again for providing me the insights. Will keep on working on the DB's and if I get any positive output will notify you.

Thank you again,

Ujwal