bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0

Problem when trying to use RDP and UNITE DB as reference #68

Closed quetjaune closed 5 years ago

quetjaune commented 5 years ago

I have used MATAM effectively with the reference DB SILVA_132_SSURef (16srrna training model) and SILVA_132_LSURef (fungallsu training model), but when trying to use RDP_ITS_v2 (fungalits_warcup training model) or UNITE_sh_dynamic (fungalits_unite training model) I get this error:

INFO - === Taxonomic assignment & Krona visualization ===
CRITICAL - RDP: wrong number of fields -- 22, expected 19, line: 20 Fungi domain 1.0 Basidiomycota phylum 0.71 Agaricomycetes class 0.62 Russulales order 0.5 Russulaceae family 0.49 Russula genus 0.48 Russula olivacea species 0.48

How can I fix it? Could you provide a link to download a DB model for RDP and UNITE? Thanks in advance.

loic-couderc commented 5 years ago

Hi @quetjaune,

In MATAM, we use RDP with the --fixrank argument to ensure consistency in taxonomic path prediction (i.e. RDP will only output results for a fixed list of ranks, in the following order: domain, phylum, class, order, family and genus).

The bug comes from a wrong number of fields in the file generated by RDP as the species level is present. This behaviour seems to be in conflict with the documentation of RDP.
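The field counts in the error message can be reproduced with a small sketch. It assumes that each line is tab-separated, starts with one sequence identifier, and then carries one (name, rank, confidence) triple per rank, with empty columns ignored. That layout is inferred from the numbers 19 and 22 in the message, not taken from MATAM's rdp.py, and the function names are hypothetical.

```python
def expected_field_count(n_ranks):
    """One sequence identifier plus a (name, rank, confidence) triple per rank."""
    return 1 + 3 * n_ranks


def split_fixrank_fields(line):
    """Split a tab-separated classifier line, ignoring empty columns (assumption)."""
    return [field for field in line.rstrip("\n").split("\t") if field]


# The line from the error message, with tab separators restored as an assumption.
sample = "\t".join([
    "20",
    "Fungi", "domain", "1.0",
    "Basidiomycota", "phylum", "0.71",
    "Agaricomycetes", "class", "0.62",
    "Russulales", "order", "0.5",
    "Russulaceae", "family", "0.49",
    "Russula", "genus", "0.48",
    "Russula olivacea", "species", "0.48",
])

fields = split_fixrank_fields(sample)
ranks = fields[2::3]  # every third field, starting at the first rank label

print(len(fields), expected_field_count(6), expected_field_count(len(ranks)))
print(ranks)
```

Under these assumptions, six ranks give the expected 19 fields, and the extra species triple accounts for the 22 fields reported.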

Nevertheless, to work around this issue and let MATAM finish the process, I think that updating the rdp.py script to accept 22 fields would do the trick. (But if the taxonomy depth is mixed across lines, it won't.)

To do this, first determine where your miniconda install directory is located (if MATAM has been installed with the conda package):

type matam_assembly.py 
matam_assembly.py is /home/ubuntu/miniconda3/bin/matam_assembly.py

In my case: /home/ubuntu/miniconda3/

The rdp.py script will be located in: $MINICONDA/opt/matam-v1.5.2/scripts/rdp.py
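As a minimal sketch, the same path derivation can be written in Python. The helper names are hypothetical, and the layout (executable in `<prefix>/bin/`, script under `opt/matam-v1.5.2/scripts/`) is taken from the paths quoted above for this particular install.

```python
from pathlib import PurePosixPath


def conda_prefix(executable_path):
    """Return the install prefix for an executable located in <prefix>/bin/."""
    path = PurePosixPath(executable_path)
    if path.parent.name != "bin":
        raise ValueError(f"{executable_path} is not inside a bin/ directory")
    return path.parent.parent


def rdp_script(prefix, matam_version="v1.5.2"):
    """Build the rdp.py location, assuming the layout described in this thread."""
    return PurePosixPath(prefix) / "opt" / f"matam-{matam_version}" / "scripts" / "rdp.py"


prefix = conda_prefix("/home/ubuntu/miniconda3/bin/matam_assembly.py")
print(prefix)
print(rdp_script(prefix))
```

On a real system, `shutil.which("matam_assembly.py")` could supply the executable path instead of the hard-coded example.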

Then apply the following patch: rdp.patch.txt

#apply the patch
patch $MINICONDA/opt/matam-v1.5.2/scripts/rdp.py < rdp.patch.txt

To avoid recomputing all the steps of MATAM, you can add these arguments to your command line to restart MATAM: -v --perform_taxonomic_assignment --resume_from abundance_calculation

As this modification is only a quick fix, you can undo the changes with:

patch --reverse $MINICONDA/opt/matam-v1.5.2/scripts/rdp.py < rdp.patch.txt

Finally, as there is a conflict with the RDP documentation, I would strongly recommend that you check the generated RDP file (rdp.tab) to ensure that the taxonomic ranks follow the convention: domain, phylum, class, order, family, genus, species.
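One way to run this check is sketched below. It assumes that rdp.tab is tab-separated, with one sequence identifier followed by (name, rank, confidence) triples and empty columns ignored; that layout is inferred from the error message rather than from MATAM's code, and `find_rank_mismatches` is a hypothetical helper.

```python
EXPECTED_RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def find_rank_mismatches(lines, expected=EXPECTED_RANKS):
    """Return (line_number, ranks) for every line whose ranks differ from `expected`."""
    mismatches = []
    for number, line in enumerate(lines, start=1):
        # Assumption: tab-separated fields, empty columns ignored.
        fields = [field for field in line.rstrip("\n").split("\t") if field]
        if not fields:
            continue  # skip blank lines
        ranks = tuple(fields[2::3])  # rank labels sit at every third field
        if ranks != tuple(expected):
            mismatches.append((number, ranks))
    return mismatches
```

Passing an open file handle for rdp.tab as `lines` would return an empty list when every line follows the convention. Lines that stop at genus, which is the mixed-depth case mentioned earlier, would be reported with their line numbers.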

quetjaune commented 5 years ago

Hi @loic-couderc, Thank you so much for your clearly detailed answer!!

I have applied the patch to rdp.py and it is working perfectly! The generated rdp.tab file follows the convention: domain, phylum, class, order, family, genus, species. It also matches the output obtained when using the online classifier tool directly.

Thanks for contributing with this amazing software!