bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0
19 stars 9 forks source link

Error in taxonomic assignment #61

Closed jmtsuji closed 5 years ago

jmtsuji commented 6 years ago

Problem description

I have successfully run MATAM on two of my metagenomic datasets with the --perform_taxonomic_assignment flag. However, I receive the following error during taxonomic assignment when I run MATAM on another one of my datasets (subset of logfile below):

INFO - Abundance calculation completed in 69.0905 seconds wall time
INFO - === Taxonomic assignment & Krona visualization ===
CRITICAL - The RDP file is no well formatted, line: 829 -   Root    rootrank    1.0 Bacteria    domain  0.91    "Chlamydiae"    phylum  0.05    Chlamydiia  class   0.05    Chlamydiales    order   0.05    Parachlamydiaceae   family  0.03    Parachlamydia   genus   0.03
Failed to parse RDP file:L227-2014-8m/matam_assembly/workdir/scaffolds.NR.min_500bp.abd.rdp.tab

Troubleshooting

I suspect that the dash ("-") following '829' in the RDP classification file may be the cause of the problem. (This dash suggests that classification was very poor for this sequence, correct?) The line mentioned in the above error message is the only one with a dash in the entire file:

$ grep "-" scaffolds.NR.min_500bp.abd.rdp.tab
829 -   Root    rootrank    1.0 Bacteria    domain  0.91    "Chlamydiae"    phylum  0.05    Chlamydiia  class   0.05    Chlamydiales    order   0.05    Parachlamydiaceae   family  0.03    Parachlamydia   genus   0.03

For comparison, here are a couple other lines from the same file:

284     Root    rootrank    1.0 Bacteria    domain  1.0 "Proteobacteria"    phylum  1.0 Betaproteobacteria  class   1.0 Burkholderiales order   1.0 Burkholderiaceae    family  1.0 Polynucleobacter    genus   1.0
160     Root    rootrank    1.0 Bacteria    domain  1.0 "Verrucomicrobia"   phylum  0.53    Verrucomicrobiae    class   0.51    Verrucomicrobialesorder 0.51    Verrucomicrobiaceae family  0.51    Roseimicrobium  genus   0.4

Possible solutions The issue may be in the read_rpd_file function in rdp.py:

def read_rpd_file(rdp_path):
    """
    parse a rdp file and return a generator
    """
    with open(rdp_path, 'r') as in_rdp_handler:
        for l in in_rdp_handler:
            l = l.strip()
            if not l or l.startswith('#'): continue
            rdp_line = re.split('"?\t+"?', l)
            rdp_line = [field.strip() for field in rdp_line]
            if len(rdp_line) != 19: # seqid + 6 taxonomic levels * 3
                logger.fatal('RDP: wrong number of fields -- %s, expected 19, line: %s' % (len(rdp_line), l))
                sys.exit('Failed to parse RDP file:%s' % rdp_path)

            yield rdp_line

The dash likely causes len(rdp_line) to be 20 instead of 19.

Could read_rdp_file need a special case for handling dashes? (E.g., These sequences could be thrown away due to low confidence.)

loic-couderc commented 6 years ago

Hi @jmtsuji, I'm not sure that the dash is related to poor previsions in RDP format. Anyway, the dash is responsible for this error because of the wrong number of fields as you pointed out. I will be able to patch the code only the next month, sorry for the delay. Thank you.

jmtsuji commented 6 years ago

@loic-couderc Thanks for the quick response.

loic-couderc commented 6 years ago

Hi @jmtsuji, Based on the log message, i can see that you are not using the v1.5.1 of MATAM. In this version, the taxonomy is slightly differently handled. Could you try the last version of MATAM to see if this error is persisting?

jmtsuji commented 6 years ago

Hi @loic-couderc ,

My apologies for the delay in replying. Thanks for the suggestion of trying the new version. I'll aim to let you know by early next week if the bugs I saw disappear once upgraded to v1.5.1.

loic-couderc commented 5 years ago

Fixed in v1.5.2