biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License

BUG: split_search parsing HHsearch output #33

Closed (tkosciol closed this issue 7 years ago)

tkosciol commented 7 years ago

In some cases, e.g. the middle line of the following three hit summary lines:

361 2w0m_A SSO2452; RECA, SSPF, un  92.2   0.022 5.9E-07   45.9   0.0   27  222-248    22-48  (235)
362 3cmu_A Protein RECA, recombina  92.2   0.022 5.9E-07   62.6   0.0   28  222-249  1080-1107(2050)
363 4cr2_I 26S protease regulatory  92.2   0.022   6E-07   54.3   0.0   27  221-247   214-240 (437)

split_sequence is unable to read the correct line in:

        hit[_HEADER[-9]] = float(fields[-9])

and, most likely, in the subsequent fields.
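The failure can be reproduced from the three lines quoted above. The following is a minimal sketch, not the code in split_search.py; `ninth_from_end` is a hypothetical helper that only mimics the `fields[-9]` lookup. In the middle line the template range and the template length are fused into a single token (`1080-1107(2050)`), so a whitespace split yields one field fewer and index -9 lands on the last word of the hit name.

```python
# The three hit summary lines quoted in this issue.
LINES = [
    "361 2w0m_A SSO2452; RECA, SSPF, un  92.2   0.022 5.9E-07   45.9   0.0   27  222-248    22-48  (235)",
    "362 3cmu_A Protein RECA, recombina  92.2   0.022 5.9E-07   62.6   0.0   28  222-249  1080-1107(2050)",
    "363 4cr2_I 26S protease regulatory  92.2   0.022   6E-07   54.3   0.0   27  221-247   214-240 (437)",
]


def ninth_from_end(line):
    """Return the token that a plain whitespace split puts at index -9."""
    return line.split()[-9]


for line in LINES:
    token = ninth_from_end(line)
    try:
        print(line[:3], "Prob =", float(token))
    except ValueError:
        # Happens when the last two columns are fused into one token.
        print(line[:3], "cannot convert", repr(token), "to float")
```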

Example file: /projects/microprot/benchmarking/snakemake_test/2_pdb.out

Running:

        python /projects/microprot/microprot/scripts/split_search.py 2_pdb.out 2.faa -o 2_pdb -p 0.95

gives the error:

Traceback (most recent call last):
  File "/projects/microprot/microprot/scripts/split_search.py", line 96, in _parse_hit_summary_line
    hit[_HEADER[-9]] = float(fields[-9])
ValueError: could not convert string to float: 'recombina'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/projects/microprot/microprot/scripts/split_search.py", line 605, in <module>
    _split_search()
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "/projects/microprot/microprot/scripts/split_search.py", line 598, in _split_search
    frag_len)
  File "/projects/microprot/microprot/scripts/split_search.py", line 499, in mask_sequence
    hits = parse_pdb_match(hhsuite_fp)
  File "/projects/microprot/microprot/scripts/split_search.py", line 249, in parse_pdb_match
    hits.append(_parse_hit_summary_line(line))
  File "/projects/microprot/microprot/scripts/split_search.py", line 136, in _parse_hit_summary_line
    raise ValueError("Unexpected field. Check if line is a HHsearch hit"
ValueError: Unexpected field. Check if line is a HHsearch hit summary line.

tkosciol commented 7 years ago

This is poor design on the HHsuite side; I guess it needs fixing by using an approach similar to the PDB format, where specific character ranges correspond to different fields, e.g.

1-30  name
31-37 Prob
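
A fixed-width parser along these lines could be sketched as follows. The column offsets are assumptions measured from the three sample lines quoted in this issue (0-based, end-exclusive, and including the leading hit number), not taken from HHsuite documentation; the key names and `parse_fixed_width` are hypothetical as well. If HHsearch widens a column for very long values, the offsets would no longer hold.

```python
# Hypothetical column layout: (key, start, end, converter).
# Offsets were measured from the sample lines in this issue only.
COLUMNS = [
    ("No", 0, 4, int),
    ("Hit", 4, 34, str),
    ("Prob", 34, 40, float),
    ("E-value", 40, 48, float),
    ("P-value", 48, 56, float),
    ("Score", 56, 63, float),
    ("SS", 63, 69, float),
    ("Cols", 69, 74, int),
    ("Query HMM", 74, 84, str),
    ("Template HMM", 84, 94, str),
    ("Template length", 94, None, str),
]


def parse_fixed_width(line):
    """Slice a hit summary line by character position instead of splitting."""
    return {
        key: convert(line[start:end].strip())
        for key, start, end, convert in COLUMNS
    }


fused = "362 3cmu_A Protein RECA, recombina  92.2   0.022 5.9E-07   62.6   0.0   28  222-249  1080-1107(2050)"
spaced = "361 2w0m_A SSO2452; RECA, SSPF, un  92.2   0.022 5.9E-07   45.9   0.0   27  222-248    22-48  (235)"

for line in (fused, spaced):
    hit = parse_fixed_width(line)
    print(hit["No"], hit["Prob"], hit["Template HMM"], hit["Template length"])
```

Slicing keeps the hit name intact even when it contains spaces, and it separates the template range from the template length when they touch.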

tkosciol commented 7 years ago

Please commit the changes to the split_search branch.

sjanssen2 commented 7 years ago

I am not fully convinced that we can determine the exact borders (in terms of character position) between fields. It might be the case that they are dynamically resized if the overall numbers are very long?! I'd prefer to stick to my current solution unless you find more counterexamples.
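
For illustration, one approach that does not depend on character positions is to detach a trailing `(length)` token before splitting on whitespace, so that indexing from the end of the line stays valid. This is only a sketch under that assumption; `split_hit_fields` is a hypothetical helper and not necessarily what the current solution or any later PR does. It handles only the fused template length, not other columns that might run together.

```python
import re

# Matches the parenthesised template length at the end of the line,
# together with any whitespace (possibly none) in front of it.
_TRAILING_LENGTH = re.compile(r"\s*(\(\d+\))\s*$")


def split_hit_fields(line):
    """Split a hit summary line after detaching a fused template length."""
    normalized = _TRAILING_LENGTH.sub(r" \1", line)
    return normalized.split()


fused = "362 3cmu_A Protein RECA, recombina  92.2   0.022 5.9E-07   62.6   0.0   28  222-249  1080-1107(2050)"
fields = split_hit_fields(fused)

# Index -9 now points at the Prob column again.
print(float(fields[-9]), fields[-2], fields[-1])
```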

tkosciol commented 7 years ago

solved by PR #36