UCLOrengoGroup / cath-tools

Protein structure comparison tools such as SSAP and SNAP
http://cath-tools.readthedocs.io
GNU General Public License v3.0

CRH fails with "Cannot resolve_boundary for mis-ordered data" on sensible-looking data #31

Closed tonyelewis closed 7 years ago

tonyelewis commented 7 years ago

Jon has found what looks very much like an error in cath-resolve-hits. I've been able to reduce the input data needed to reproduce the error message from Jon's ~4.8Gb file (!) down to the attached ~4.1Kb file.

The error can be reproduced with:

cath-resolve-hits --input-format hmmsearch_out jon_problem.20170308.hmmsearch.txt

...which gives an error:

2017-03-08 12:34:39.954581 [cath-resolve-hits|error  ] Unable to parse/process resolve-hits input data file "jon_problem.20170308.hmmsearch.txt" of format hmmsearch_out. Error was:
Cannot resolve_boundary for mis-ordered data

jon_problem.20170308.hmmsearch.txt

tonyelewis commented 7 years ago

Closing this after confirming with Jon that cca8fb43e712e710c8eaee4553fa723e137d1c2a has now fixed the problem on the original file too.

tonyelewis commented 7 years ago

More info on what was going wrong:

The error was occurring after CRH had chosen the hits and was then resolving the boundaries.

In the problem case, two hits had correctly been allowed together (i.e. deemed non-conflicting), but only because their short segments had been removed for falling below the --min-seg-length threshold. The later boundary-resolving code is given those short segments (so it can do more informative things, e.g. in HTML output), but it wasn't properly accounting for the --min-seg-length filtering and so was getting confused by them.
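
To make the description above concrete, here is a minimal Python sketch of how two hits can be non-conflicting once short segments are dropped, yet overlap when the short segments are put back. It is not the cath-tools code (which is C++). The function names, the inclusive (start, stop) segment representation, the example coordinates and the threshold value of 7 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is an inclusive (start, stop) pair
# of residue indices, and a hit is a list of such segments.
Segment = Tuple[int, int]


def drop_short_segments(segments: List[Segment], min_seg_length: int) -> List[Segment]:
    """Keep only the segments that are at least min_seg_length residues long."""
    return [(start, stop) for start, stop in segments
            if stop - start + 1 >= min_seg_length]


def segments_overlap(a: Segment, b: Segment) -> bool:
    """Return True if two inclusive segments share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def hits_conflict(hit_a: List[Segment], hit_b: List[Segment], min_seg_length: int) -> bool:
    """Compare two hits after discarding segments below the length threshold."""
    kept_a = drop_short_segments(hit_a, min_seg_length)
    kept_b = drop_short_segments(hit_b, min_seg_length)
    return any(segments_overlap(x, y) for x in kept_a for y in kept_b)


# Illustrative data only: hit_a has a short 4-residue segment (60-63)
# that sits inside hit_b's single segment (58-120).
hit_a = [(1, 50), (60, 63)]
hit_b = [(58, 120)]
MIN_SEG_LENGTH = 7  # assumed threshold for this example

# With the short segment filtered out, the two hits do not conflict...
conflict_when_filtered = hits_conflict(hit_a, hit_b, MIN_SEG_LENGTH)

# ...but with every segment kept (threshold of 1), they do overlap.
conflict_when_unfiltered = hits_conflict(hit_a, hit_b, 1)

print(conflict_when_filtered, conflict_when_unfiltered)
```

In this sketch, any later step that receives the unfiltered segments of both hits has to apply the same threshold (or otherwise tolerate the short segments), because the hits were only judged compatible under the filtered view.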