biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License

BUG: split_sequence error, possible HHsearch issue #55

Closed by tkosciol 7 years ago

tkosciol commented 7 years ago

split_sequence is producing errors for some targets: part of the sequence name is being parsed as a from-to position, e.g.:

Error in job split_PDB while creating output file /localscratch/microprot/microprot_gneg2-1885_403330/GRAMNEG_T1D_5168/02-split_pdb/GRAMNEG_T1D_5168.
RuleException:
ValueError in line 148 of /projects/microprot/microprot/snakemake/Snakefile:
invalid literal for int() with base 10: 'EG_T1D_51'
  File "/projects/microprot/microprot/snakemake/Snakefile", line 148, in __rule_split_PDB
  File "/projects/microprot/microprot/scripts/split_search.py", line 561, in mask_sequence
  File "/projects/microprot/microprot/scripts/split_search.py", line 273, in parse_pdb_match
  File "/projects/microprot/microprot/scripts/split_search.py", line 223, in _parse_hit_block
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Exiting because a job execution failed. Look above for error message
Trying to restart job for rule split_PDB with wildcards {'seq': 'GRAMNEG_T1D_5168'}

Example data and logs are on Barnacle in: /projects/microprot/tmp/split_sequence_errors

sjanssen2 commented 7 years ago

I think I fixed my bug, please review #56. In essence: to determine the start column of the alignment content, I first split the line at \s+, picked the fourth element, and then located that element by searching for it in the line. The search is necessary because splitting at \s+ collapses runs of whitespace, so the column cannot be computed from the split fields alone. The problem arose when the alignment content was, e.g., only one column with content "n", but "n" also appeared in the sequence name, i.e. before the content :-/
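
For readers following along, here is a minimal sketch of the failure mode described above. It is not the code from split_search.py or from PR #56. The function names and the example line are hypothetical, and the line layout (chain marker, name, start, content, end, length) is only an assumption about what an HHsearch alignment row roughly looks like. The second function shows one possible way to avoid the problem, by taking the column from the position of the fourth whitespace-delimited token instead of searching for its text.

```python
import re


def content_start_naive(line):
    """Hypothetical reconstruction of the buggy approach: take the fourth
    whitespace-separated field and search for its text in the line."""
    content = line.split()[3]
    # str.find returns the FIRST occurrence, which may lie inside the
    # sequence name if the content is short (e.g. a single "n").
    return line.find(content)


def content_start_by_token(line):
    """One possible fix (an assumption, not necessarily what PR #56 does):
    use the start offset of the fourth non-whitespace token directly."""
    tokens = list(re.finditer(r"\S+", line))
    return tokens[3].start()


# Hypothetical alignment row whose content is the single column "n".
line = "Q Consensus   12 n   12 (345)"

print(content_start_naive(line))     # 4, the "n" inside "Consensus"
print(content_start_by_token(line))  # 17, the actual alignment column
```

With the wrong start column, every fixed-width slice taken from later rows is shifted into the name field, which is consistent with int() receiving a fragment such as 'EG_T1D_51' in the traceback above.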

tkosciol commented 7 years ago

resolved by PR #56