biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License

BUG: `MSA_ripe` rule fails on some non-standard amino acids #75

Open tkosciol opened 6 years ago

tkosciol commented 6 years ago

for example:

RuleException:
ValueError in line 264 of /projects/microprot/microprot/snakemake/Snakefile:
Invalid character in sequence: b'U'.
Valid characters: ['W', 'X', 'S', 'C', '-', 'B', 'R', 'H', 'I', 'M', 'N', 'V', 'K', 'F', 'A', 'Y', '*', 'Q', 'T', 'L', 'G', 'E', 'D', 'Z', 'P', '.']
Note: Use `lowercase` if your sequence contains lowercase characters not in the sequence's alphabet.
  File "/projects/microprot/microprot/snakemake/Snakefile", line 264, in __rule_MSA_ripe
  File "/projects/microprot/microprot/scripts/calculate_Neff.py", line 31, in parse_msa_file
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py", line 338, in __init__
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/skbio/sequence/_grammared_sequence.py", line 362, in _validate
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/concurrent/futures/thread.py", line 55, in run
Exiting because a job execution failed. Look above for error message
Trying to restart job for rule MSA_ripe with wildcards {'seq': 'NZ_JHYU01000020.1_81'}

The validation itself is correct behavior, i.e. we shouldn't be getting U as an amino acid. However, it breaks the pipeline.

A solution would be to pre-filter the sequences and replace illegal characters with * before they are parsed.
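A minimal sketch of what such a pre-filter could look like, using only the standard library. The alphabet is copied from the "Valid characters" line of the traceback above. The function names `mask_invalid_residues` and `mask_fasta_lines` are hypothetical and not part of microprot or scikit-bio. The sketch also assumes lowercase residues should be kept as-is when their uppercase form is valid, and that header lines start with `>`.

```python
# Alphabet copied from the "Valid characters" line of the traceback.
VALID_CHARS = frozenset("WXSC-BRHIMNVKFAY*QTLGEDZP.")


def mask_invalid_residues(sequence, replacement="*"):
    """Replace every character outside VALID_CHARS with `replacement`.

    Assumption: lowercase letters are kept unchanged when their
    uppercase form is valid.
    """
    return "".join(
        char if char.upper() in VALID_CHARS else replacement
        for char in sequence
    )


def mask_fasta_lines(lines, replacement="*"):
    """Yield lines with sequence rows masked and header rows untouched.

    Assumption: header lines start with '>', everything else is sequence.
    """
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            body = line.rstrip("\r\n")
            line_ending = line[len(body):]
            yield mask_invalid_residues(body, replacement) + line_ending
```

Something like this could run on the MSA file before `calculate_Neff.py` parses it, for example by writing the output of `mask_fasta_lines(open(path))` to a temporary file. Where exactly it would hook into the Snakefile is left open here.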