UCL / cathpy

Python Bioinformatics Toolkit for CATH (Protein Classification Database @ UCL)
http://cathdb.info

Ignoring GS record with incorrect columns #11

Open bordin89 opened 4 years ago

bordin89 commented 4 years ago

I've encountered this issue when running cath-align-summary on 4.3 stockholms.

Here's the command I used:

(venv) -bash-4.2$ cath-align-summary -d /data/v4_3_0.staging/003.correct_funfam_sto_headers/output.seed_alignments.sto/ -f stockholm --suffix sto --skipempty

`2020-08-07 12:20:10,066 WARNING  | ignoring GS record with incorrect columns (<string>:450 "#=GS 1oxgA01/16-27_121-232AC P00766")
2020-08-07 12:20:10,066 WARNING  | ignoring GS record with incorrect columns (<string>:478 "#=GS 1k2i101/18-27_121-232AC P00766")
2020-08-07 12:20:10,066 WARNING  | ignoring GS record with incorrect columns (<string>:513 "#=GS 1gl1C01/16-27_121-232AC P00766")
2020-08-07 12:20:10,066 WARNING  | ignoring GS record with incorrect columns (<string>:520 "#=GS 1gl1B01/16-27_121-232AC P00766")
2020-08-07 12:20:10,066 WARNING  | ignoring GS record with incorrect columns (<string>:527 "#=GS 1gl1A01/16-27_121-232AC P00766")
2020-08-07 12:20:10,066 WARNING  | ignoring GS record with incorrect columns (<string>:534 "#=GS 1gl0E01/17-27_121-232AC P00766")
2020-08-07 12:20:10,066 WARNING  | ignoring GS record with incorrect columns (<string>:576 "#=GS 1gcdA01/16-27_121-232AC P00766")
2020-08-07 12:20:10,067 WARNING  | ignoring GS record with incorrect columns (<string>:583 "#=GS 1ex3A01/16-27_121-232AC P00766")
2020-08-07 12:20:10,067 WARNING  | ignoring GS record with incorrect columns (<string>:590 "#=GS 1dlkD01/1-12_106-217AC P00766")
2020-08-07 12:20:10,067 WARNING  | ignoring GS record with incorrect columns (<string>:597 "#=GS 1dlkB01/1-12_106-217AC P00766")
2020-08-07 12:20:10,067 WARNING  | ignoring GS record with incorrect columns (<string>:611 "#=GS 1chgA01/16-27_121-232AC P00766")
2020-08-07 12:20:10,067 WARNING  | ignoring GS record with incorrect columns (<string>:618 "#=GS 1cgjE01/16-27_121-232AC P00766")
2020-08-07 12:20:10,067 WARNING  | ignoring GS record with incorrect columns (<string>:625 "#=GS 1cgiE01/16-27_121-232AC P00766")
2020-08-07 12:20:10,067 WARNING  | ignoring GS record with incorrect columns (<string>:667 "#=GS 1acbE01/16-27_121-232AC P00766")
2020-08-07 12:20:10,089 WARNING  | ignoring GS record with incorrect columns (<string>:6 "#=GS 1fujD02/13-109_211-221AC P24158")
2020-08-07 12:20:10,089 WARNING  | ignoring GS record with incorrect columns (<string>:8 "#=GS 1fujD02/13-109_211-221DE Myeloblastin")
2020-08-07 12:20:10,089 WARNING  | ignoring GS record with incorrect columns (<string>:11 "#=GS 1fujC02/13-109_211-221AC P24158")
2020-08-07 12:20:10,089 WARNING  | ignoring GS record with incorrect columns (<string>:13 "#=GS 1fujC02/13-109_211-221DE Myeloblastin")
2020-08-07 12:20:10,089 WARNING  | ignoring GS record with incorrect columns (<string>:16 "#=GS 1fujB02/13-109_211-221AC P24158")
2020-08-07 12:20:10,089 WARNING  | ignoring GS record with incorrect columns (<string>:18 "#=GS 1fujB02/13-109_211-221DE Myeloblastin")
2020-08-07 12:20:10,089 WARNING  | ignoring GS record with incorrect columns (<string>:21 "#=GS 1fujA02/13-109_211-221AC P24158")
2020-08-07 12:20:10,090 WARNING  | ignoring GS record with incorrect columns (<string>:23 "#=GS 1fujA02/13-109_211-221DE Myeloblastin")
Traceback (most recent call last):
  File "/cath/people2/ucbtnb4/venv/lib64/python3.6/site-packages/cathpy/core/util.py", line 650, in run
    aln = self._file_parser(aln_file)
  File "/cath/people2/ucbtnb4/venv/lib64/python3.6/site-packages/cathpy/core/align.py", line 1022, in from_stockholm
    seq_id, seq_aa = line.split()
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cath/people2/ucbtnb4/venv/bin/cath-align-summary", line 54, in <module>
    entries = aln_sum.run()
  File "/cath/people2/ucbtnb4/venv/lib64/python3.6/site-packages/cathpy/core/util.py", line 653, in run
    aln_file, self._parser_name.__name__))
AttributeError: 'AlignmentSummaryRunner' object has no attribute '_parser_name'`

Can we expect empty Stockholm files? What seems to be the problem here?

sillitoe commented 4 years ago

Thanks for reporting Nico.

Looks like an issue with white space formatting:

#=GS 1fujA02/13-109_211-221DE Myeloblastin

Should be:

#=GS 1fujA02/13-109_211-221 DE Myeloblastin

Looking into causes, fixes, and affected files now ...
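
To illustrate the formatting problem described above, here is a minimal sketch of how a fused `#=GS` line could be detected and repaired. It is not the cathpy fix itself: the function name `repair_gs_line` is hypothetical, and the pattern assumes that sequence ids end in a digit (as in the ids in the log above) and that the feature tag is two upper-case letters such as `AC` or `DE`.

```python
import re

# Assumption: the sequence id ends in a digit (e.g. "1fujA02/13-109_211-221")
# and the feature tag is two upper-case letters (e.g. "AC", "DE").
FUSED_GS = re.compile(
    r'^#=GS\s+(?P<seq_id>\S*\d)(?P<tag>[A-Z]{2})\s+(?P<value>.*)$'
)


def repair_gs_line(line):
    """Restore the missing space between sequence id and feature tag.

    Lines that are not fused GS records are returned unchanged
    (apart from stripping the trailing newline).
    """
    line = line.rstrip('\n')
    match = FUSED_GS.match(line)
    if match is None:
        return line  # well-formed GS record, or not a GS record at all
    return '#=GS {seq_id} {tag} {value}'.format(**match.groupdict())


if __name__ == '__main__':
    print(repair_gs_line('#=GS 1fujA02/13-109_211-221DE Myeloblastin'))
```

A correctly spaced record does not match the pattern, so running this over an already valid file should leave it untouched. Ids that end in upper-case letters would need a different rule.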

sillitoe commented 4 years ago

For the record, there is also one empty stockholm file:

seed_alignments.sto/1.20.1070.10/seed_alignments/1.20.1070.10-FF-000001.sto
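
A quick way to find files like this before running the summary is to scan the directory tree up front. The sketch below is only an illustration and not part of cathpy: the function name `find_suspect_stockholm_files` is hypothetical, and it assumes that a valid file starts with the `# STOCKHOLM 1.0` header line.

```python
from pathlib import Path

# Assumption: every valid file starts with this header line.
STOCKHOLM_HEADER = '# STOCKHOLM 1.0'


def find_suspect_stockholm_files(root, suffix='.sto'):
    """Yield (path, reason) for files that are empty or lack the header."""
    for path in sorted(Path(root).rglob('*' + suffix)):
        if path.stat().st_size == 0:
            yield path, 'empty'
            continue
        # Only the first line is needed to check the header.
        with path.open() as handle:
            first_line = handle.readline()
        if not first_line.startswith(STOCKHOLM_HEADER):
            yield path, 'missing header'


if __name__ == '__main__':
    import sys

    # Usage: python check_sto.py <directory>
    for sto_path, reason in find_suspect_stockholm_files(sys.argv[1]):
        print('{}\t{}'.format(sto_path, reason))
```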

sillitoe commented 4 years ago

Note:

File "/cath/people2/ucbtnb4/venv/lib64/python3.6/site-packages/cathpy/core/util.py", line 653, in run
    aln_file, self._parser_name.__name__))

This line doesn't exist in HEAD - might be worth updating your local repo.

bordin89 commented 4 years ago

I tried uninstalling cathpy, then updating pip and reinstalling cathpy, and I get cathpy-0.3.1.0. Is this the latest version? I still get some errors:

`(venv) -bash-4.2$ cath-align-summary -d /data/v4_2_0/funfam/families/4.10.990.10 -f stockholm
Traceback (most recent call last):
  File "/cath/people2/ucbtnb4/venv/lib64/python3.6/site-packages/cathpy/core/util.py", line 650, in run
    aln = self._file_parser(aln_file)
  File "/cath/people2/ucbtnb4/venv/lib64/python3.6/site-packages/cathpy/core/align.py", line 928, in from_stockholm
    assert sto_header.startswith('# STOCKHOLM 1.0')
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cath/people2/ucbtnb4/venv/bin/cath-align-summary", line 54, in <module>
    entries = aln_sum.run()
  File "/cath/people2/ucbtnb4/venv/lib64/python3.6/site-packages/cathpy/core/util.py", line 653, in run
    aln_file, self._parser_name.__name__))
AttributeError: 'AlignmentSummaryRunner' object has no attribute '_parser_name'`

sillitoe commented 4 years ago

Yup, cathpy is currently on 0.3.10 in PyPI. I thought you might be using a local install (e.g. from GitHub).

I've just run this with a clean install (on rodan) and everything seemed to work.

$ hostname
rodan.biochem.ucl.ac.uk

$ pwd
/data/v4_3_0.staging/01-filter-alignments

$ head -n 5 cath-align-summary.log
# path  aln_length      seq_count       dops    gap_per
seed_alignments.sto/2.40.30.70/seed_alignments/2.40.30.70-FF-000021.sto     44      1   0.00   0.00
seed_alignments.sto/2.40.30.70/seed_alignments/2.40.30.70-FF-000023.sto    101      1   0.00   0.00
seed_alignments.sto/2.40.30.70/seed_alignments/2.40.30.70-FF-000007.sto    150      7   2.75   0.00
seed_alignments.sto/2.40.30.70/seed_alignments/2.40.30.70-FF-000011.sto    126      2  32.86   7.94

Did you try a clean install (i.e. a completely clean venv)?

bordin89 commented 4 years ago

Yep, clean venv and clean install.