biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License

Failed to deserialize category 'entity' with ValueError: No closing quotation #570

Open 0ut0fcontrol opened 1 month ago

0ut0fcontrol commented 1 month ago

In 1n5m.cif, the pdbx_description value in the entity category fails to parse with "No closing quotation": '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'

In total, three entries in the wwPDB database are affected: 1n5m, 6szp and 1tsl.

I can fix it, but I'm not sure about the best way to do so. Do you have any suggestions?

Code and traceback

# test_example.py
# biotite=0.40.0
import biotite.structure.io.pdbx as pdbx
from biotite.database import rcsb
import biotite.structure as struc
import biotite.structure.io as strucio

# There are a total of 3 problematic examples in the wwPDB database.
pdb_id = "1n5m"
# pdb_id = "6szp"
# pdb_id = "1tsl"

try:
    cif_file = pdbx.CIFFile.read(f"/tmp/{pdb_id}.cif")
except FileNotFoundError:  # file is not cached yet, so download it first
    rcsb.fetch(pdb_id, "cif", target_path="/tmp")
    cif_file = pdbx.CIFFile.read(f"/tmp/{pdb_id}.cif")

entity = cif_file.block["entity"]

$ python test_example.py
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 658, in __getitem__
    category = CIFCategory.deserialize(category, expect_whitespace)
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 382, in deserialize
    category_dict = CIFCategory._deserialize_looped(
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 486, in _deserialize_looped
    values = shlex.split(data_line)
  File "/usr/lib/python3.9/shlex.py", line 315, in split
    return list(lex)
  File "/usr/lib/python3.9/shlex.py", line 300, in __next__
    token = self.get_token()
  File "/usr/lib/python3.9/shlex.py", line 109, in get_token
    raw = self.read_token()
  File "/usr/lib/python3.9/shlex.py", line 191, in read_token
    raise ValueError("No closing quotation")
ValueError: No closing quotation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/test_example.py", line 18, in <module>
    entity = cif_file.block["entity"]
  File "/usr/local/lib/python3.9/dist-packages/biotite/structure/io/pdbx/cif.py", line 660, in __getitem__
    raise DeserializationError(
biotite.structure.io.pdbx.DeserializationError: Failed to deserialize category 'entity'

$ grep '^_entity.id' -A 20 /tmp/1n5m.cif 
_entity.id 
_entity.type 
_entity.src_method 
_entity.pdbx_description 
_entity.formula_weight 
_entity.pdbx_number_of_molecules 
_entity.pdbx_ec 
_entity.pdbx_mutation 
_entity.pdbx_fragment 
_entity.details 
1 polymer     man acetylcholinesterase                                                     59592.309 2   3.1.1.7 ? 
'CATALYTIC DOMAIN' ? 
2 branched    man 'alpha-L-fucopyranose-(1-6)-2-acetamido-2-deoxy-beta-D-glucopyranose'    367.349   1   ?       ? ? ? 
3 non-polymer syn 'IODIDE ION'                                                             126.904   10  ?       ? ? ? 
4 non-polymer syn 'HEXAETHYLENE GLYCOL'                                                    282.331   1   ?       ? ? ? 
5 non-polymer syn '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' 510.816   1   ?       ? ? ? 
6 non-polymer man 2-acetamido-2-deoxy-beta-D-glucopyranose                                 221.208   1   ?       ? ? ? 
7 non-polymer syn 'CARBONATE ION'                                                          60.009    1   ?       ? ? ? 
8 non-polymer syn 'TETRAETHYLENE GLYCOL'                                                   194.226   1   ?       ? ? ? 
9 water       nat water                                                                    18.015    551 ?       ? ? ? 
#

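The failure can be reproduced without biotite or a downloaded file. The following is a minimal sketch that passes the data line of entity 5 (with its column padding collapsed to single spaces) directly to shlex.split(), the call shown in the traceback above. The variable names are chosen for illustration only.

```python
import shlex

# Data line of entity 5 from 1n5m.cif, with column padding collapsed
line = (
    "5 non-polymer syn "
    "'2,2',2\"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' "
    "510.816 1 ? ? ? ?"
)

try:
    shlex.split(line)
    error_message = None
except ValueError as err:
    # shlex uses shell quoting rules: it closes the single-quoted part
    # after "2,2" and then treats the following double quote as the
    # start of a new quoted section that never ends
    error_message = str(err)

print(error_message)
```
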
padix-key commented 1 month ago

At first I thought the problem was a quote-escaping error on the side of the RCSB. However, I then looked into the CIF specification:

  1. Matching single or double quote characters (' or ") may be used to bound a string representing a non-simple data value provided the string does not extend over more than one line.

  2. Because data values are invariably separated from other tokens in the file by white space, such a quote-delimited character string may contain instances of the character used to delimit the string provided they are not followed by white space. For example, the data item _example 'a dog's life' is legal; the data value is a dog's life.

  3. Note that constructs such as 'an embedded \' quote' do not behave as in the case of many current programming languages; i.e. the backslash character in this context does not escape the special meaning of the delimiter character. A backslash preceding the apostrophe or double-quote characters does, however, have special meaning in the context of accented characters (paragraph 32 of the document Common semantic features) provided there is no white space immediately following the apostrophe or double-quote character.
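The quoting rule quoted above can be sketched as a small hand-written tokenizer. This is only an illustration under stated assumptions, not biotite's implementation: the function name split_cif_line is hypothetical, only single-line values are handled (no semicolon-delimited text fields), and a "#" at the start of a token is treated as the beginning of a comment.

```python
def split_cif_line(line):
    """Split one CIF data line into values (hypothetical helper)."""
    values = []
    i = 0
    n = len(line)
    while i < n:
        char = line[i]
        if char.isspace():
            i += 1
        elif char in "'\"":
            # A quoted value ends at the same quote character that is
            # followed by whitespace or by the end of the line
            j = i + 1
            while j < n:
                if line[j] == char and (j + 1 == n or line[j + 1].isspace()):
                    break
                j += 1
            else:
                raise ValueError("No closing quotation")
            values.append(line[i + 1 : j])
            i = j + 1
        elif char == "#":
            # The rest of the line is a comment
            break
        else:
            # An unquoted value extends to the next whitespace
            j = i
            while j < n and not line[j].isspace():
                j += 1
            values.append(line[i:j])
            i = j
    return values


entity_5 = (
    "5 non-polymer syn "
    "'2,2',2\"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]' "
    "510.816   1   ?       ? ? ? "
)
print(split_cif_line(entity_5))
print(split_cif_line("_example 'a dog's life'"))
```

With this rule, the embedded ' and " characters in the 1n5m description stay part of the value, because neither is followed by white space.
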

This means that the quote handling based on the shlex module in biotite.structure.io.pdbx is wrong. Consequently, the following code locations

https://github.com/biotite-dev/biotite/blob/2e053cb71f09e818cec02ce01e4b4056d5bdd08d/src/biotite/structure/io/pdbx/cif.py#L450

https://github.com/biotite-dev/biotite/blob/2e053cb71f09e818cec02ce01e4b4056d5bdd08d/src/biotite/structure/io/pdbx/cif.py#L476-L494

https://github.com/biotite-dev/biotite/blob/2e053cb71f09e818cec02ce01e4b4056d5bdd08d/src/biotite/structure/io/pdbx/cif.py#L974

https://github.com/biotite-dev/biotite/blob/2e053cb71f09e818cec02ce01e4b4056d5bdd08d/src/biotite/structure/io/pdbx/cif.py#L974-L978

need to be replaced. For splitting, re.split() should work in place of shlex.split(), provided you find a robust pattern. shlex.quote() can probably be replaced by

https://github.com/biotite-dev/biotite/blob/2e053cb71f09e818cec02ce01e4b4056d5bdd08d/src/biotite/structure/io/pdbx/cif.py#L995-L1012
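For the writing direction, a quoting function that follows the CIF rule could look roughly like the sketch below. It is not the code at the linked lines, and the name quote_cif_value is hypothetical. It assumes single-line values only and does not handle reserved words or the special values "?" and ".".

```python
import re


def quote_cif_value(value):
    """Quote a single-line value for a CIF file (hypothetical helper)."""
    if "\n" in value:
        raise ValueError("Multi-line values require a semicolon text field")
    needs_quoting = (
        value == ""
        or re.search(r"\s", value) is not None
        # Characters with a special meaning at the start of a token
        or value[0] in "'\"#_$;[]"
    )
    if not needs_quoting:
        return value
    for quote in ("'", '"'):
        # A delimiter is usable as long as the value never contains
        # that character directly followed by whitespace
        if re.search(re.escape(quote) + r"\s", value) is None:
            return quote + value + quote
    raise ValueError("Value cannot be quoted on a single line")


print(quote_cif_value("water"))
print(quote_cif_value("a dog's life"))
print(quote_cif_value("rock 'n' roll"))
```

The second call keeps the single quote as delimiter because the apostrophe in "dog's" is not followed by white space, while the third falls back to double quotes.
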

padix-key commented 1 month ago

Should I assign the issue to you then?

0ut0fcontrol commented 1 month ago

Sure, please assign it to me.