Open 0ut0fcontrol opened 1 month ago
At first I thought the problem was an error in quote escaping on the RCSB side. However, I then looked into the CIF specification:
Matching single or double quote characters (' or ") may be used to bound a string representing a non-simple data value provided the string does not extend over more than one line.
Because data values are invariably separated from other tokens in the file by white space, such a quote-delimited character string may contain instances of the character used to delimit the string provided they are not followed by white space. For example, the data item
_example 'a dog's life'
is legal; the data value is a dog's life. Note that constructs such as 'an embedded \' quote' do not behave as in the case of many current programming languages; i.e. the backslash character in this context does not escape the special meaning of the delimiter character. A backslash preceding the apostrophe or double-quote characters does, however, have special meaning in the context of accented characters (paragraph 32 of the document Common semantic features) provided there is no white space immediately following the apostrophe or double-quote character.
This means the quote escaping using the shlex module in biotite.io.pdbx is wrong.
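To make the quoted rule concrete, here is a minimal sketch of a tokenizer in which a quote only closes a string when it is followed by white space or the end of the line. The name split_cif_line and the regular expression are illustrative assumptions, not existing biotite code, and the sketch ignores comments and multi-line text fields.

```python
import re

# Hypothetical pattern (not part of biotite): a token is either a quoted
# string whose closing quote is followed by whitespace or end of line,
# or a run of non-whitespace characters.
_TOKEN = re.compile(
    r"""
    \s*                               # skip leading whitespace
    (?:
        '(?P<single>.*?)'(?=\s|$)     # single-quoted value
      | "(?P<double>.*?)"(?=\s|$)     # double-quoted value
      | (?P<bare>\S+)                 # unquoted value
    )
    """,
    re.VERBOSE,
)


def split_cif_line(line):
    """Split one CIF line into values using the whitespace-terminated quote rule."""
    values = []
    for match in _TOKEN.finditer(line):
        for name in ("single", "double", "bare"):
            value = match.group(name)
            if value is not None:
                values.append(value)
                break
    return values


example = split_cif_line("_example 'a dog's life'")
# The backslash does not escape the quote, so the string ends at \' + space.
backslash = split_cif_line(r"'an embedded \' quote'")
```

With this rule the embedded apostrophe in the specification's example stays inside the value, because it is followed by a letter rather than by white space.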
This means
need to be replaced. For splitting, re.split() should work instead of shlex.split(), if you find some robust pattern. shlex.quote() can probably be replaced by
Should I assign the issue to you then?
Sure, please assign it to me.
1n5m.cif
pdbx_description in the entity category is reported as having no closing quotation: '2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'
There are a total of 3 problematic examples in the wwPDB database: 1n5m, 6szp, 1tsl
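For reference, the failure can be reproduced with the standard library alone, without biotite. The sketch below feeds the 1n5m value to shlex.split() directly; the helper name try_shlex is made up for illustration, and how biotite actually calls shlex may differ.

```python
import shlex

# The pdbx_description value from 1n5m, including its surrounding quotes.
value = """'2,2',2"-[1,2,3-BENZENE-TRIYLTRIS(OXY)]TRIS[N,N,N-TRIETHYLETHANAMINIUM]'"""


def try_shlex(text):
    """Return the shlex tokens, or the error message if splitting fails."""
    try:
        return shlex.split(text)
    except ValueError as err:
        return str(err)


# shlex treats the second ' as the end of the string and then sees the "
# as the start of a new quoted string that is never closed.
result = try_shlex(value)
print(result)
```

Under shell quoting rules the double quote after 2,2', opens a new string, whereas under the CIF rule it is an ordinary character because the single-quoted string has not ended yet.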
I can fix it, but I'm not sure about the best way to do so. Do you have any suggestions?
Code and traceback