I have a large XML file with many residues. At some point, adding one more residue breaks the parsing, as can be seen here for the H1 atom of water:
2022-01-18 11:51:35,867 - pdb2pqr.definitions - INFO - Got text for <name>: H1
2022-01-18 11:51:35,867 - pdb2pqr.definitions - INFO - Got text for <altname>: HW
2022-01-18 11:51:35,867 - pdb2pqr.definitions - INFO - Got text for <altname>: HH1
2022-01-18 11:51:35,867 - pdb2pqr.definitions - INFO - Got text for <altname>: 1H
2022-01-18 11:51:35,867 - pdb2pqr.definitions - INFO - Got text for <x>: 2.865
2022-01-18 11:51:35,867 - pdb2pqr.definitions - INFO - Got text for <y>: 56.
2022-01-18 11:51:35,867 - pdb2pqr.definitions - INFO - Got text for <y>: 756
When that happens, the current DefinitionHandler drops the "56." chunk and keeps "756" as the y coordinate of the H1 atom (presumably 56.756 in the file). Then all hell breaks loose, with water atoms flying into space.
Looking at the SAX parser docs, we can find the following interesting snippet: https://docs.python.org/3.8/library/xml.sax.handler.html#xml.sax.handler.ContentHandler.characters
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
In a similar issue on SO (https://stackoverflow.com/a/19793186/1198173), the answer suggests accumulating the character data and parsing it only at the end of the element, which makes sense to me.
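A minimal sketch of that accumulate-then-parse idea, using only the standard library. The class AccumulatingHandler and the helper parse_in_chunks are hypothetical names for illustration and are not pdb2pqr's actual DefinitionHandler; the input is fed in two pieces only to mimic the split seen in the log above.

```python
import xml.sax


class AccumulatingHandler(xml.sax.ContentHandler):
    """Hypothetical handler: buffers text and reads it only at the end tag."""

    def __init__(self):
        super().__init__()
        self._chunks = []  # text pieces seen since the last start or end tag
        self.values = {}   # leaf element name -> complete text

    def startElement(self, name, attrs):
        # Start a fresh buffer for each element (enough for leaf elements;
        # text mixed between child elements is discarded in this sketch).
        self._chunks = []

    def characters(self, content):
        # May be called more than once per text node, so only collect here.
        self._chunks.append(content)

    def endElement(self, name):
        # The text is complete only now, so this is the safe place to use it.
        text = "".join(self._chunks).strip()
        if text:
            self.values[name] = text
        self._chunks = []


def parse_in_chunks(pieces):
    """Feed the XML piece by piece, so text can be split across calls."""
    parser = xml.sax.make_parser()
    handler = AccumulatingHandler()
    parser.setContentHandler(handler)
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return handler.values


# The split falls in the middle of the <y> text, as in the log.
values = parse_in_chunks(["<atom><name>H1</name><y>56.", "756</y></atom>"])
y = float(values["y"])
```

With this approach the handler does not depend on how many times characters() is called for one element, since the conversion to float happens only once the end tag is seen.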
I'll try to make a fix for this ASAP. Let's delay the release (https://github.com/Electrostatics/pdb2pqr/issues/292) until this is resolved.