KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

add serialization/deserialization for DNMRanges/XPointers #7

Closed jfschaefer closed 7 years ago

jfschaefer commented 7 years ago

Overview

I added serialization/deserialization of DNMRanges/XPointers in the same way that is planned for KAT. As a short reminder: Tom suggested using arange and string-index. In the test case, the following XPointer is generated:

arange(string-index(//body[1]/text()[4],10),string-index(//body[1]/text()[4],13))

Unfortunately, deserialization currently only supports XPointers generated by the serialization code, so e.g. adding whitespace will probably break it. This should be improved in the future.
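To illustrate the format, here is a minimal sketch of writing and strictly parsing such an XPointer. The type Endpoint and the functions serialize_range and deserialize_range are hypothetical names chosen for this example, not the code in this PR. The parser is deliberately as strict as described above: it only accepts the exact shape the serializer emits.

```rust
/// Hypothetical stand-in for one end of a range: an XPath to a text node
/// plus an offset into it. This is not llamapun's actual DNMRange type.
#[derive(Debug, PartialEq)]
pub struct Endpoint {
    pub xpath: String,
    pub offset: usize,
}

/// Writes the range as arange(string-index(...),string-index(...)).
pub fn serialize_range(start: &Endpoint, end: &Endpoint) -> String {
    format!(
        "arange(string-index({},{}),string-index({},{}))",
        start.xpath, start.offset, end.xpath, end.offset
    )
}

/// Parses a single "string-index(<xpath>,<offset>)" term.
fn parse_endpoint(s: &str) -> Option<Endpoint> {
    let inner = s.strip_prefix("string-index(")?.strip_suffix(')')?;
    // The XPath itself may contain commas, so split at the last one.
    let (xpath, offset) = inner.rsplit_once(',')?;
    Some(Endpoint {
        xpath: xpath.to_string(),
        // No trimming: " 13" fails to parse, so extra whitespace is rejected.
        offset: offset.parse().ok()?,
    })
}

/// Strict parser: accepts only the exact shape produced by serialize_range.
pub fn deserialize_range(s: &str) -> Option<(Endpoint, Endpoint)> {
    let inner = s.strip_prefix("arange(")?.strip_suffix(')')?;
    // Locate the boundary between the two string-index terms.
    let pos = inner.find("),string-index(")?;
    let first = parse_endpoint(&inner[..pos + 1])?;
    let second = parse_endpoint(&inner[pos + 2..])?;
    Some((first, second))
}
```

A more tolerant deserializer would need a small tokenizer that skips whitespace between tokens instead of matching fixed prefixes.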

List of Changes

Issues

jfschaefer commented 7 years ago

Update: The fix for the UTF-8 issue is now also in here

I didn't realize that the PR would be updated when I pushed changes to my master branch. There is now a second commit that should fix at least a large part of the UTF-8 issues.

The original problem

We did not always distinguish between byte offsets and character offsets. As long as the text contains only ASCII characters (i.e. we use the Unicode normalization), this isn't a problem. Otherwise, we can get runtime errors, e.g. when calling DNMRange::get_subrange, because it takes byte offsets. Also, the whole DNM generation failed for the back mapping whenever the text contained non-ASCII characters.
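The difference is easy to reproduce with plain Rust strings. The following sketch uses only the standard library; the helper names are made up for this example. Slicing a str with byte offsets that fall inside a multi-byte character panics when using index syntax, while str::get returns None instead.

```rust
/// Returns (number of characters, number of bytes) of `text`.
/// The two values are equal only for pure ASCII input.
pub fn char_and_byte_len(text: &str) -> (usize, usize) {
    (text.chars().count(), text.len())
}

/// Checked substring by byte offsets. Unlike `&text[start..end]`, this
/// returns None instead of panicking when an offset is out of bounds or
/// falls inside a multi-byte character.
pub fn checked_byte_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}
```

For example, the string "α + β" has 5 characters but 7 bytes, so character offset 1 (the first space) corresponds to byte offset 2, and byte offset 1 is not a valid slice boundary at all.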

The fix

Use a vector of characters during the DNM generation instead. Afterwards, the plaintext is generated from it as a String, along with a vector that maps character offsets to byte offsets. This still allows us to efficiently get a substring of the original plaintext (e.g. DNMRange::get_plaintext). I also added a method DNMRange::get_subrange_from_byte_offsets, because Senna returns byte offsets. As far as I can tell, Senna accepts UTF-8 but treats all non-ASCII characters as whitespace.
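The idea can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: the struct Plaintext and its methods are hypothetical names, and the real DNM stores more than this. The offset vector has one entry per character plus a final entry for the end of the string, so both directions of the mapping are cheap.

```rust
/// Hypothetical container: the plaintext plus a table mapping character
/// offsets to byte offsets (not llamapun's actual DNM struct).
pub struct Plaintext {
    pub text: String,
    /// byte_offsets[i] is the byte offset of the i-th character;
    /// the last entry equals text.len().
    pub byte_offsets: Vec<usize>,
}

impl Plaintext {
    /// Builds the String and the offset table from a slice of characters.
    pub fn from_chars(chars: &[char]) -> Self {
        let mut text = String::with_capacity(chars.len());
        let mut byte_offsets = Vec::with_capacity(chars.len() + 1);
        for &c in chars {
            byte_offsets.push(text.len());
            text.push(c);
        }
        byte_offsets.push(text.len());
        Plaintext { text, byte_offsets }
    }

    /// Substring by character offsets: two table lookups and one slice.
    pub fn substring(&self, char_start: usize, char_end: usize) -> Option<&str> {
        let start = *self.byte_offsets.get(char_start)?;
        let end = *self.byte_offsets.get(char_end)?;
        self.text.get(start..end)
    }

    /// Converts a byte offset (e.g. one reported by an external tool) back
    /// to a character offset. Returns None if the byte offset does not lie
    /// on a character boundary. The table is strictly increasing, so a
    /// binary search is sufficient.
    pub fn char_offset_of_byte(&self, byte_offset: usize) -> Option<usize> {
        self.byte_offsets.binary_search(&byte_offset).ok()
    }
}
```

The table costs one usize per character, which trades memory for constant-time substring extraction by character offset; the reverse lookup is logarithmic in the text length.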