Overview

I added serialization and deserialization of DNMRanges as XPointers, using the same representation that is planned for KAT. As a short reminder: Tom suggested using arange and string-index. In the test case, the following XPointer is generated:

arange(string-index(//body[1]/text()[4],10),string-index(//body[1]/text()[4],13))

Unfortunately, deserialization currently supports only XPointers in exactly the form produced by the serialization code, so e.g. adding whitespace will probably break it. This should be improved in the future.
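To make the format and its current strictness concrete, here is a minimal sketch of a round trip through the arange/string-index form. It is not the code from this PR: the Endpoint type and the function names are hypothetical, and a real DNMRange holds nodes rather than XPath strings.

```rust
/// Hypothetical, simplified stand-in for one end of a range:
/// an XPath to a text node plus an offset into that node.
#[derive(Debug, PartialEq)]
struct Endpoint {
    xpath: String,
    offset: usize,
}

/// Produces the `arange(string-index(...),string-index(...))` form.
fn serialize(start: &Endpoint, end: &Endpoint) -> String {
    format!(
        "arange(string-index({},{}),string-index({},{}))",
        start.xpath, start.offset, end.xpath, end.offset
    )
}

/// Parses a single `string-index(<xpath>,<offset>)` term.
fn parse_string_index(s: &str) -> Option<Endpoint> {
    let inner = s.strip_prefix("string-index(")?.strip_suffix(')')?;
    // The offset follows the last comma; the XPath itself may contain commas.
    let (xpath, offset) = inner.rsplit_once(',')?;
    Some(Endpoint {
        xpath: xpath.to_string(),
        offset: offset.parse().ok()?,
    })
}

/// Strict inverse of `serialize`: any extra whitespace yields `None`.
fn deserialize(s: &str) -> Option<(Endpoint, Endpoint)> {
    let inner = s.strip_prefix("arange(")?.strip_suffix(')')?;
    // Split at the boundary between the two string-index terms.
    let (first, second) = inner.split_once("),string-index(")?;
    let start = parse_string_index(&format!("{})", first))?;
    let end = parse_string_index(&format!("string-index({}", second))?;
    Some((start, end))
}
```

A parser of this shape accepts only what serialize emits, which matches the limitation described above; tolerating whitespace would mean trimming around each token before matching.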

List of Changes

I removed the DNMParameters::move_whitespaces_between_nodes option. It was already incompatible with many settings (and therefore never used), and it would cause further problems with these changes. It is much better to use DNMRange::trim, which serves a similar purpose.
There is a new DNMParameters::support_back_mapping option, which is currently set to false by default. When enabled, it populates DNM::back_map during the generation: for each character of the plaintext, this vector stores the node and the offset inside the node. Currently, this option is incompatible with word stemming, because stemming changes the offsets in a non-obvious way. It would be a nice project to support it for word stemming as well.
And of course there are now the methods DNMRange::serialize and DNMRange::deserialize.
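As an illustration of what the back mapping stores, the following sketch builds a plaintext together with a per-character back map from a flat list of text nodes. The BackMapEntry type, the node_id field, and the function name are assumptions made for this example; the actual DNM::back_map refers to real DOM nodes.

```rust
/// Hypothetical entry of the back map: which node a character came from
/// and where inside that node it was found.
#[derive(Debug, Clone, PartialEq)]
struct BackMapEntry {
    node_id: usize, // index of the source text node (assumption for this sketch)
    offset: usize,  // character offset inside that node
}

/// Concatenates the text nodes and records, for every character of the
/// resulting plaintext, the node and in-node offset it originated from.
fn build_plaintext_with_back_map(text_nodes: &[&str]) -> (String, Vec<BackMapEntry>) {
    let mut plaintext = String::new();
    let mut back_map = Vec::new();
    for (node_id, text) in text_nodes.iter().enumerate() {
        for (offset, c) in text.chars().enumerate() {
            plaintext.push(c);
            back_map.push(BackMapEntry { node_id, offset });
        }
    }
    (plaintext, back_map)
}
```

The sketch also shows why stemming is awkward: as soon as a step rewrites characters after they are collected, the one-entry-per-character correspondence no longer holds.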

Issues

So far, the code has not been tested much. I've reimplemented the pattern matching code from scratch, and it already uses the back mapping (a first PR should follow soon). It can serve as a basis for more complex tests in the future, in particular once KAT also supports this representation.
Rust strings are UTF-8 encoded, and we don't properly distinguish byte offsets from character offsets. If we apply Unicode normalization this shouldn't be a problem, but otherwise we get runtime errors. I have just implemented a fix for it and will create a second PR once this one is done.

Update: The fix for the UTF-8 issue is now also in here

I didn't consider that the PR would get updated when I pushed the changes to my master branch. Now there is a second commit that should fix at least many of the UTF-8 issues.

The original problem

We did not always distinguish between byte offsets and character offsets. As long as the text contains only ASCII characters (i.e. we use the Unicode normalization), this shouldn't be a problem. Otherwise, we can get runtime errors, e.g. when calling DNMRange::get_subrange, because it takes byte offsets. Also, the whole DNM generation failed with back mapping enabled if the input contained non-ASCII characters.
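The underlying Rust behaviour can be sketched independently of llamapun: slicing a str with a byte offset that falls inside a multi-byte character panics, while str::get returns None instead. The helper names below are hypothetical and only illustrate the conversion from character offsets to byte offsets.

```rust
/// Returns the byte offset at which the `char_offset`-th character starts,
/// or the string's byte length if `char_offset` equals the character count.
fn char_to_byte_offset(s: &str, char_offset: usize) -> Option<usize> {
    s.char_indices()
        .map(|(byte, _)| byte)
        .chain(std::iter::once(s.len()))
        .nth(char_offset)
}

/// Slices by character offsets; returns `None` instead of panicking
/// when an offset is out of range.
fn substring_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    let from = char_to_byte_offset(s, start)?;
    let to = char_to_byte_offset(s, end)?;
    s.get(from..to)
}
```

For example, in "naïve text" the character ï occupies bytes 2 and 3, so byte offset 3 is not a valid slice boundary even though character offset 3 (the letter v) is perfectly meaningful.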

The fix

During the DNM generation, we now use a vector of characters instead of a String. Afterwards, the plaintext is generated from it as a String, along with a vector that maps character offsets to byte offsets. This still allows us to efficiently get a substring of the original plaintext (e.g. in DNMRange::get_plaintext). I also added a method DNMRange::get_subrange_from_byte_offsets, because Senna returns byte offsets. As far as I can tell, Senna seems to accept UTF-8, but it treats all non-ASCII characters as whitespace.
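A minimal sketch of this idea, under some assumptions: the function names are hypothetical, and the offset table gets one extra trailing entry (the total byte length) so that end offsets can be looked up too, which may differ from the actual implementation.

```rust
/// Builds the plaintext from the collected characters, plus a table that
/// maps each character offset to the byte offset where that character starts.
/// The final entry is the total byte length (an assumption of this sketch).
fn finalize_plaintext(chars: &[char]) -> (String, Vec<usize>) {
    let mut plaintext = String::with_capacity(chars.len());
    let mut byte_offsets = Vec::with_capacity(chars.len() + 1);
    for &c in chars {
        byte_offsets.push(plaintext.len());
        plaintext.push(c);
    }
    byte_offsets.push(plaintext.len());
    (plaintext, byte_offsets)
}

/// Substring lookup by character offsets via two table lookups.
/// Panics if an offset is out of range or `start > end`.
fn get_plaintext<'a>(
    plaintext: &'a str,
    byte_offsets: &[usize],
    start: usize,
    end: usize,
) -> &'a str {
    &plaintext[byte_offsets[start]..byte_offsets[end]]
}

/// Converts a byte offset (e.g. one reported by an external tool) back to a
/// character offset; `None` if it does not fall on a character boundary.
fn byte_to_char_offset(byte_offsets: &[usize], byte: usize) -> Option<usize> {
    // The table is strictly increasing, so binary search is valid.
    byte_offsets.binary_search(&byte).ok()
}
```

With such a table, character-based ranges stay cheap to resolve, and byte offsets coming from a tool like Senna can be translated back with a binary search rather than a scan over the string.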

KWARC / llamapun

add serialization/deserialization for DNMRanges/XPointers #7