Closed jfschaefer closed 7 years ago
I didn't realize that the PR would be updated when I pushed the changes to my master branch. There is now a second commit that should fix many of the UTF-8 issues.
We did not always distinguish between byte offsets and character offsets. As long as the text contains only ASCII characters (i.e. we use the Unicode normalization), this isn't a problem. Otherwise, we can get runtime errors, e.g. when calling `DNMRange::get_subrange`, because it takes byte offsets.
Also, DNM generation with back mapping failed entirely whenever the text contained non-ASCII characters.
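To illustrate the difference, here is a minimal standalone sketch (not code from this PR). The helpers `slice_by_bytes` and `slice_by_chars` are hypothetical names used only for this example; they rely on standard library methods only.

```rust
/// Byte-offset slicing: returns None instead of panicking when an offset
/// falls inside a multi-byte character.
fn slice_by_bytes(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

/// Character-offset slicing: translates character offsets to byte offsets first.
/// This scans the string, so it is O(n) per call.
fn slice_by_chars(text: &str, start: usize, end: usize) -> Option<&str> {
    // Byte offset of every character, followed by text.len() as the end position.
    let byte_of = |char_idx: usize| {
        text.char_indices()
            .map(|(byte_idx, _)| byte_idx)
            .chain(std::iter::once(text.len()))
            .nth(char_idx)
    };
    text.get(byte_of(start)?..byte_of(end)?)
}

fn main() {
    // "naive cafe" with a diaeresis on the i and an acute accent on the final e.
    let text = "na\u{ef}ve caf\u{e9}";
    assert_eq!(text.chars().count(), 10); // 10 characters
    assert_eq!(text.len(), 12); // 12 bytes: the two accented letters take 2 bytes each

    // Characters 6..10 are the second word.
    assert_eq!(slice_by_chars(text, 6, 10), Some("caf\u{e9}"));
    // Byte 3 lies inside the two-byte character at bytes 2..4.
    assert_eq!(slice_by_bytes(text, 0, 3), None);
    // Plain indexing with the same offsets, &text[0..3], would panic at runtime.
}
```

With pure ASCII input both functions agree, which is why the problem only shows up once non-ASCII characters appear.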
Use a vector of characters during the DNM generation instead. Afterwards, the plaintext is generated from it as a `String`, along with a vector that maps character offsets to byte offsets.
This still allows us to efficiently get a substring of the original plaintext (e.g. `DNMRange::get_plaintext`). I also added a method `DNMRange::get_subrange_from_byte_offsets`, because Senna returns byte offsets. As far as I can tell, Senna accepts UTF-8 but treats all non-ASCII characters as whitespace.
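The idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual DNM code: the type `Plaintext` and its methods are hypothetical names, and the real data structures may differ.

```rust
/// Plaintext plus a table mapping character offsets to byte offsets.
/// Hypothetical type for illustration; not the actual DNM struct.
struct Plaintext {
    text: String,
    /// byte_offsets[i] is the byte offset of character i;
    /// the final entry equals text.len().
    byte_offsets: Vec<usize>,
}

impl Plaintext {
    /// Builds the string and the offset table from a vector of characters.
    fn from_chars(chars: &[char]) -> Self {
        let mut text = String::with_capacity(chars.len());
        let mut byte_offsets = Vec::with_capacity(chars.len() + 1);
        for &c in chars {
            byte_offsets.push(text.len());
            text.push(c);
        }
        byte_offsets.push(text.len());
        Plaintext { text, byte_offsets }
    }

    /// Substring by character offsets, using two table lookups.
    fn substring(&self, start: usize, end: usize) -> Option<&str> {
        let s = *self.byte_offsets.get(start)?;
        let e = *self.byte_offsets.get(end)?;
        self.text.get(s..e)
    }

    /// Converts a byte offset (e.g. from an external tool) to a character
    /// offset. Returns None if the byte offset is not on a character boundary.
    fn char_offset_of_byte(&self, byte: usize) -> Option<usize> {
        // The table is strictly increasing, so binary search is valid.
        self.byte_offsets.binary_search(&byte).ok()
    }
}

fn main() {
    // Greek alpha, '+', beta, '=', gamma: 5 characters, 8 bytes.
    let chars: Vec<char> = "\u{3b1}+\u{3b2}=\u{3b3}".chars().collect();
    let p = Plaintext::from_chars(&chars);
    assert_eq!(p.byte_offsets, vec![0, 2, 3, 5, 6, 8]);
    assert_eq!(p.substring(2, 5), Some("\u{3b2}=\u{3b3}"));
    assert_eq!(p.char_offset_of_byte(3), Some(2));
    assert_eq!(p.char_offset_of_byte(1), None); // inside the first character
}
```

The table costs one `usize` per character, in exchange for substring access that does not need to rescan the text.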
Overview
I added serialization/deserialization of DNMRanges/XPointers in the way that is also planned for KAT. Just as a short reminder: Tom suggested using `arange` and `string-index`. In the test case, the following XPointer is generated: `arange(string-index(//body[1]/text()[4],10),string-index(//body[1]/text()[4],13))`
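For readers unfamiliar with the format, here is a rough sketch of producing such a string and parsing it back with a deliberately strict parser. The type `StringIndex` and the free functions `serialize` and `deserialize` are hypothetical stand-ins for this example and are not the signatures of the methods added in this PR.

```rust
/// Hypothetical stand-in for one end of a range: an XPath to a text node
/// plus an index into that node's string.
#[derive(Debug, PartialEq)]
struct StringIndex {
    xpath: String,
    index: usize,
}

/// Produces "arange(string-index(<xpath>,<n>),string-index(<xpath>,<m>))".
fn serialize(start: &StringIndex, end: &StringIndex) -> String {
    format!(
        "arange(string-index({},{}),string-index({},{}))",
        start.xpath, start.index, end.xpath, end.index
    )
}

/// Parses a single "string-index(<xpath>,<n>)" term.
fn parse_string_index(s: &str) -> Option<StringIndex> {
    let inner = s.strip_prefix("string-index(")?.strip_suffix(')')?;
    // The index follows the last comma; the XPath itself may contain commas.
    let (xpath, index) = inner.rsplit_once(',')?;
    Some(StringIndex {
        xpath: xpath.to_string(),
        index: index.parse().ok()?,
    })
}

/// Strict parser: accepts only the exact layout that `serialize` emits.
fn deserialize(s: &str) -> Option<(StringIndex, StringIndex)> {
    let inner = s.strip_prefix("arange(")?.strip_suffix(')')?;
    // Split at the comma between the two string-index terms.
    let mid = inner.find("),string-index(")? + 1;
    let start = parse_string_index(&inner[..mid])?;
    let end = parse_string_index(&inner[mid + 1..])?;
    Some((start, end))
}

fn main() {
    let start = StringIndex { xpath: "//body[1]/text()[4]".to_string(), index: 10 };
    let end = StringIndex { xpath: "//body[1]/text()[4]".to_string(), index: 13 };
    let xpointer = serialize(&start, &end);
    assert_eq!(
        xpointer,
        "arange(string-index(//body[1]/text()[4],10),string-index(//body[1]/text()[4],13))"
    );
    assert_eq!(deserialize(&xpointer), Some((start, end)));
    // Extra whitespace is rejected by this strict parser.
    assert_eq!(deserialize("arange( string-index(//a,1),string-index(//a,2))"), None);
}
```

A parser written this way only round-trips its own output, so any variation in spacing is rejected.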
For deserialization, unfortunately, only XPointers generated by the serialization code are supported. So, for example, adding whitespace will probably break the code. This should be improved in the future.
List of Changes
- Removed the `DNMParameters::move_whitespaces_between_nodes` option. It was already incompatible with many settings (and therefore never used), and it would cause more problems with these changes. It is much better to use `DNMRange::trim`, which serves a similar purpose.
- Added the `DNMParameters::support_back_mapping` option. It is currently set to false by default. It populates `DNM::back_map` during the generation: for each character, the node and the offset inside the node are stored in this vector. Currently, this option is incompatible with word stemming, because word stemming changes the offsets in a non-obvious way. It would be a nice project to support it for word stemming as well.
- Added `DNMRange::serialize` and `DNMRange::deserialize`.
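As a rough picture of what a back map holds, the following sketch records, for every plaintext character, the node it came from and its offset inside that node. It is an assumption-laden illustration: `NodeId`, `build_back_map`, and the choice of character (rather than byte) offsets within a node are hypothetical and may not match `DNM::back_map`.

```rust
/// Hypothetical node handle; the real code would refer to a DOM node.
type NodeId = usize;

/// Concatenates the text of all nodes and records, for each character,
/// the node it came from and its character offset inside that node.
fn build_back_map(nodes: &[(NodeId, &str)]) -> (Vec<char>, Vec<(NodeId, usize)>) {
    let mut chars = Vec::new();
    let mut back_map = Vec::new();
    for &(node, content) in nodes {
        for (offset, c) in content.chars().enumerate() {
            chars.push(c);
            back_map.push((node, offset));
        }
    }
    (chars, back_map)
}

fn main() {
    // Two text nodes; the second starts with a non-ASCII character.
    let nodes = [(0, "ab"), (1, "\u{e9}c")];
    let (chars, back_map) = build_back_map(&nodes);
    assert_eq!(chars.len(), back_map.len());
    // Plaintext character 2 is the first character of node 1.
    assert_eq!(back_map[2], (1, 0));
    // Plaintext character 3 is the second character of node 1.
    assert_eq!(back_map[3], (1, 1));
}
```

Because the map is indexed by plaintext character, any step that rewrites the plaintext after generation (such as stemming) would have to update the map as well.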
Issues