dkpro / dkpro-jwpl

DKPro JWPL (DKPro Java Wikipedia Library) is a free, Java-based application programming interface that facilitates access to all information in Wikipedia.
https://dkpro.github.io/dkpro-jwpl
Apache License 2.0

Handling of surrogate characters in Revisionmachine #282

Open rzo1 opened 10 months ago

rzo1 commented 10 months ago

Copy-pasted from the README.

There are 4 possible modes of handling UTF8 surrogate characters.
Currently, the only reliable mode is "Discard Revision", in which any revision that contains surrogate characters is discarded.
The other three modes in "org.dkpro.jwpl.revisionmachine.difftool.data.SurrogateModes" have been disabled for now.
The corresponding config section in the config tool has also been made invisible (org.dkpro.jwpl.revisionmachine.difftool.config.gui.panels.InputPanel).
The disabled parts are marked with TODO markers.

In order to use the other three surrogate modes, which try to handle surrogate characters differently,
the corresponding code has to be checked. Afterwards, the modes can be re-enabled in the config tool (InputPanel.java) and in the SurrogateModes class.
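
For illustration only, here is a minimal sketch of what a "Discard Revision" style check could look like. It is not the Revisionmachine implementation: the class and method names (SurrogateFilterSketch, containsSurrogate, discardRevisionsWithSurrogates) are hypothetical, and it assumes revision texts are available as plain Java strings. It relies only on the standard `Character.isSurrogate` method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JWPL: shows the idea of discarding
// any revision whose text contains surrogate characters.
final class SurrogateFilterSketch {

    // Returns true if the text contains at least one surrogate code unit
    // (high or low), i.e. a char in the range U+D800..U+DFFF.
    static boolean containsSurrogate(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (Character.isSurrogate(text.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the revision texts without surrogates; every other
    // revision is dropped as a whole, mirroring the "discard" idea.
    static List<String> discardRevisionsWithSurrogates(List<String> revisionTexts) {
        List<String> kept = new ArrayList<>();
        for (String text : revisionTexts) {
            if (!containsSurrogate(text)) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The second entry contains U+1F600, which Java stores as a surrogate pair.
        List<String> revisions = Arrays.asList("plain text", "emoji \uD83D\uDE00 here", "caf\u00e9");
        System.out.println(discardRevisionsWithSurrogates(revisions).size() + " of "
                + revisions.size() + " revisions kept");
    }
}
```

Dropping the whole revision is the simplest policy because it never has to decide how to repair or replace an individual character, which is presumably what the three disabled modes would need to get right.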