Doccano tokenizer is inconsistent with udpipe tokenizer

cregouby / RGPD_facile_avec_R

R User Group Toulouse conference

Apache License 2.0


Doccano tokenizer is inconsistent with udpipe tokenizer #1

Open cregouby opened 4 years ago

cregouby commented 4 years ago

...so annotation reconciliation with tokens fails

jwijffels commented 4 years ago

Hi @cregouby, I also just set up a doccano server to annotate some examples, and I'm also noticing differences in the output of the text spans. Do you know what causes doccano to provide wrong start_offset and end_offset values for the spans?

cregouby commented 4 years ago

Hi @jwijffels, I suspect a difference in the treatment of special UTF-8 space characters, as the text here (raw output of tika) can be very noisy compared to polished examples. I notice that the latest version of doccano did reduce the extent of the issue. This requires further investigation. I must admit I don't have a reprex here.
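To make that suspicion easier to check, here is a minimal Python sketch that locates non-ASCII space characters in a text and shows how its length differs depending on the unit counted. The sample string and the helper names `find_special_spaces` and `length_report` are made up for illustration; nothing here is taken from doccano or udpipe, and it is not confirmed that this is the actual cause.

```python
import unicodedata


def find_special_spaces(text):
    """Return (index, Unicode name) for every space separator other than U+0020."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Zs" and ch != " "
    ]


def length_report(text):
    """Length of the same text measured in three different units."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Hypothetical sample with a no-break space and a thin space.
sample = "RGPD\u00a0facile\u2009avec R"
print(find_special_spaces(sample))
print(length_report(sample))
```

If two tools count offsets in different units, or one of them normalizes or drops such spaces before tokenizing, every span after the first special space would be shifted.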

jwijffels commented 4 years ago

In my data, even an ASCII text gives a start value of 0 and an end value equal to nchar(text). That means there is already one character too many. I think it is related to the JavaScript slice function as well, in addition to how special characters are displayed in HTML.
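A minimal sketch of that reading, assuming (unconfirmed here) that the exported offsets are 0-based with an exclusive end, like the arguments of JavaScript's slice. Under that assumption a span from 0 to nchar(text) covers exactly the whole text, and converting to 1-based inclusive positions only requires shifting the start. The helper names are hypothetical.

```python
def to_one_based_inclusive(start_offset, end_offset):
    """Convert a 0-based, end-exclusive span to 1-based, end-inclusive positions.

    Assumption: the input follows slice-like semantics [start_offset, end_offset).
    """
    return start_offset + 1, end_offset


def substr_one_based(text, start, stop):
    """Extract characters start..stop, both 1-based and inclusive."""
    return text[start - 1:stop]


text = "Hello world"
start_offset, end_offset = 0, len(text)  # the 0 .. nchar(text) case from the comment

# With slice-like semantics the span is the full text, not one character too long.
assert text[start_offset:end_offset] == text

start, stop = to_one_based_inclusive(start_offset, end_offset)
assert substr_one_based(text, start, stop) == text
```

If the spans still disagree with the tokens after this shift, the remaining difference would have to come from somewhere else, such as the character handling.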