Too strong filtering of Tokens in Leaf Nodes

AlexTWeb commented 2 years ago

Dear KorAP team,

First of all, many thanks for having put so much time and effort into these very useful corpus tools!

I've just spotted an issue when displaying the tokens of non-Latin documents (e.g. Chinese, Russian, ...). The culprit seems to be this very restrictive regex: Table.js L187

When I widen that regex to filter out only whitespace, /[^\s]/, the display of non-Latin documents seems to improve, but this might have other side effects I'm not aware of.
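To make the difference concrete, here is a minimal sketch that runs both patterns over a few text-node values. The sample strings are hypothetical and chosen only for illustration; they are not taken from an actual KorAP corpus.

```javascript
// Original filter (as quoted in this report): ASCII letters, digits, and U+25AE only.
const strict = /[a-z0-9\u25ae]/iu;
// Widened filter (as proposed in this report): any non-whitespace character.
const wide = /[^\s]/iu;

// Hypothetical text-node values, for illustration only.
const samples = ["Haus", "2", "дом", "中文", ",", " ", "\n  "];

// Record which filter would keep each value as a token.
const results = samples.map((text) => ({
  text,
  strict: strict.test(text),
  wide: wide.test(text),
}));

for (const r of results) {
  console.log(JSON.stringify(r.text), "strict:", r.strict, "wide:", r.wide);
}
```

With these samples, the Cyrillic and Chinese strings are dropped by the original pattern and kept by the widened one. The punctuation-only value "," is also kept by the widened pattern, which is one example of the kind of side effect that may need checking.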

If you have time to check and improve this, it would be much appreciated!

Thanks & best, Alex

Akron commented 2 years ago

Thank you very much for this feedback! Could you provide us with a screenshot exemplifying this issue?

AlexTWeb commented 2 years ago

Thanks @Akron for looking into this.

This is a screenshot with the default table.js. As you can see, only the number 2 is displayed. I'm using the docker-compose file from https://github.com/KorAP/KorAP-Docker, so Kalamar/latest-conv.

When applying the following patch to table.js, the display gets better:

--- table.js.org    2022-06-13 14:52:07.704926613 +0000
+++ table.js    2022-06-13 14:52:31.032934908 +0000
@@ -184,7 +184,7 @@
         // Leaf node
         // store string on position and go to next string
         else if (c.nodeType === 3) {
-          if (c.nodeValue.match(/[a-z0-9\u25ae]/iu)) {
+          if (c.nodeValue.match(/[^\s]/iu)) {
             t._mark[t._pos] = mark ? true : false;
             t._token[t._pos++] = c.nodeValue;
           };

(The patch is against a slightly older, non-Docker version, since table.js is faster to change there.)
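A possible middle ground, as a sketch only and not something confirmed by the Kalamar code: Unicode property escapes could keep the original intent (letters, digits, and U+25AE) while covering other scripts. The helper name isTokenText is hypothetical and does not exist in table.js.

```javascript
// Assumption: the original check is meant to keep text nodes that contain
// at least one letter or digit (plus the U+25AE marker) and to skip the rest.
// \p{L} matches letters and \p{N} matches numbers in any script; both
// require the "u" flag.
const tokenPattern = /[\p{L}\p{N}\u25ae]/u;

// Hypothetical helper, not part of table.js.
function isTokenText(nodeValue) {
  return tokenPattern.test(nodeValue);
}

console.log(isTokenText("дом"));    // Cyrillic letters
console.log(isTokenText("中文"));   // Chinese characters
console.log(isTokenText(","));      // punctuation only
console.log(isTokenText(" \n"));    // whitespace only
```

Compared with /[^\s]/, this variant would still skip nodes that consist only of punctuation. Whether that is the desired behaviour depends on how the match table is expected to treat punctuation tokens, which I can't judge.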

Akron commented 2 years ago

Thank you very much! This is very helpful to prepare a test for this case! We'll look into this!

KorAP / Kalamar

Too strong filtering of Tokens in Leaf Nodes #168