Thank you very much for this feedback! Could you provide us with a screenshot exemplifying this issue?
Thanks @Akron for looking into this.
This is a screenshot with the default table.js; as you can see, only the number 2 is displayed. It uses the docker-compose file from https://github.com/KorAP/KorAP-Docker, so Kalamar/latest-conv.
When applying the following patch to table.js, the display gets better:
--- table.js.org 2022-06-13 14:52:07.704926613 +0000
+++ table.js 2022-06-13 14:52:31.032934908 +0000
@@ -184,7 +184,7 @@
// Leaf node
// store string on position and go to next string
else if (c.nodeType === 3) {
- if (c.nodeValue.match(/[a-z0-9\u25ae]/iu)) {
+ if (c.nodeValue.match(/[^\s]/iu)) {
t._mark[t._pos] = mark ? true : false;
t._token[t._pos++] = c.nodeValue;
};
(The patch was made against a slightly older, non-Docker-based version, because table.js is quicker to change there.)
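For anyone preparing a test case, here is a minimal sketch comparing the two regexes from the patch above. The regexes are copied from the diff; the sample strings and the `results` structure are illustrative assumptions, not part of table.js.

```javascript
// Regexes copied from the patch; everything else is a hypothetical test harness.
const original = /[a-z0-9\u25ae]/iu; // current check in table.js
const widened = /[^\s]/iu;           // proposed check: any non-whitespace character

// Illustrative text-node values: Latin, digit, Chinese, Russian,
// whitespace only, punctuation only.
const samples = ["Haus", "2", "你好", "Привет", " \n", "…"];

const results = samples.map((text) => ({
  text,
  original: original.test(text),
  widened: widened.test(text),
}));

console.table(results);
```

With these samples, the original regex accepts only the Latin word and the digit, while the widened one also accepts the Chinese and Russian tokens. The widened regex also accepts the punctuation-only string, which may be one of the side effects worth checking.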
Thank you very much! This is very helpful to prepare a test for this case! We'll look into this!
Dear KorAP team,
First of all, many thanks for having put so much time and effort into these very useful corpus tools!
I've just spotted an issue when displaying the tokens of non-Latin documents (i.e. Chinese, Russian, ...). The culprit seems to be this very restrictive regex: Table.js L187
When that regex is widened to filter out only whitespace, using
/[^\s]/
instead, the situation with non-Latin documents seems to improve, but this might have other side effects that I'm not aware of. If you have time to check and improve this, it would be much appreciated!
Thanks & best, Alex