Yomguithereal / talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
https://yomguithereal.github.io/talisman/
MIT License
704 stars 47 forks source link

Some characters trigger infinite loop in Eudex phonetic algorithm #187

Open andrewkesper opened 1 month ago

andrewkesper commented 1 month ago

Thank you for releasing such a fantastic library. 😄

In Talisman 1.1.4, the Eudex phonetic algorithm seems to get stuck in an infinite loop when hashing a word containing characters such as [, ] and _.

I discovered this when I hit the [ key by accident whilst experimenting with the Eudex algorithm in the documentation.

To test this further, I created a Node.js script that hashed words containing every printable character code up to 0xFF. Each word was hashed in a worker thread which was terminated if there was no response after 500ms.

Node.js script

const { Worker, isMainThread, parentPort, workerData } = require('node:worker_threads');
const eudex = require('talisman/phonetics/eudex');
if (isMainThread) {
  (async () => {
    const charCodeRanges = [ [0x20, 0x7E], [0xA0, 0xFF] ];
    for (const [minCharCode, maxCharCode] of charCodeRanges) {
      for (let charCode=minCharCode; charCode<=maxCharCode; charCode++) {
        await new Promise(resolve => {
          const test = 'word'+String.fromCharCode(charCode);
          const worker = new Worker(__filename, { workerData: test });
          worker.postMessage(test);
          const timeout = setTimeout(() => {
            worker.terminate();
            console.error('timeout', charCode.toString(16), test);
            resolve();
          }, 500);
          worker.on('message', message => {
            clearTimeout(timeout);
            console.log('ok', charCode.toString(16), ...message);
            resolve();
          });
        });
      }
    }
  })();
}
else { // worker thread
  parentPort.postMessage([workerData, eudex(workerData).toString()]);
}

Output

ok 20 word  41240
ok 21 word! 41240
ok 22 word" 41240
ok 23 word# 41240
ok 24 word$ 41240
ok 25 word% 41240
ok 26 word& 41240
ok 27 word' 41240
ok 28 word( 41240
ok 29 word) 41240
ok 2a word* 41240
ok 2b word+ 41240
ok 2c word, 41240
ok 2d word- 41240
ok 2e word. 41240
ok 2f word/ 41240
ok 30 word0 41240
ok 31 word1 41240
ok 32 word2 41240
ok 33 word3 41240
ok 34 word4 41240
ok 35 word5 41240
ok 36 word6 41240
ok 37 word7 41240
ok 38 word8 41240
ok 39 word9 41240
ok 3a word: 41240
ok 3b word; 41240
ok 3c word< 41240
ok 3d word= 41240
ok 3e word> 41240
ok 3f word? 41240
ok 40 word@ 41240
ok 41 wordA 10557440
ok 42 wordB 10557512
ok 43 wordC 10557452
ok 44 wordD 41240
ok 45 wordE 10557440
ok 46 wordF 10557508
ok 47 wordG 10557448
ok 48 wordH 10557444
ok 49 wordI 10557441
ok 4a wordJ 10557445
ok 4b wordK 10557449
ok 4c wordL 10557600
ok 4d wordM 10557442
ok 4e wordN 10557458
ok 4f wordO 10557440
ok 50 wordP 10557513
ok 51 wordQ 10557608
ok 52 wordR 10557601
ok 53 wordS 10557460
ok 54 wordT 10557469
ok 55 wordU 10557441
ok 56 wordV 10557509
ok 57 wordW 10557440
ok 58 wordX 10557572
ok 59 wordY 10557441
ok 5a wordZ 10557588
timeout 5b word[
timeout 5c word\
timeout 5d word]
timeout 5e word^
timeout 5f word_
ok 60 word` 41240
ok 61 worda 10557440
ok 62 wordb 10557512
ok 63 wordc 10557452
ok 64 wordd 41240
ok 65 worde 10557440
ok 66 wordf 10557508
ok 67 wordg 10557448
ok 68 wordh 10557444
ok 69 wordi 10557441
ok 6a wordj 10557445
ok 6b wordk 10557449
ok 6c wordl 10557600
ok 6d wordm 10557442
ok 6e wordn 10557458
ok 6f wordo 10557440
ok 70 wordp 10557513
ok 71 wordq 10557608
ok 72 wordr 10557601
ok 73 words 10557460
ok 74 wordt 10557469
ok 75 wordu 10557441
ok 76 wordv 10557509
ok 77 wordw 10557440
ok 78 wordx 10557572
ok 79 wordy 10557441
ok 7a wordz 10557588
timeout 7b word{
timeout 7c word|
timeout 7d word}
timeout 7e word~
timeout a0 word 
timeout a1 word¡
timeout a2 word¢
timeout a3 word£
timeout a4 word¤
timeout a5 word¥
timeout a6 word¦
timeout a7 word§
timeout a8 word¨
timeout a9 word©
timeout aa wordª
timeout ab word«
timeout ac word¬
timeout ad word­
timeout ae word®
timeout af word¯
timeout b0 word°
timeout b1 word±
timeout b2 word²
timeout b3 word³
timeout b4 word´
timeout b5 wordµ
timeout b6 word¶
timeout b7 word·
timeout b8 word¸
timeout b9 word¹
timeout ba wordº
timeout bb word»
timeout bc word¼
timeout bd word½
timeout be word¾
timeout bf word¿
ok c0 wordÀ 41240
ok c1 wordÁ 41240
ok c2 word 41240
ok c3 wordà 41240
ok c4 wordÄ 41240
ok c5 wordÅ 41240
ok c6 wordÆ 41240
ok c7 wordÇ 41240
ok c8 wordÈ 41240
ok c9 wordÉ 41240
ok ca wordÊ 41240
ok cb wordË 41240
ok cc wordÌ 41240
ok cd wordÍ 41240
ok ce wordÎ 41240
ok cf wordÏ 41240
ok d0 wordÐ 41240
ok d1 wordÑ 41240
ok d2 wordÒ 41240
ok d3 wordÓ 41240
ok d4 wordÔ 41240
ok d5 wordÕ 41240
ok d6 wordÖ 41240
ok d7 word× 41240
ok d8 wordØ 41240
ok d9 wordÙ 41240
ok da wordÚ 41240
ok db wordÛ 41240
ok dc wordÜ 41240
ok dd wordÝ 41240
ok de wordÞ 41240
ok df wordß 41240
ok e0 wordà 41240
ok e1 wordá 41240
ok e2 wordâ 41240
ok e3 wordã 41240
ok e4 wordä 41240
ok e5 wordå 41240
ok e6 wordæ 41240
ok e7 wordç 41240
ok e8 wordè 41240
ok e9 wordé 41240
ok ea wordê 41240
ok eb wordë 41240
ok ec wordì 41240
ok ed wordí 41240
ok ee wordî 41240
ok ef wordï 41240
ok f0 wordð 41240
ok f1 wordñ 41240
ok f2 wordò 41240
ok f3 wordó 41240
ok f4 wordô 41240
ok f5 wordõ 41240
ok f6 wordö 41240
ok f7 word÷ 41240
ok f8 wordø 41240
ok f9 wordù 41240
ok fa wordú 41240
ok fb wordû 41240
ok fc wordü 41240
ok fd wordý 41240
ok fe wordþ 41240
ok ff wordÿ 41240
Yomguithereal commented 1 week ago

Thanks @andrewkesper, I'll try to fix this when I can.