Open andrewkesper opened 1 month ago
Thank you for releasing such a fantastic library. 😄
In Talisman 1.1.4, the Eudex phonetic algorithm seems to get stuck in an infinite loop when hashing a word containing characters such as [, ] and _.
I discovered this when I hit the [ key by accident whilst experimenting with the Eudex algorithm in the documentation.
To test this further, I created a Node.js script that hashed words containing every printable character code up to 0xFF. Each word was hashed in a worker thread which was terminated if there was no response after 500ms.
0xFF
const { Worker, isMainThread, parentPort, workerData } = require('node:worker_threads'); const eudex = require('talisman/phonetics/eudex'); if (isMainThread) { (async () => { const charCodeRanges = [ [0x20, 0x7E], [0xA0, 0xFF] ]; for (const [minCharCode, maxCharCode] of charCodeRanges) { for (let charCode=minCharCode; charCode<=maxCharCode; charCode++) { await new Promise(resolve => { const test = 'word'+String.fromCharCode(charCode); const worker = new Worker(__filename, { workerData: test }); worker.postMessage(test); const timeout = setTimeout(() => { worker.terminate(); console.error('timeout', charCode.toString(16), test); resolve(); }, 500); worker.on('message', message => { clearTimeout(timeout); console.log('ok', charCode.toString(16), ...message); resolve(); }); }); } } })(); } else { // worker thread parentPort.postMessage([workerData, eudex(workerData).toString()]); }
ok 20 word 41240 ok 21 word! 41240 ok 22 word" 41240 ok 23 word# 41240 ok 24 word$ 41240 ok 25 word% 41240 ok 26 word& 41240 ok 27 word' 41240 ok 28 word( 41240 ok 29 word) 41240 ok 2a word* 41240 ok 2b word+ 41240 ok 2c word, 41240 ok 2d word- 41240 ok 2e word. 41240 ok 2f word/ 41240 ok 30 word0 41240 ok 31 word1 41240 ok 32 word2 41240 ok 33 word3 41240 ok 34 word4 41240 ok 35 word5 41240 ok 36 word6 41240 ok 37 word7 41240 ok 38 word8 41240 ok 39 word9 41240 ok 3a word: 41240 ok 3b word; 41240 ok 3c word< 41240 ok 3d word= 41240 ok 3e word> 41240 ok 3f word? 41240 ok 40 word@ 41240 ok 41 wordA 10557440 ok 42 wordB 10557512 ok 43 wordC 10557452 ok 44 wordD 41240 ok 45 wordE 10557440 ok 46 wordF 10557508 ok 47 wordG 10557448 ok 48 wordH 10557444 ok 49 wordI 10557441 ok 4a wordJ 10557445 ok 4b wordK 10557449 ok 4c wordL 10557600 ok 4d wordM 10557442 ok 4e wordN 10557458 ok 4f wordO 10557440 ok 50 wordP 10557513 ok 51 wordQ 10557608 ok 52 wordR 10557601 ok 53 wordS 10557460 ok 54 wordT 10557469 ok 55 wordU 10557441 ok 56 wordV 10557509 ok 57 wordW 10557440 ok 58 wordX 10557572 ok 59 wordY 10557441 ok 5a wordZ 10557588 timeout 5b word[ timeout 5c word\ timeout 5d word] timeout 5e word^ timeout 5f word_ ok 60 word` 41240 ok 61 worda 10557440 ok 62 wordb 10557512 ok 63 wordc 10557452 ok 64 wordd 41240 ok 65 worde 10557440 ok 66 wordf 10557508 ok 67 wordg 10557448 ok 68 wordh 10557444 ok 69 wordi 10557441 ok 6a wordj 10557445 ok 6b wordk 10557449 ok 6c wordl 10557600 ok 6d wordm 10557442 ok 6e wordn 10557458 ok 6f wordo 10557440 ok 70 wordp 10557513 ok 71 wordq 10557608 ok 72 wordr 10557601 ok 73 words 10557460 ok 74 wordt 10557469 ok 75 wordu 10557441 ok 76 wordv 10557509 ok 77 wordw 10557440 ok 78 wordx 10557572 ok 79 wordy 10557441 ok 7a wordz 10557588 timeout 7b word{ timeout 7c word| timeout 7d word} timeout 7e word~ timeout a0 word timeout a1 word¡ timeout a2 word¢ timeout a3 word£ timeout a4 word¤ timeout a5 word¥ timeout a6 word¦ timeout a7 word§ timeout a8 word¨ timeout a9 word© timeout aa wordª timeout ab word« timeout ac word¬ timeout ad word timeout ae word® timeout af word¯ timeout b0 word° timeout b1 word± timeout b2 word² timeout b3 word³ timeout b4 word´ timeout b5 wordµ timeout b6 word¶ timeout b7 word· timeout b8 word¸ timeout b9 word¹ timeout ba wordº timeout bb word» timeout bc word¼ timeout bd word½ timeout be word¾ timeout bf word¿ ok c0 wordÀ 41240 ok c1 wordÁ 41240 ok c2 word 41240 ok c3 wordà 41240 ok c4 wordÄ 41240 ok c5 wordÅ 41240 ok c6 wordÆ 41240 ok c7 wordÇ 41240 ok c8 wordÈ 41240 ok c9 wordÉ 41240 ok ca wordÊ 41240 ok cb wordË 41240 ok cc wordÌ 41240 ok cd wordÍ 41240 ok ce wordÎ 41240 ok cf wordÏ 41240 ok d0 wordÐ 41240 ok d1 wordÑ 41240 ok d2 wordÒ 41240 ok d3 wordÓ 41240 ok d4 wordÔ 41240 ok d5 wordÕ 41240 ok d6 wordÖ 41240 ok d7 word× 41240 ok d8 wordØ 41240 ok d9 wordÙ 41240 ok da wordÚ 41240 ok db wordÛ 41240 ok dc wordÜ 41240 ok dd wordÝ 41240 ok de wordÞ 41240 ok df wordß 41240 ok e0 wordà 41240 ok e1 wordá 41240 ok e2 wordâ 41240 ok e3 wordã 41240 ok e4 wordä 41240 ok e5 wordå 41240 ok e6 wordæ 41240 ok e7 wordç 41240 ok e8 wordè 41240 ok e9 wordé 41240 ok ea wordê 41240 ok eb wordë 41240 ok ec wordì 41240 ok ed wordí 41240 ok ee wordî 41240 ok ef wordï 41240 ok f0 wordð 41240 ok f1 wordñ 41240 ok f2 wordò 41240 ok f3 wordó 41240 ok f4 wordô 41240 ok f5 wordõ 41240 ok f6 wordö 41240 ok f7 word÷ 41240 ok f8 wordø 41240 ok f9 wordù 41240 ok fa wordú 41240 ok fb wordû 41240 ok fc wordü 41240 ok fd wordý 41240 ok fe wordþ 41240 ok ff wordÿ 41240
Thanks @andrewkesper, I'll try to fix this when I can.
Thank you for releasing such a fantastic library. 😄
In Talisman 1.1.4, the Eudex phonetic algorithm seems to get stuck in an infinite loop when hashing a word containing characters such as [, ] and _.
I discovered this when I hit the [ key by accident whilst experimenting with the Eudex algorithm in the documentation.
To test this further, I created a Node.js script that hashed words containing every printable character code up to
0xFF
. Each word was hashed in a worker thread which was terminated if there was no response after 500ms.Node.js script
Output