arunsupe / semantic-grep

grep for words with similar meaning to the query
MIT License

Not working with Chinese model cc.zh.300.bin (from cc.zh.300.vec.gz) on Windows 10 #14

Closed AnabasisXu closed 4 months ago

AnabasisXu commented 4 months ago

First of all, thank you so much for multi language support.

I followed the instructions by doing four things. No errors were reported, but it did not work with Chinese.

I prepared a UTF-8 encoded file, 1.txt, and confirmed that a plain grep command finds the keyword in it.

Interestingly, the Chinese model works for the English test.

  1. Build fasttext-to-bin.go
❯ go build fasttext-to-bin.go
  1. Conversion to cc.zh.300.bin
❯ gunzip -c cc.zh.300.vec.gz | ./fasttext-to-bin -input - -output ./cc.zh.300.bin
Conversion complete. Word2Vec binary model saved as ./cc.zh.300.bin
  1. Change the config.json
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 53s
❯ cat "G:\CS\cml\semantic-grep\config.json"
───────┬────────────────────────────────────────────────────────
       │ File: G:\CS\cml\semantic-grep\config.json
───────┼────────────────────────────────────────────────────────
   1   │ {
   2   │     "model_path": "G:/CS/cml/semantic-grep/cc.zh.300.bin"
   3   │ }
───────┴────────────────────────────────────────────────────────
  1. Tests

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ ./sgrep  -C 2 -n -threshold 0.55 '合理性' 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json
❯ grep '合理性' 1.txt
"ๆ›ดๅคš็š„่ฏๆฎ "็š„ไธ€็งๅฏ่ƒฝๆ€งๆ˜ฏๅฏนๅˆ็†ๆผ”็ปŽๆŽจๆ–ญ็š„ไฟ็œŸๅฑžๆ€ง็š„ไธ€็ง็ฑปๆŽจ๏ผšๆญฃๅฆ‚ๅœจ็œŸๅฎž็š„ๅ‰ๆไธ‹่ฟ›่กŒๅˆ็†ๆผ”็ปŽๆœฌ่บซไนŸๅฏไปฅ็กฎไฟไธบ็œŸไธ€ๆ ท๏ผŒไปŽ็œŸๅฎž็š„่ง‚ๅฏŸ่ฟ›่กŒ็š„ๅˆ็†ๅฝ’็บณไนŸๅบ”่ฏฅๆ˜ฏ็œŸๅฎž็š„๏ผŒ่‡ณๅฐ‘ๅœจ่ฏๆฎไธๆ–ญๅขžๅŠ ็š„้™ๅˆถไธ‹ๅบ”่ฏฅๅฆ‚ๆญคใ€‚็„ถ่€Œ๏ผŒ่ฟ™ๅชๆ˜ฏๅฏนๆˆ‘ไปฌ็š„ๆŽจๆ–ญ็จ‹ๅบๆ˜ฏๅ…ทๆœ‰่ฟž็ปญๆ€ง็š„ๅŸบๆœฌ่ฆๆฑ‚ใ€‚ๅฆ‚ไธŠๆ–‡ๆ‰€่ฟฐ๏ผŒไฝฟ็”จ่ดๅถๆ–ฏๆณ•ๅˆ™ๅนถไธๆ˜ฏ็กฎไฟไธ€่‡ดๆ€ง็š„ๅ……ๅˆ†ๆกไปถ๏ผŒไนŸไธๆ˜ฏๅฟ…่ฆๆกไปถใ€‚ไบ‹ๅฎžไธŠ๏ผŒๆˆ‘ไปฌๆ‰€็Ÿฅ้“็š„ๆฏไธ€ไธชๅ…ณไบŽ่ดๅถๆ–ฏไธ€่‡ดๆ€ง็š„่ฏๆ˜Ž๏ผŒ่ฆไนˆๆ˜ฏๅ‡่ฎพๅฏนๅŒไธ€้—ฎ้ข˜ๆœ‰ไธ€ไธชๅ…ทๆœ‰ไธ€่‡ดๆ€ง็š„้ž่ดๅถๆ–ฏ็จ‹ๅบ๏ผŒ่ฆไนˆๆ˜ฏๅšไบ†ๅ…ถไป–็š„ๅ‡่ฎพ่€Œ่ฟ™ไบ›ๅ‡่ฎพไธญๅŒ…ๅซไบ†่ฟ™ๆ ทไธ€ไธชๅ…ทๆœ‰ไธ€่‡ดๆ€ง็š„้ž่ดๅถๆ–ฏ็จ‹ๅบ็š„ๅญ˜ๅœจใ€‚ๅœจไปปไฝ•ๆƒ…ๅ†ตไธ‹๏ผŒๅปบ็ซ‹ไบ†็ปŸ่ฎก็จ‹ๅบไธ€่‡ดๆ€ง็š„ๅฎš็†้ƒฝไผš็กฎไฟ่ฟ™ไบ›็จ‹ๅบ็š„ๆผ”็ปŽๅˆ็†ๆ€ง
  1. Test the Chinese model by running sgrep for the word glory in hm.txt (The Old Man and the Sea)
❯ ./sgrep  -C 2 -n -threshold 0.55 glory hm.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json
Similarity: 1.0000
1463:not know he was so big."
1464:
1465:"I'll kill him though," he said.  "In all his greatness and his glory."
1466:
1467:Although it is unjust, he thought.  But I will show him what a man can
--
arunsupe commented 4 months ago

Thanks for trying out w2vgrep (previously sgrep).

For your issue, sounds like the program is using the wrong model. Could you please try the following:

  1. Clone the latest repo (I fixed a few issues today)
  2. Double-check that fasttext-to-bin is producing the model correctly. The MD5 checksum I get for the processed model (md5sum cc.zh.300.bin) is 67af1742fa5c1c0fe20cf68aa4447cfb
  3. Try running the program with the model path given on the command line: curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 燒

Please let me know if this fixes the issue. If not, I will have to think harder.

AnabasisXu commented 4 months ago

Thanks! I updated to version 0.6, and have made sure that md5sum is right:

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ md5sum cc.zh.300.bin
67af1742fa5c1c0fe20cf68aa4447cfb *cc.zh.300.bin

The sample command from your instructions worked: curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 燒

However, it only shows results with Similarity: 1.0000. I tried other Chinese characters and the result was the same each time. So is the model cc.zh.300.bin not doing its job?

I also tried to use w2vgrep on local files without curl, but my commands did not work. I expected them to show at least what a grep command would show, since the target word 合理性 is indeed in 1.txt.

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 52s
❯ ./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 合理性 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 55s
❯ cat 1.txt |./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 合理性
Using configuration file: G:\CS\cml\semantic-grep\config.json

1.txt

arunsupe commented 4 months ago

The default threshold is 0.7, which may be too high for your use case. Try lowering it: curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin --threshold=0.3 燒. This finds me 見 (0.4776), 頭 (0.4200), and 鶴 (0.3543). I do not know the language well enough to judge whether these are good matches.

To help troubleshoot the model, I added a synonym-finder.go to ./model_processing_utils/. This program finds the words in the model whose similarity to the query word is at or above a given threshold.

# build
cd model_processing_utils
go build synonym-finder.go

# run
synonym-finder -model_path path/to/cc.zh.300.bin -threshold 0.5 合理性

# Output:
Words similar to '合理性' with similarity >= 0.50:
妥当性 0.5745
周延性 0.5535
客观性 0.5030
可操作性 0.5053
合理性 1.0000
一致性 0.5334
完善性 0.5656
公正性 0.5245
效用性 0.5316
证立 0.5147
自洽性 0.5008
正当性 0.6018
必要性 0.6499
公允性 0.6152
可行性 0.5923
不合理性 0.6094
效益性 0.5529
合法性 0.6219
应然性 0.5709
不合理 0.5412
正當性與 0.5173
正确性 0.5537
合理 0.5808
可接受性 0.5151
科学性 0.6304
论证 0.5379
实证性 0.5216
有效性 0.6374
公平性 0.5250
周密性 0.5292
充分性 0.5156
吻合性 0.5006
恰当性 0.5426
必然性 0.5574
适度性 0.5401
相似性 0.5101
完备性 0.5060
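
A minimal sketch of the kind of scan described above: loop over every word in the model and keep those whose similarity to the query meets the threshold. It assumes cosine similarity as the measure and uses a tiny hand-made three-dimensional model. The names `cosine`, `similarWords`, and `match`, and the toy vectors, are hypothetical and are not taken from the repository.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length vectors.
// Assumption: the real tool may use a different similarity measure.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// match pairs a vocabulary word with its similarity to the query.
type match struct {
	Word string
	Sim  float64
}

// similarWords scans the whole model (word -> vector) and returns the words
// whose similarity to the query is >= threshold, most similar first.
// It returns nil when the query itself is not in the model.
func similarWords(model map[string][]float32, query string, threshold float64) []match {
	qv, ok := model[query]
	if !ok {
		return nil
	}
	var out []match
	for w, v := range model {
		if s := cosine(qv, v); s >= threshold {
			out = append(out, match{w, s})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Sim > out[j].Sim })
	return out
}

func main() {
	// Toy 3-dimensional vectors for illustration only; real fastText
	// vectors have 300 dimensions.
	model := map[string][]float32{
		"合理性": {1, 0, 0},
		"合理":  {0.8, 0.6, 0},
		"必要性": {0.6, 0.8, 0},
		"天气":  {0, 0, 1},
	}
	for _, m := range similarWords(model, "合理性", 0.5) {
		fmt.Printf("%s %.4f\n", m.Word, m.Sim)
	}
}
```

Unlike the listing above, this sketch sorts the results by similarity; the query word itself always appears with a similarity of 1.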

If these don't solve your issue, please get back to me as before.

AnabasisXu commented 4 months ago

Hi arunsupe, thanks for developing synonym-finder! As your reply shows, it finds quite a few synonyms of 合理性 above the 0.5 threshold. I show only the first four results here.

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ ./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性
Words similar to '合理性' with similarity >= 0.50:
周延性 0.5535
完善性 0.5656
周密性 0.5292
合理 0.5808

Yet querying 合理性 against 1.txt with a threshold of 0.5 returns no matches. Lowering the threshold to 0.3 did return a lot of results, but the quality is terrible (basically equivalent to grep 性 1.txt).

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 52s
❯ ./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin  --threshold=0.5 '合理性' 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json

But I have found a workaround: passing the output of synonym-finder to grep. I use awk to extract each synonym and its similarity, run grep for that synonym against 1.txt, and, if grep finds anything, echo the synonym with its similarity. Because the results are long, I show only one match here. I would say the quality of the matches is quite good.

Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 8s
❯ ./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性  | awk 'NR > 1 {print $1, $2}' | while read -r word similarity; do
    if grep --color=auto -q "$word" 1.txt; then
        echo "\"$word\" Similarity: $similarity"
        grep --color=auto "$word" 1.txt
        echo ""
    fi
done
"可操作性" Similarity: 0.5053
็”ฑไบŽๆˆ‘ไปฌๆŠŠๅ…ˆ้ชŒๅˆ†ๅธƒ็œ‹ไฝœๆ˜ฏ่ดๅถๆ–ฏๆจกๅž‹็š„ไธ€ไธชๅฏๆฃ€้ชŒ็š„้ƒจๅˆ†๏ผŒๆˆ‘ไปฌไธ้œ€่ฆๆŒ‰็…งJaynes็š„็†่ฎบไธบๆฏ็งๆƒ…ๅ†ต่ฎพ่ฎกไธ€ไธช็‹ฌ็‰น็š„ใ€ๅฎข่ง‚ๆญฃ็กฎ็š„ๅ…ˆ้ชŒๅˆ†ๅธƒโ€”โ€”ๅ…ณไบŽ่ฟ™็งๅšๆณ•็š„่ฎฐๅฝ•ๅนถไธไปคไบบๆŒฏๅฅ‹(Kass & Wasserman, 1996)๏ผŒๅˆšไธ็”จ่ฏดๅพˆๅคšไฝœ่€…ๅฏนJaynes่ฟ™ไธ€ๅ…ทไฝ“่ง‚็‚นๆŒๆ€€็–‘ๆ€ๅบฆ๏ผˆSeidenfeld, 1979, 1987; Csiszยดar, 1995; Uffink, 1995, 1996)ใ€‚็ฎ€่€Œ่จ€ไน‹๏ผŒๅฏนไบŽ่ดๅถๆ–ฏไธปไน‰่€…ๆฅ่ฏด๏ผŒ"ๆจกๅž‹ "ๆ˜ฏๅ…ˆ้ชŒๅˆ†ๅธƒๅ’Œไผผ็„ถ็š„็ป„ๅˆ๏ผŒๅ…ถไธญๆฏไธ€ไธช้ƒฝไปฃ่กจไบ†็ง‘ๅญฆ็Ÿฅ่ฏ†ใ€ๆ•ฐๅญฆไธŠ็š„ไพฟๅˆฉๅ’Œ่ฎก็ฎ—ไธŠ็š„ๅฏๆ“ไฝœๆ€งไน‹้—ด็š„ๆŸ็งๅฆฅๅใ€‚
่ดๅถๆ–ฏ้žๅ‚ๆ•ฐๅŒ–ๆจกๅž‹ไธญ็š„ไธ็กฎๅฎšๆ€ง่กจ็คบๆ–นๅผๆ˜ฏไธ€ไธชๆŠ€ๆœฏ่ง’ๅบฆไฝ†ๅˆ้žๅธธ้‡่ฆ็š„้—ฎ้ข˜ใ€‚ๅœจๆœ‰้™็ปด็š„้—ฎ้ข˜ไธญ๏ผŒไฝฟ็”จๅŽ้ชŒๅˆ†ๅธƒๆฅ่กจ็คบไธ็กฎๅฎšๆ€งๅœจไธ€ๅฎš็จ‹ๅบฆไธŠๅพ—ๅˆฐไบ†Bernstein-von Mises็Žฐ่ฑก็š„ๆ”ฏๆŒ๏ผŒๅ…ถ็กฎไฟไบ†ๅฏนไบŽๅคงๆ ทๆœฌ่€Œ่จ€๏ผŒๅฏไฟกๅŒบๅŸŸไนŸๆ˜ฏ็ฝฎไฟกๅŒบๅŸŸใ€‚ๅœจๆ— ้™็ปดๆƒ…ๅ†ตไธ‹่ฟ™ไธ€็‚นๅฎŒๅ…จๅคฑๆ•ˆ(Cox,1993๏ผ›Freedman,1999)๏ผŒๅ› ๆญค็ปง็ปญๅคฉ็œŸๅœฐไฝฟ็”จๅŽ้ชŒๅˆ†ๅธƒๆ˜ฏไธๆ˜Žๆ™บ็š„ใ€‚(็”ฑไบŽๆˆ‘ไปฌๆŠŠๅ…ˆ้ชŒๅˆ†ๅธƒๅ’ŒๅŽ้ชŒๅˆ†ๅธƒ่ง†ไธบๆญฃๅˆ™ๅŒ–ๅทฅๅ…ท๏ผŒ่ฟ™ๅฏนๆˆ‘ไปฌๆฅ่ฏดๅนถไธ็‰นๅˆซ้บป็ƒฆ๏ผ‰ไธŽๆญค็›ธๅ…ณ็š„ๆ˜ฏ๏ผŒ่ดๅถๆ–ฏ้žๅ‚ๆ•ฐๆจกๅž‹ไธญ็š„ๅ…ˆ้ชŒๅˆ†ๅธƒๆ˜ฏไธ€ไธช้šๆœบ่ฟ‡็จ‹๏ผŒๆ€ปๆ˜ฏๅŸบไบŽๅฏๆ“ไฝœๆ€ง่€Œ้€‰ๆ‹ฉ(Ghosh & Ramamoorthi, 2003; Hjort et al., 2010)๏ผŒๅ› ๆญคๆ”พๅผƒไบ†ไปปไฝ•่ฏ•ๅ›พไปฃ่กจๅฎž้™…่ฏข้—ฎ่€…ไฟกๅฟต็š„ไผช่ฃ…ใ€‚
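
The workaround above amounts to a substring search for each synonym, which does not depend on how the text is split into words. A rough Go sketch of the same idea follows. The names `synonym`, `hit`, and `grepSynonyms` are hypothetical, and the sample data in `main` is made up for illustration; in practice the synonym list would come from synonym-finder and the lines from 1.txt.

```go
package main

import (
	"fmt"
	"strings"
)

// synonym is one row of synonym-finder style output: a word and its score.
type synonym struct {
	Word string
	Sim  float64
}

// hit records a line that contains a given synonym.
type hit struct {
	Word string
	Sim  float64
	Line string
}

// grepSynonyms returns every line that contains any synonym as a substring,
// grouped by synonym in the order the synonyms were given. This mirrors
// running grep once per synonym, as in the shell loop.
func grepSynonyms(lines []string, syns []synonym) []hit {
	var hits []hit
	for _, s := range syns {
		for _, line := range lines {
			if strings.Contains(line, s.Word) {
				hits = append(hits, hit{s.Word, s.Sim, line})
			}
		}
	}
	return hits
}

func main() {
	// Made-up sample data for illustration only.
	lines := []string{
		"计算上的可操作性之间的某种妥协",
		"今天的天气很好",
	}
	syns := []synonym{{"可操作性", 0.5053}, {"周延性", 0.5535}}
	for _, h := range grepSynonyms(lines, syns) {
		fmt.Printf("%q Similarity: %.4f\n%s\n\n", h.Word, h.Sim, h.Line)
	}
}
```

Because `strings.Contains` works on raw substrings, this approach also finds a synonym that is embedded in a longer run of characters with no surrounding spaces.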

It would be great if w2vgrep worked by itself. Also, synonym-finder takes a while to return results. Maybe my 11-year-old Windows machine is just too slow.

./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性  0.00s user 0.01s system 0% cpu 54.693 total
arunsupe commented 4 months ago

Odd. synonym-finder uses the same logic functions as w2vgrep. I just deleted the unused functions from w2vgrep and loop through the model's words rather than the input text. (The model is basically a dictionary mapping word -> word vector. Each word vector is 300 32-bit floats.)

The performance bottlenecks are: (1) loading the 2 GB model file into memory, and (2) multiplying 300 floating-point numbers for each word comparison. Possible optimizations:

  1. Decrease the number of words in the model. FB's models have 2,000,000 words; a smaller number, limited to the words people care about, may do.
  2. Change the model vectors from 300 x 32-bit to 300 x 8-bit, using 8-bit ints instead of 32-bit floats. The model size will shrink to 25%, but accuracy will decrease.

I am thinking of implementing the 32-bit to 8-bit conversion for the next iteration.
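
To make the size argument concrete, here is a rough sketch of one way such a conversion could work. It assumes symmetric linear quantization with one scale factor per vector, which is only one of several possible schemes and not necessarily what the next release will use. The names `quantize` and `dequantize` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// quantize maps a float32 vector to int8 values plus one float32 scale,
// so that v[i] is approximately float32(q[i]) * scale.
// Assumption: symmetric linear quantization with a per-vector scale.
func quantize(v []float32) (q []int8, scale float32) {
	var maxAbs float32
	for _, x := range v {
		if x < 0 {
			x = -x
		}
		if x > maxAbs {
			maxAbs = x
		}
	}
	q = make([]int8, len(v))
	if maxAbs == 0 {
		return q, 1 // all-zero vector: any scale works
	}
	scale = maxAbs / 127
	for i, x := range v {
		q[i] = int8(math.Round(float64(x / scale)))
	}
	return q, scale
}

// dequantize reverses quantize, up to rounding error.
func dequantize(q []int8, scale float32) []float32 {
	v := make([]float32, len(q))
	for i, x := range q {
		v[i] = float32(x) * scale
	}
	return v
}

func main() {
	// Raw vector storage only; the word strings themselves are ignored.
	const words, dims = 2_000_000, 300
	fmt.Printf("float32 vectors: %.1f GB\n", float64(words*dims*4)/1e9)
	fmt.Printf("int8 vectors:    %.1f GB\n", float64(words*dims*1)/1e9)

	q, scale := quantize([]float32{0.25, -0.5, 1.0})
	fmt.Println(q, dequantize(q, scale))
}
```

For 2,000,000 words of 300 dimensions, the raw vectors take about 2.4 GB as 32-bit floats and about 0.6 GB as 8-bit ints, which matches the 25% figure. The per-vector scale adds 4 bytes per word, a negligible overhead at this size.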

Thanks again for giving me feedback. I am going to close this issue. Keep an eye out for the 8-bit models; they should help your performance.

AnabasisXu commented 4 months ago

Thank you and I look forward to the next iteration.