Closed: AnabasisXu closed this issue 4 months ago
Thanks for trying out w2vgrep.
For your issue, sounds like the program is using the wrong model. Could you please try the following:
md5sum cc.zh.300.bin
67af1742fa5c1c0fe20cf68aa4447cfb
curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 的
Please let me know if this fixes the issue. If not, I have to think harder.
Thanks! I updated to version 0.6 and made sure the md5sum is right:
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ md5sum cc.zh.300.bin
67af1742fa5c1c0fe20cf68aa4447cfb *cc.zh.300.bin
The sample command in your instructions worked:
curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 的
However, it only shows results with Similarity: 1.0000. I have tried other Chinese characters and the result is always the same. So is the model cc.zh.300.bin not doing its job?
I also tried to use w2vgrep on local files without curl, but my commands did not work. I expected it would at least show what a grep command would show, since the target word 合理性 is indeed in 1.txt.
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 52s
❯ ./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 合理性 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 55s
❯ cat 1.txt | ./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin 合理性
Using configuration file: G:\CS\cml\semantic-grep\config.json
The default threshold is 0.7, which may be high for your use case. Try lowering it: curl -s https://www.gutenberg.org/cache/epub/25328/pg25328.txt | w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin --threshold=0.3 的
This finds me (要 0.4776), (頭 0.4200), (鶴 0.3543). I do not know the language, so I cannot tell whether these are good matches.
To help troubleshoot the model, I added a synonym-finder.go to ./model_processing_utils/. This program finds words in the model that are similar to the query word, above a given threshold.
# build
cd model_processing_utils
go build synonym-finder.go
# run
synonym-finder -model_path path/to/cc.zh.300.bin -threshold 0.5 合理性
# Output:
Words similar to '合理性' with similarity >= 0.50:
妥当性 0.5745
周延性 0.5535
客观性 0.5030
可操作性 0.5053
合理性 1.0000
一致性 0.5334
完善性 0.5656
公正性 0.5245
有用性 0.5316
证立 0.5147
自洽性 0.5008
正当性 0.6018
必要性 0.6499
公允性 0.6152
可行性 0.5923
不合理性 0.6094
有理性 0.5529
合法性 0.6219
应然性 0.5709
不合理 0.5412
正當性與 0.5173
正确性 0.5537
合理 0.5808
可接受性 0.5151
科学性 0.6304
论证 0.5379
实证性 0.5216
有效性 0.6374
公平性 0.5250
周密性 0.5292
充分性 0.5156
吻合性 0.5006
恰当性 0.5426
必然性 0.5574
适度性 0.5401
相似性 0.5101
完备性 0.5060
If these don't solve your issue, please get back to me as before.
Hi arunsupe, thanks for developing synonym-finder! As shown in your reply, it finds quite a few synonyms of 合理性 above the 0.5 threshold. Here are just four of the results:
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1
❯ ./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性
Words similar to '合理性' with similarity >= 0.50:
周延性 0.5535
完善性 0.5656
周密性 0.5292
合理 0.5808
Yet querying 合理性 against 1.txt with a threshold of 0.5 hits no match. I tried lowering it to 0.3; that did show a lot of results, but the quality is terrible (basically equivalent to grep 性 1.txt).
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 52s
❯ ./w2vgrep.windows.amd64.exe --model_path=cc.zh.300.bin --threshold=0.5 '合理性' 1.txt
Using configuration file: G:\CS\cml\semantic-grep\config.json
But I have found a workaround by passing the result of synonym-finder to grep:
I use awk to pull out each synonym and its similarity, feed each synonym to a grep synonym 1.txt command, and if grep returns any result, echo the corresponding similarity.
As the results are long, I limit this to one match. I would say the quality of the matches is quite good.
Administrator in CS\cml\semantic-grep via 🐹 v1.19.1 took 8s
❯ ./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性 | awk 'NR > 1 {print $1, $2}' | while read -r word similarity; do
if grep --color=auto -q "$word" 1.txt; then
echo "\"$word\" Similarity: $similarity"
grep --color=auto "$word" 1.txt
echo ""
fi
done
"ๅฏๆไฝๆง" Similarity: 0.5053
由于我们把先验分布看作是贝叶斯模型的一个可检验的部分，我们不需要按照Jaynes的理论为每种情况设计一个独特的、客观正确的先验分布——关于这种做法的记录并不令人振奋(Kass & Wasserman, 1996)，又不用说很多作者对Jaynes这一具体观点持怀疑态度（Seidenfeld, 1979, 1987; Csiszár, 1995; Uffink, 1995, 1996)。简而言之，对于贝叶斯主义者来说，"模型"是先验分布和似然的组合，其中每一个都代表了科学知识、数学上的便利和计算上的可操作性之间的某种妥协。
贝叶斯非参数化模型中的不确定性表示方式是一个技术角度低又非常重要的问题。在有限维的问题中，使用后验分布来表示不确定性在一定程度上得到了Bernstein-von Mises现象的支持，其确保了对于大样本而言，可信区域也是置信区域。在无限维情况下这一点完全失效(Cox, 1993；Freedman, 1999)，因此继续天真地使用后验分布是不明智的。(由于我们把先验分布和后验分布视为正则化工具，这对我们来说并不特别麻烦；与此相关的是，贝叶斯非参数模型中的先验分布是一个随机过程，总是基于可操作性而选择(Ghosh & Ramamoorthi, 2003; Hjort et al., 2010)，因此放弃了任何试图代表实际询问者信念的伪装。)
It would be great if w2vgrep worked by itself. Also, synonym-finder takes a while to find the result. Maybe my 11-year-old Windows machine is just too slow.
./synonym-finder.exe --model_path=cc.zh.300.bin -threshold 0.5 合理性 0.00s user 0.01s system 0% cpu 54.693 total
Odd. synonym-finder uses the same logic functions as w2vgrep. I just deleted unused functions from w2vgrep and am looping through the model's words rather than the input text. (The model is basically a dictionary mapping word -> word vector; each word vector is 300 32-bit floats.)
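For anyone curious, that comparison is conceptually just a cosine-similarity scan over the vocabulary. Below is a minimal Go sketch of the idea; it is an illustration under my own assumptions (made-up function names and a toy inline model), not the actual w2vgrep or synonym-finder source, which loads the 300-dimensional vectors from cc.zh.300.bin instead.

package main

import (
    "fmt"
    "math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// similarWords scans every word in the model and keeps those whose
// similarity to the query word is at or above the threshold.
func similarWords(model map[string][]float32, query string, threshold float64) map[string]float64 {
    out := make(map[string]float64)
    qv, ok := model[query]
    if !ok {
        return out // query word is not in the model's vocabulary
    }
    for w, v := range model {
        if s := cosine(qv, v); s >= threshold {
            out[w] = s
        }
    }
    return out
}

func main() {
    // Toy stand-in model; a real cc.zh.300.bin vector has 300 floats per word.
    model := map[string][]float32{
        "合理性": {0.9, 0.1, 0.2},
        "合法性": {0.8, 0.2, 0.3},
        "鶴":   {0.0, 0.9, 0.1},
    }
    for w, s := range similarWords(model, "合理性", 0.5) {
        fmt.Printf("%s %.4f\n", w, s)
    }
}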
The performance bottlenecks are: 1. loading the 2 GB model file into memory, and 2. multiplying 300 floating-point numbers for each word comparison. Possible optimizations:
I am thinking of implementing 32-bit to 8-bit conversion for the next iteration.
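To make that concrete, one common way to do such a conversion is per-vector linear quantization: store each 300-float vector as 300 int8 values plus a single float32 scale, which cuts memory roughly 4x and lets the dot product be accumulated in integers. The sketch below is only an illustration of that general technique under my own assumptions, not the actual scheme planned for w2vgrep.

package main

import "fmt"

// quantize maps each float32 component to an int8 using one per-vector
// scale, shrinking the vector from 4 bytes to 1 byte per dimension.
func quantize(v []float32) (q []int8, scale float32) {
    var maxAbs float32
    for _, x := range v {
        if x < 0 {
            x = -x
        }
        if x > maxAbs {
            maxAbs = x
        }
    }
    if maxAbs == 0 {
        return make([]int8, len(v)), 1
    }
    scale = maxAbs / 127
    q = make([]int8, len(v))
    for i, x := range v {
        q[i] = int8(x / scale)
    }
    return q, scale
}

// dotQuantized accumulates the dot product in int32 and rescales once at
// the end; this integer inner loop is where the speed and memory win over
// float32 math comes from.
func dotQuantized(a, b []int8, scaleA, scaleB float32) float32 {
    var acc int32
    for i := range a {
        acc += int32(a[i]) * int32(b[i])
    }
    return float32(acc) * scaleA * scaleB
}

func main() {
    qa, sa := quantize([]float32{0.12, -0.80, 0.33})
    qb, sb := quantize([]float32{0.10, -0.75, 0.40})
    // Approximates the float32 dot product of the two original vectors.
    fmt.Println(dotQuantized(qa, qb, sa, sb))
}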
Thanks again for the feedback. I am going to close this issue. Keep an eye out for the 8-bit models; they will help your performance.
Thank you and I look forward to the next iteration.
First of all, thank you so much for the multi-language support.
I have followed the instructions by doing the 4 things. No errors were reported, but it did not work with Chinese.
I have prepared a 1.txt file with UTF-8 encoding and made sure that a grep command will find the keyword in the file. Interestingly, the Chinese model works for the English test: glory in hm.txt (Old Man & Sea).