NouamaneTazi / bloomz.cpp

C++ implementation for BLOOM
MIT License

Something wrong with the tokenize function. #30

Open samsha1971 opened 1 year ago

samsha1971 commented 1 year ago

The ggml model converted from "YeungNLP/bloomz-396m-zh" or "WangZeJun/bloom-396m-chat" is missing some tokens: characters such as "焙" or "擀" have no corresponding tokens, so the generated result cannot be displayed correctly. The official Python version of the model does not have this problem.

Sample output; note the "�" sections:

main: prompt: '面包的烘焙制作流程'
main: number of tokens in prompt = 3
 24765 -> '面包'
   373 -> '的'
 28967 -> '烘'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

面包(24765)的(373)烘(28967)�(1165)�(237)技巧(16012):(1038)
(189)1(20).(17) (210)面(1157)条(1996)要(853)煮熟(43916),(355)否则(14458)容易(7305)粘(14494)。(420) 
(2813)2(21).(17) 应(23830)使用(2527)烤(15337)箱(8226)而不是(12285)微波(30656)炉(16613)加热(25228)面团(44449)。
(672)3(22).(17) 用(16647)冷水(33637)淋(15735)湿(10556)面团(44449)以防止(31473)黏(19639)在一起(10919)。
(672)4(23).(17) 在(3612)预(3119)热(4291)至(1546)摄氏(39868)175(13634)度(1423)时(1018)开始(3590)烘(28967)�(1165)�(237),(355)直到(8326)底部(26609)变得
(13044)金(1539)黄色(21313)并(1437)散(4711)发出(13801)香味(32740)即可(10134)享用(42892)</s>(2) [end of text]

main: mem per token =  4944640 bytes
main:     load time =   558.57 ms
main:   sample time =   516.50 ms
main:  predict time =  3674.82 ms / 52.50 ms per token
main:    total time =  4945.50 ms
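A likely explanation for the "�(1165)�(237)" pairs in the log above (an assumption, not confirmed by the issue): BLOOM uses a byte-level BPE tokenizer, so a multi-byte UTF-8 character like "焙" can be split across two byte-level tokens. If each token's bytes are decoded and printed independently, neither fragment is valid UTF-8 and each renders as the replacement character, whereas buffering the bytes of consecutive tokens before decoding recovers the character. The sketch below illustrates this with plain Python byte handling; the split point and token ids are hypothetical.

```python
# Sketch: why byte-level BPE pieces print as "�" when decoded one at a time.
# Assumption: the ggml-side printer decodes each token's bytes independently,
# while the Hugging Face tokenizer buffers bytes until they form valid UTF-8.

text = "焙"                       # U+7119, 3 bytes in UTF-8
raw = text.encode("utf-8")        # b'\xe7\x84\x99'

# Suppose BPE split the character across two byte-level tokens
# (hypothetical split; the real merge points depend on the vocab):
piece_a, piece_b = raw[:2], raw[2:]

# Decoding each piece on its own yields U+FFFD replacement characters,
# because neither fragment is a complete UTF-8 sequence:
print(piece_a.decode("utf-8", errors="replace"))
print(piece_b.decode("utf-8", errors="replace"))

# Concatenating the raw bytes of both tokens before decoding
# recovers the original character:
print((piece_a + piece_b).decode("utf-8"))
```

This also suggests a fix direction for the C++ side: accumulate token bytes in a buffer and only flush complete UTF-8 sequences to the output, rather than printing each token's bytes as soon as it is sampled.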