Closed hezaoke closed 1 year ago
Thanks for the message. I think that would depend on the tokenizer of the particular model but I don't see why we couldn't make it work.
Could you share a bit more about your examples and what the expected outcome is vs what you see? Since I don't have experience using Chinese language models, I haven't been able to test this use case.
Thank you for your response. Take the following Chinese sentence as an example: "我早上喝了咖啡。" (literally, 'I this-morning drank coffee'). I would like to compute the surprisal of each phrase in the sentence, from first to last: 我, 早上, 喝, 了, 咖啡. Here is what I see when I feed the Chinese characters directly to the surprisal model:
æĪ ij æĹ © ä¸Ĭ å ĸ Ŀ äº Ĩ å Ĵ ĸ å ķ ¡
nan 1.834 6.422 1.423 1.437 4.557 0.457 0.998 1.795 0.001 2.162 0.002 0.000 0.001 0.000 0.000
Any tokenizer that works for Chinese (e.g., a BERT tokenizer) would be fine with me. I am just not clear on how to use a tokenizer together with the surprisal package.
Is this the type of example you were asking for? Please let me know if you were thinking of some other examples.
What model are you using here? Note that the choice of model automatically determines which tokenizer is used with it.
E.g., here's the example from the README: `m = AutoHuggingFaceModel.from_pretrained('gpt2')`. What does this line look like for you when you initialize a model?
AFAIK this model (`gpt2`) is English-only, so it wouldn't work very well if you fed it multilingual input. I would suggest searching for an appropriate GPT-like model whose tokenizer supports Chinese characters (BERT support is still in progress).
See this line for how the tokenizer gets initialized: https://github.com/aalok-sathe/surprisal/blob/main/surprisal/model.py#L38
marking issue stale pending more information
Sorry for the delayed response. I am thinking of using the BERT model for Chinese.
model_cn = AutoHuggingFaceModel.from_pretrained('bert-base-chinese')
This raises a `NotImplementedError`.
Is there any way to compute surprisal for Chinese sentences? Right now, the Chinese characters are processed in a strange way, and the number of output tokens does not match the number of Chinese characters in the input.