aalok-sathe / surprisal

A unified interface for computing surprisal (log probabilities) from language models! Supports neural, symbolic, and black-box API models.
https://aalok-sathe.github.io/surprisal/
MIT License

compute surprisal for Chinese characters #11

Closed. hezaoke closed this issue 1 year ago.

hezaoke commented 1 year ago

Is there any way to compute surprisal for Chinese sentences? Right now, the Chinese characters are processed in a strange way, and the number of output values does not match the number of Chinese characters in the input.

aalok-sathe commented 1 year ago

Thanks for the message. I think that would depend on the tokenizer of the particular model, but I don't see why we couldn't make it work.

Could you share a bit more about your examples and what the expected outcome is vs what you see? Since I don't have experience using Chinese language models, I haven't been able to test this use case.

hezaoke commented 1 year ago

Thank you for your response. Take the following Chinese sentence as an example: "我早上喝了咖啡." (literally 'I this morning drank coffee'). I would like to compute the surprisal of each phrase in the sentence, from first to last: 我,早上,喝,了,咖啡。Here is what I see when I feed the Chinese characters directly to the surprisal model:

æĪ ij æĹ © ä¸Ĭ å ĸ Ŀ äº Ĩ å Ĵ ĸ å ķ ¡
nan 1.834 6.422 1.423 1.437 4.557 0.457 0.998 1.795 0.001 2.162 0.002 0.000 0.001 0.000 0.000

Any tokenizer that works for Chinese would be fine (e.g. a BERT tokenizer). I am just not clear on how to use a tokenizer together with the surprisal package.

Is this the type of example you were asking for? Please let me know if you were thinking of some other examples.

aalok-sathe commented 1 year ago

Which model are you using here? Note that the choice of model automatically determines which tokenizer is used with it.

e.g. here's the example from the README: m = AutoHuggingFaceModel.from_pretrained('gpt2'). What does this line look like for you when you initialize your model?

AFAIK this model (gpt2) is English-only, so it wouldn't work very well on multilingual input. I would suggest searching for an appropriate GPT-like model whose tokenizer supports Chinese characters; see the untested sketch below. (BERT is still in the process of being supported.)
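Something like this might work; this is a minimal, untested sketch. uer/gpt2-chinese-cluecorpussmall is one GPT-2-style Chinese checkpoint on the Hugging Face Hub, but I haven't verified it against this package, and IIRC you may need to pass model_class='gpt' for model names the package can't classify on its own:

```python
from surprisal import AutoHuggingFaceModel

# untested: any GPT-style causal LM with a Chinese-aware tokenizer
# should slot in here; this checkpoint is just one candidate
m = AutoHuggingFaceModel.from_pretrained('uer/gpt2-chinese-cluecorpussmall')

for result in m.surprise(["我早上喝了咖啡"]):
    print(result)
```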

aalok-sathe commented 1 year ago

See this line for how the tokenizer gets initialized: https://github.com/aalok-sathe/surprisal/blob/main/surprisal/model.py#L38
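To make the garbled output above concrete: gpt2's tokenizer is a byte-level BPE, so each Chinese character falls back to several raw-byte tokens (hence the 16 mojibake tokens above for a 7-character sentence), whereas bert-base-chinese tokenizes Chinese character-by-character. You can check this with plain transformers, independently of this package:

```python
from transformers import AutoTokenizer

sent = "我早上喝了咖啡"

# gpt2's byte-level BPE splits each character into raw bytes,
# which render as mojibake like 'æĪ', 'ij', ...
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
print(gpt2_tok.tokenize(sent))  # 16 byte-level tokens

# bert-base-chinese emits one token per Chinese character:
# ['我', '早', '上', '喝', '了', '咖', '啡']
bert_tok = AutoTokenizer.from_pretrained("bert-base-chinese")
print(bert_tok.tokenize(sent))
```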

aalok-sathe commented 1 year ago

marking issue stale pending more information

hezaoke commented 10 months ago

Sorry for the delayed response. I am thinking of using a BERT model for Chinese:

model_cn = AutoHuggingFaceModel.from_pretrained('bert-base-chinese')

This line raises NotImplementedError.
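For completeness, the minimal repro is just the import plus that line:

```python
from surprisal import AutoHuggingFaceModel

# raises NotImplementedError: BERT-family (masked LM) models
# aren't supported yet, per the earlier comments in this thread
model_cn = AutoHuggingFaceModel.from_pretrained('bert-base-chinese')
```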