HIT-SCIR / ELMoForManyLangs

Pre-trained ELMo Representations for Many Languages
MIT License

Tokenization details #45

Closed sbmaruf closed 5 years ago

sbmaruf commented 5 years ago

In the README it says "Do remember tokenization!". What type of tokenization is needed? Do we need to give case-sensitive or case-insensitive input to the model, and is there any normalization involved?

Oneplus commented 5 years ago

Please use udpipe for tokenization.
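To make the input format concrete: the embedder in this repo consumes pre-tokenized sentences as lists of token strings. Below is a minimal sketch of preparing that input; the regex tokenizer is only a stand-in for illustration (in practice, tokenize with UDPipe as recommended above, and pass the resulting token lists to the embedder, e.g. `Embedder.sents2elmo`):

```python
import re

def naive_tokenize(sentence):
    """Placeholder tokenizer: splits into word and punctuation tokens.
    For real use, replace this with UDPipe output, as the maintainers
    recommend -- this regex is only for demonstrating the input shape."""
    return re.findall(r"\w+|[^\w\s]", sentence, re.UNICODE)

# Each sentence becomes a list of tokens; the batch is a list of such lists.
sents = [naive_tokenize("Do remember tokenization!"),
         naive_tokenize("Pre-trained ELMo for many languages.")]
print(sents[0])  # ['Do', 'remember', 'tokenization', '!']

# These token lists are what you would feed to the embedder, e.g.:
#   from elmoformanylangs import Embedder
#   e = Embedder('/path/to/model')   # hypothetical path
#   vectors = e.sents2elmo(sents)
```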

sbmaruf commented 5 years ago

Hi @Oneplus, thank you for the quick reply. Can you take a look at this issue, # . Do I have to add the bos and eos tokens at the start and end of the sentences?

Oneplus commented 5 years ago

Hi @sbmaruf, you won't need a starting bos or ending eos token for each sentence. We already handle the bos and eos tokens in our code. The unstable output seems to result from the stateful LSTM in ELMo, but empirically it does not hurt performance.

sbmaruf commented 5 years ago

@Oneplus Thank you for your reply. That helps a lot.

PawelFaron commented 5 years ago

> Hi @sbmaruf, you won't need a starting bos or ending eos token for each sentence. We already handle the bos and eos tokens in our code. The unstable output seems to result from the stateful LSTM in ELMo, but empirically it does not hurt performance.

Hi @Oneplus. Is there a way to disable the statefulness of the LSTM in ELMo? It hurts performance in my case.