allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

Strange things in tokens.txt for Coreference Model #2750

Closed: saippuakauppias closed this issue 5 years ago

saippuakauppias commented 5 years ago

Why does vocabulary/tokens.txt in your pretrained model for coreference resolution contain many strange things like URLs, punctuation-only strings, digits, and so on?

I don't know how you trained this model, but maybe the data needs normalization for better results?
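
For illustration, a quick scan along these lines (the path assumes the unpacked pretrained model archive, and the patterns are rough heuristics of mine, not anything from the model) shows how many such entries the file contains:

```python
import re

# Load the vocabulary shipped with the pretrained model, one token per line.
with open("vocabulary/tokens.txt", encoding="utf-8") as f:
    tokens = [line.rstrip("\n") for line in f]

# Rough heuristics for the odd-looking entries mentioned above.
urls = [t for t in tokens if t.startswith(("http://", "https://", "www."))]
punct = [t for t in tokens if re.fullmatch(r"\W+", t)]
digits = [t for t in tokens if t.isdigit()]

print(f"{len(tokens)} tokens total: {len(urls)} URL-like, "
      f"{len(punct)} punctuation-only, {len(digits)} all-digit")
```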

DeNeutoy commented 5 years ago

The vocabulary is composed of all the tokens in the train, dev and test sets of the OntoNotes corpus. It's true that there is some noise in the text and it could be cleaned, which might result in marginally better performance. However, I doubt that the effect would be substantial.
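
For anyone wondering how that file is produced: AllenNLP builds a Vocabulary by counting tokens over dataset instances and then serializes it, one file per namespace. A minimal sketch, assuming the ConllCorefReader API of the AllenNLP versions from around that time (the OntoNotes paths and the max_span_width value are placeholders):

```python
from itertools import chain

from allennlp.data import Vocabulary
from allennlp.data.dataset_readers.coreference_resolution import ConllCorefReader

reader = ConllCorefReader(max_span_width=10)

# Gather instances from all three splits, as described above.
instances = list(chain(
    reader.read("/path/to/ontonotes/train"),
    reader.read("/path/to/ontonotes/dev"),
    reader.read("/path/to/ontonotes/test"),
))

# Count tokens over the instances and build the vocabulary.
vocab = Vocabulary.from_instances(instances)

# Writes one file per namespace, e.g. vocabulary/tokens.txt.
vocab.save_to_files("vocabulary")
```

Since every token that appears anywhere in those splits ends up in tokens.txt, the URLs, punctuation and digit strings from the raw OntoNotes text show up in it too.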

Thanks for the comment!

saippuakauppias commented 5 years ago

@DeNeutoy thanks for the reply!

Could you advise me on how to train my own model for coreference resolution (I would like to try using normalization before training)? Is this described in detail somewhere in the documentation, or could you explain step by step how the current model was trained? I promise that I will share the results!
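
Not an official recipe, but the usual AllenNLP workflow is to train from the Jsonnet config that ships with the repo. A sketch, assuming the coreference config lives at training_config/coref.jsonnet (edit its data paths to point at your normalized OntoNotes splits first):

```python
from allennlp.commands.train import train_model_from_file

# Train the coreference model from the repo's config; the model
# archive and its vocabulary/ directory land in serialization_dir.
train_model_from_file(
    parameter_filename="training_config/coref.jsonnet",
    serialization_dir="/tmp/coref_model",
)
```

The command-line equivalent is `allennlp train training_config/coref.jsonnet -s /tmp/coref_model`.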

piegu commented 2 years ago

> The vocabulary is composed of all the tokens in the train, dev and test sets of the OntoNotes corpus.

Hi @DeNeutoy. I'm looking for the method and code for creating the file vocabulary/tokens.txt. How do you do that? Thank you.

Note: I'm asking because I would like to use a BERT model with allennlp, but I don't understand how to build this tokens.txt file.
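
For what it's worth: with a pretrained transformer the wordpiece vocabulary comes from the HuggingFace tokenizer, not from the training data, and AllenNLP copies it into its Vocabulary for you the first time tokens are indexed. A sketch, assuming the PretrainedTransformerTokenizer/PretrainedTransformerIndexer API of allennlp 2.x:

```python
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

model_name = "bert-base-cased"
tokenizer = PretrainedTransformerTokenizer(model_name)
indexer = PretrainedTransformerIndexer(model_name)  # default namespace: "tags"

vocab = Vocabulary()
tokens = tokenizer.tokenize("AllenNLP makes coreference easier.")

# Indexing through the transformer indexer copies BERT's wordpiece
# vocabulary from the HuggingFace tokenizer into `vocab`.
indexer.tokens_to_indices(tokens, vocab)

# Writes one file per namespace, e.g. vocabulary/tags.txt, in the
# same one-token-per-line format as tokens.txt.
vocab.save_to_files("vocabulary")
```

So for a BERT-based model you generally don't build tokens.txt by hand; only the namespaces derived from your data (labels and the like) still come from Vocabulary.from_instances.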