Closed saippuakauppias closed 5 years ago
The vocabulary comprises all the tokens in the train, dev, and test sets of the OntoNotes corpus. It's true that there is some noise in the text and it could be cleaned, which might result in marginally better performance. However, I doubt that the effect is substantial.
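To make the idea concrete, here is a minimal sketch of how such a `tokens.txt` file could be produced by counting every token across all splits. This is NOT AllenNLP's actual vocabulary code, and the `@@UNKNOWN@@` header line and file layout are assumptions for illustration only:

```python
from collections import Counter

def build_tokens_file(splits, out_path, min_count=1):
    """Collect every token seen in the given splits (e.g. train/dev/test)
    and write one token per line, most frequent first.

    Illustrative only -- not AllenNLP's real implementation."""
    counts = Counter()
    for split in splits:
        for sentence in split:
            counts.update(sentence)
    with open(out_path, "w", encoding="utf-8") as f:
        # Assumed file layout: a reserved unknown-token line first.
        f.write("@@UNKNOWN@@\n")
        for token, count in counts.most_common():
            if count >= min_count:
                f.write(token + "\n")

# Hypothetical toy splits standing in for the OntoNotes reader output.
# Note that noisy tokens (URLs, punctuation) end up in the file too,
# which is why the pretrained model's tokens.txt contains them.
train = [["the", "cat", "sat"], ["http://example.com", "!"]]
dev = [["the", "dog"]]
build_tokens_file([train, dev], "tokens.txt")
```

Because every token from every split is kept, URLs, punctuation-only strings, and digits all land in the file, exactly as observed in the pretrained model.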
Thanks for the comment!
@DeNeutoy thanks for reply!
Could you advise me how to train my own model for coreference resolution (I would like to try applying normalization before training)? Is this described in any detail in the documentation, or could you explain step by step how the current model was trained? I promise I will share the results!
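For reference, AllenNLP training is driven by a config file passed to the `allennlp train` command. A minimal sketch, assuming the coref config path from the AllenNLP repository at the time (`training_config/coref.jsonnet`) and a hypothetical output directory:

```shell
# Train a coreference model from a config file; -s sets the
# serialization directory where weights and vocabulary/tokens.txt
# are written. Paths here are assumptions, not verified commands
# from this thread.
allennlp train training_config/coref.jsonnet -s /tmp/coref_output
```

Any preprocessing such as normalization would need to happen in the dataset reader before the vocabulary is built, since the vocabulary is derived from the tokens the reader emits.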
> The vocabulary comprises all the tokens in the train, dev, and test sets of the OntoNotes corpus.
Hi @DeNeutoy. I'm looking for the method/code for generating the file vocabulary/tokens.txt. How do you do that? Thank you.
Note: I'm asking because I would like to use a BERT model with AllenNLP, but I don't understand how to build this tokens.txt file.
Why does vocabulary/tokens.txt in your pretrained model for coreference resolution contain many strange things like URLs, punctuation-only strings, digits, and so on? I don't know how you trained this model, but maybe it needs normalization for better results?