Open arademaker opened 9 months ago
That's correct. ukb doesn't do any kind of tokenization; it just matches the input tokens (separated by spaces) against entries in the dictionary. So, if you want to disambiguate the multiword expression, replace "Cape Town" with "cape_town". If you want to disambiguate each word separately, use "cape town" (in lower case).
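For illustration only, here is a minimal sketch of a preprocessing step that could be run before handing text to ukb. It is not part of ukb and is not necessarily how the authors preprocess their data. The function name `merge_multiwords`, the `max_len` parameter, and the toy `lexicon` are all hypothetical. It assumes you have already loaded the dictionary's lexical entries into a Python set, with multiword entries lowercased and joined by underscores.

```python
from typing import Iterable, List, Set


def merge_multiwords(tokens: Iterable[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy longest-match merge of adjacent tokens into underscore-joined entries.

    `lexicon` is a hypothetical set of dictionary entries such as "cape_town".
    """
    toks = [t.lower() for t in tokens]
    out: List[str] = []
    i = 0
    while i < len(toks):
        # Try the longest candidate span first, down to spans of 2 tokens.
        for n in range(min(max_len, len(toks) - i), 1, -1):
            candidate = "_".join(toks[i:i + n])
            if candidate in lexicon:
                out.append(candidate)
                i += n
                break
        else:
            # No multiword entry starts here; keep the single lowercased token.
            out.append(toks[i])
            i += 1
    return out


# Toy lexicon (assumption): in practice this would come from the ukb dictionary file.
lexicon = {"cape_town", "cape", "town", "south_africa"}
print(merge_multiwords("I visited Cape Town in South Africa".split(), lexicon))
# → ['i', 'visited', 'cape_town', 'in', 'south_africa']
```

Greedy longest match always prefers "cape_town" over "cape" and "town" when the merged form is in the lexicon. If you would rather disambiguate the words individually, skip the merge and only lowercase the tokens.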
Hello. Indeed, I know that UKB expects already tokenized text as input, that is, NED is not UKB's job. But what is your approach to NED or tokenization (I call it tokenization because tokens need to be merged)? I am asking especially about the papers where you address specific domains, which contain a lot of multiword expressions.
https://arxiv.org/pdf/1503.01655.pdf
In the dictionary building section, I didn't find a description of how you deal with multiword expressions. How were the tokenization and preprocessing of the text done? In the example
Cape Town needs to be tokenized as cape_town (one single token, lowercased and joined with an underscore), right? You have both cape_town and cape as lexical entries for the node Cape_Town. So are you disambiguating cape and town separately, without detecting the multiword expression?