tokenization and WSD - Githubissues

arademaker commented 9 months ago

https://arxiv.org/pdf/1503.01655.pdf

In the dictionary building section, I didn't find a description of how you deal with multi-word expressions. How were the tokenization and preprocessing of the text done? In the example

“Alan Kourie, CEO of the Lions franchise, had discussions with Fletcher in Cape Town”

Cape Town needs to be tokenized to cape_town (one single token, lowercase, with an underscore), right? You have both cape_town and cape as lexical entries for the node Cape_Town. So are you disambiguating cape and town separately, without detecting the multi-word expression?

% rg "^cape_town " dict_full.txt
7517030:cape_town Headingley_Stadium:1 Newlands_Cricket_Ground:190 Cape_Town_Stadium:19 List_of_Cape_Town_suburbs:1 City_of_Cape_Town:7 Cape_Town_railway_station:10 Archbishop_of_Cape_Town:4 Newlands_Stadium:17 Athlone_Stadium:1 HMS_Capetown_(D88):1 Port_of_Cape_Town:1 St._George's_Cathedral,_Cape_Town:9 University_of_Cape_Town:18 Roman_Catholic_Archdiocese_of_Cape_Town:9 Capetown,_California:1 Same-sex_marriage_in_South_Africa:1 Cape_Town_City_Council:1 Cape_Colony:1 Anglican_Diocese_of_Cape_Town:13 Cape_Town_International_Airport:6 Cape_Town:6646

% rg "^cape " dict_full.txt
4783012:cape Cape_plc:4 Ulster_coat:1 Joey_Cape:3 Cape_of_Good_Hope:70 British_Cape_Colony:7 Computer-aided_production_engineering:3 South_Africa:1 Cape_River:2 Jack_Cape:15 Caffeic_acid_phenethyl_ester:3 Canadian_Association_of_Physicians_for_the_Environment:3 Convective_available_potential_energy:27 Western_Cape_wine:1 Jonathan_Cape:12 Hiland_Park,_Kolkata:1 Cape_Cod_(house):1 Cape_Province:33 Safford_Cape:3 Cape_Ray:1 Headlands_and_bays:57 Dutch_Cape_Colony:1 Cape_Cod:6 The_Cape_(1996_TV_series):3 Cyclically_adjusted_price-to-earnings_ratio:3 Cape_Floristic_Region:2 The_Cape_(2011_TV_series):3 Peninsula:1 Caribbean_Examinations_Council:6 Headland:31 Western_Cape:4 Cape_gauge:2 Cape_Waldron:1 Siffrey_Point:1 Taxidermy:3 Cape_(geography):107 List_of_municipalities_in_Prince_Edward_Island:1 Cape_Dezhnev:1 Capes_on_the_Mississippi_River:1 Director_of_Cost_Assessment_and_Program_Evaluation:3 Cape_Matapan:1 Cape_Parrot:1 Cape_Colony:74 Cape_Malay:1 Cape_(dog):5 University_of_the_Western_Cape:2 Center_for_the_Army_Profession_and_Ethic:3 History_of_Cape_Colony_before_1806:2 Eastern_Cape:1 Drosera_capensis:3 Cape_Cormorant:1 Cape_(writ):5 Cape:197 Cape_Field_at_Fort_Glenn:1 Cape_Town:15 Cape_cobra:3
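
For reading the output above, here is a minimal sketch that parses one dictionary line and picks the most frequent candidate. It assumes the format visible in the two lines: a headword followed by space-separated `Concept:count` pairs, every pair carrying a count, with the `rg` line-number prefix already stripped. The function name `parse_dict_line` is hypothetical and not part of ukb.

```python
def parse_dict_line(line):
    """Split one dictionary line into (headword, {concept: count})."""
    headword, *pairs = line.split()
    candidates = {}
    for pair in pairs:
        # Split on the last colon only, so a colon inside a concept name is kept.
        concept, _, count = pair.rpartition(":")
        candidates[concept] = int(count)
    return headword, candidates


# Shortened copy of the cape_town line shown above.
line = "cape_town City_of_Cape_Town:7 University_of_Cape_Town:18 Cape_Town:6646"
headword, candidates = parse_dict_line(line)

# Candidate concept with the highest count for this headword.
top = max(candidates, key=candidates.get)
print(headword, top, candidates[top])
```

On the full lines, the same ranking puts Cape_Town first for cape_town (6646) and Cape first for cape (197), with Cape_Town a minor candidate there (15).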

asoroa commented 9 months ago

That's correct. ukb doesn't do any kind of tokenization; it just matches the input tokens (separated by spaces) against entries in the dictionary. So, if you want to disambiguate the multiword expression, replace "Cape Town" with "cape_town". If you want to disambiguate each word separately, use "cape town" (in lower case).
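
One possible way to do that replacement before calling ukb is a greedy longest-match pass over the tokens, joining any run whose lowercased, underscore-joined form is a dictionary headword. This is only an illustrative sketch, not something ukb provides: `merge_multiwords`, the `max_len` limit, and the tiny `headwords` set are all assumptions.

```python
def merge_multiwords(tokens, headwords, max_len=4):
    """Greedily join the longest run of tokens that forms a dictionary headword."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, shrinking down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "_".join(t.lower() for t in tokens[i:i + n])
            if candidate in headwords:
                out.append(candidate)
                i += n
                break
        else:
            # No multiword match at this position: keep the single token, lowercased.
            out.append(tokens[i].lower())
            i += 1
    return out


# Hypothetical headword set; in practice it would be read from the dictionary file.
headwords = {"cape", "cape_town"}
tokens = "had discussions with Fletcher in Cape Town".split()
merged = merge_multiwords(tokens, headwords)
print(" ".join(merged))
```

Greedy matching always prefers cape_town over cape when both are present, so text where the two words should be disambiguated separately would need a different policy.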

arademaker commented 9 months ago

Hello. Indeed, I know that UKB expects already tokenized text as input; that is, NED is not UKB's job. But what is your approach to NED or tokenization (I call it tokenization because tokens need to be merged)? I am asking especially about the papers where you address specific domains, which contain a lot of multiword expressions.

asoroa / ukb

tokenization and WSD #16