dugongdingo / WEEL

WEEL - Word Embeddings Experiments with Linguality

Filter DataSet #5

Open dugongdingo opened 6 years ago

dugongdingo commented 6 years ago

The current data retrieved from WordNet contains both common nouns and proper nouns, for a total of 96773 unambiguous nouns. It would be best to find a way to filter proper nouns out of the data set. Current proposals include:

  1. removing lemmas which are MWEs (but this doesn't take care of single-word entities, like "Jesus"),
  2. looking up whether a specific hypernym, or set of hypernyms, exactly covers the set of all proper nouns in WordNet,
  3. cross-checking whether the item is referenced as a common noun in Wiktionary,
  4. or switching to Wiktionary altogether.
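
A minimal sketch of proposals 1 and 3, on toy data rather than the real data set. It assumes lemma names mark multi-word expressions with an underscore (as WordNet lemma names do) or a space. The helper names and the `wiktionary_common` set are hypothetical stand-ins, not code from this repository.

```python
def is_mwe(lemma: str) -> bool:
    """Return True if the lemma name looks like a multi-word expression."""
    # Assumption: MWEs are joined with "_" (WordNet style) or a space.
    return "_" in lemma or " " in lemma


def drop_mwes(lemmas):
    """Proposal 1: keep only single-word lemmas."""
    return [lemma for lemma in lemmas if not is_mwe(lemma)]


def keep_common_nouns(lemmas, common_nouns):
    """Proposal 3: keep only lemmas attested as common nouns elsewhere."""
    return [lemma for lemma in lemmas if lemma in common_nouns]


# Toy sample, not taken from the actual data set.
sample = ["dog", "New_York", "Jesus", "ice_cream", "table"]

no_mwe = drop_mwes(sample)
# "Jesus" survives this step, which is the weakness noted in proposal 1.

# Hypothetical set of common nouns, standing in for a Wiktionary lookup.
wiktionary_common = {"dog", "table", "ice_cream"}
filtered = keep_common_nouns(no_mwe, wiktionary_common)
```

On this sample, the MWE filter alone leaves "Jesus" in the list, while the cross-check removes it. The cross-check is only as good as the coverage of the reference set.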

In any event, filtering might significantly reduce the number of usable nouns.

dugongdingo commented 6 years ago

b24e2964d91285a4693fbce9dc2ca682659c39b6: removing MWEs drops the dataset from 96k to 38k items.

dugongdingo commented 6 years ago

Note: 23K non-ambiguous, non-MWE English nouns in Wiktionary.