TromboneDavies / PolarOps


Research what kinds of features we might engineer #31

Closed divilian closed 3 years ago

divilian commented 3 years ago

We already know that "whether the document has stem X somewhere in it or not" is a usable feature. What about all kinds of other things?
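For reference, a minimal sketch of what the stem-presence feature could look like. Everything named here is an assumption for illustration: `crude_stem` is a hypothetical stand-in for whatever real stemmer the project uses, and `stem_presence_features` is not a function from this repo.

```python
import re


def crude_stem(token: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_presence_features(document: str, stems: list[str]) -> dict[str, int]:
    """Return 1 for each stem that occurs anywhere in the document, else 0."""
    tokens = re.findall(r"[a-z']+", document.lower())
    doc_stems = {crude_stem(t) for t in tokens}
    return {stem: int(stem in doc_stems) for stem in stems}


if __name__ == "__main__":
    print(stem_presence_features("They were voting and shouted loudly",
                                 ["vot", "shout", "tax"]))
```

Each stem becomes one binary column, so the feature ignores how often or where the stem appears.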

akochans commented 3 years ago

For more info, I put a Google Doc in our shared Google Drive folder. Most bolded citations in the document are also in Zotero.

Schmidt and Wiegand 2017 describe "simple surface" features, most commonly n-grams. These are effective but rarely used alone. Additional simple surface features: Linguistic features from Nobata et al. 2016:

divilian commented 3 years ago

For the "words not in the dictionary" feature, what might be the best dictionary to use?

divilian commented 3 years ago

Mostly done. @akochans's Google Doc has a roughly prioritized list of features we want to try (boldface = high priority; red = our own invention rather than Nobata's). As for the "best dictionary" for judging non-real words, there seems to be no consensus.
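Since the dictionary choice is still open, here is a rough sketch of the "words not in the dictionary" feature with the dictionary passed in as a parameter, so any word list can be swapped in later. The function name, the returned keys, and the tiny word list in the usage line are all assumptions for illustration, not something taken from the Google Doc or from Nobata et al.

```python
import re


def non_dictionary_features(document: str, dictionary: set[str]) -> dict[str, float]:
    """Count tokens missing from `dictionary` and their share of all tokens."""
    tokens = re.findall(r"[a-z']+", document.lower())
    unknown = [t for t in tokens if t not in dictionary]
    # Guard against empty documents to avoid dividing by zero.
    ratio = len(unknown) / len(tokens) if tokens else 0.0
    return {"non_dict_count": float(len(unknown)), "non_dict_ratio": ratio}


if __name__ == "__main__":
    toy_dictionary = {"this", "is", "a", "test"}  # illustrative word list only
    print(non_dictionary_features("This is a tesst", toy_dictionary))
```

Returning both a count and a ratio keeps the feature comparable across documents of different lengths.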