Closed by divilian 3 years ago
For more info, I put a Google Doc in our shared Google Drive folder. Most bolded citations in the document are also in Zotero.
Schmidt and Wiegand 2017 talk of "simple surface" features. These are most commonly n-grams, and while effective, they are rarely used alone. Additional simple surface features: Linguistic features from Nobata et al. 2016:
For the "words not in the dictionary" feature, what might be the best dictionary to use?
Mostly done. @akochans's Google Doc has a roughly prioritized list of features we want to try (boldface = high priority; red = our own invention instead of Nobata's). As for the "best dictionary" for judging non-real-words, there seems to be no consensus on this.
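Since the dictionary choice is still open, here is a minimal sketch of how the "words not in the dictionary" feature could be computed so that the word list is swappable. Everything named here (`TOY_DICTIONARY`, `tokenize`, `oov_features`) is hypothetical and not from our codebase, and the tiny word set is only a stand-in for whichever real dictionary we end up using.

```python
import re

# Hypothetical stand-in vocabulary; a real run would load whichever
# word list the project settles on.
TOY_DICTIONARY = {"you", "are", "a", "the", "this", "is", "so", "bad"}


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def oov_features(text, dictionary):
    """Count tokens that are absent from the given dictionary."""
    tokens = tokenize(text)
    oov = [t for t in tokens if t not in dictionary]
    return {
        "num_tokens": len(tokens),
        "num_oov": len(oov),
        # Guard against empty documents to avoid dividing by zero.
        "frac_oov": len(oov) / len(tokens) if tokens else 0.0,
    }


features = oov_features("You are sooo baaad", TOY_DICTIONARY)
```

Passing the dictionary as an argument would let us compare candidate word lists on the same data without touching the feature code.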
We already know that "whether the document has stem X somewhere in it or not" is a usable feature. What about all kinds of other things?
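For reference, the stem-presence feature mentioned above might look roughly like the sketch below. The suffix stripper is a deliberately crude, hypothetical stand-in (not the Porter stemmer or whatever stemmer we actually use), and the function names are invented for illustration.

```python
import re


def crude_stem(token):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_presence(text, stems_of_interest):
    """Return 1/0 per stem: does the document contain it anywhere?"""
    tokens = re.findall(r"[a-z]+", text.lower())
    present = {crude_stem(t) for t in tokens}
    return {stem: int(stem in present) for stem in stems_of_interest}


row = stem_presence("They keep trolling everyone", ["troll", "insult"])
```

Other candidate features could follow the same pattern: one function per feature that takes the raw text and returns a small dict of named values.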