Stylometry-based Fraud and Plagiarism Detection for Learning at Scale

[ ] Character Frequency (48 features) The relative frequency of individual characters. This feature set contains the relative frequencies of a-z and A-Z.
[ ] Word Length Frequency (20 features) The relative frequency of each word length. In some rare cases the part-of-speech tagger was not able to filter out certain artifacts, e.g. long numbers or some e-mail addresses (without the @ sign), which show up as particularly long words. To filter such elements we only use words of up to 20 characters.
[ ] Sentence Length Frequency (35 features) The relative frequency of each sentence length. As with the word length feature, we filter out overly long sentences and keep only those of up to 35 words.
[ ] Part of Speech Tag Frequency (35 features) For this feature set we use the Penn Treebank part-of-speech tag set. We use the Natural Language Toolkit (NLTK[2]) Python library to extract these tags from a corpus. We calculate the relative frequency of each tag.
[ ] Word Specificity Frequency (20 features) The specificity of the words used by an author is a discriminating feature and a relevant predictor in other natural language tasks (Kilian, Krause, Runge, & Smeddinck, 2012; Krause, 2013). However, to our knowledge this feature has not been used for stylometry yet. To estimate the specificity of a word we use WordNet (Miller, 1995). For each word, we predict the lemma of the word and its part of speech. With the lemma and the part of speech, we retrieve all relevant synsets. The algorithm calculates the distance between each synset and the root node of WordNet. We define specificity as the average depth of these synsets rounded to the nearest integer. The algorithm calculates the relative frequency of each depth. The depth is limited to 20, as higher values tend to be extremely rare.
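
The frequency features above share one pattern: count how often each value occurs, drop values outside the allowed range, and normalize. A minimal Python sketch of that pattern follows. It is not the authors' implementation, and every function name in it is hypothetical. It assumes a plain regular expression in place of the tagger-based tokenization, treats `.`, `!` and `?` as sentence ends, and uses all 52 ASCII letters for the character set (the list above states 48 features). The `specificity` helper expects synset depths that have already been looked up, for example through NLTK's WordNet interface, which the source does not spell out.

```python
import re
import string
from collections import Counter


def relative_frequencies(values, bins):
    """Relative frequency of each bin, ignoring values outside `bins`."""
    allowed = set(bins)
    counts = Counter(v for v in values if v in allowed)
    total = sum(counts.values())
    return [counts[b] / total if total else 0.0 for b in bins]


def character_frequencies(text):
    # One feature per ASCII letter; lower and upper case are kept separate.
    return relative_frequencies(text, list(string.ascii_letters))


def word_length_frequencies(text, max_len=20):
    # Assumption: a simple regex stands in for tagger-based tokenization.
    words = re.findall(r"[A-Za-z]+", text)
    return relative_frequencies((len(w) for w in words),
                                list(range(1, max_len + 1)))


def sentence_length_frequencies(text, max_len=35):
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = re.split(r"[.!?]+", text)
    lengths = (len(re.findall(r"[A-Za-z]+", s)) for s in sentences)
    return relative_frequencies(lengths, list(range(1, max_len + 1)))


def specificity(synset_depths, max_depth=20):
    """Average synset depth rounded to an integer, or None if unusable."""
    if not synset_depths:
        return None
    # Note: Python's round() resolves exact .5 ties to the even integer.
    depth = round(sum(synset_depths) / len(synset_depths))
    return depth if depth <= max_depth else None
```

Each function returns a fixed-length list, so the results can be concatenated into a single feature vector per document. Words or sentences beyond the cutoff are simply left out of the normalization, which is one possible reading of "filter" in the descriptions above.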
