MPEDS / mpeds

Machine-learning Protest Event Data System
http://mpeds.github.io
MIT License

Closed-ended classifiers: Inconsistency in handling blank strings #4

Open erleholgersen opened 7 years ago

erleholgersen commented 7 years ago

Blank strings, and strings that contain none of the feature words, are currently handled differently by the three closed-ended classifiers.

For form and target, such strings are predicted to belong to the most common class in the training set (rally/demonstration and domestic government, respectively). For issue, they are classified as none, which is not the most common class in the training set.

See page 19 of chapter 2 of Alex's thesis, and the following example code:

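import pandas as pd
from mpeds.classify_protest import MPEDS

test_classifier = MPEDS()

# A blank string and a string containing none of the feature words
test_data = pd.Series(['', 'avocados and grapefruits'])

# Issue returns none; form and target return the most common training class
test_classifier.getIssue(test_data)
test_classifier.getForm(test_data)
test_classifier.getTarget(test_data)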

alexhanna commented 7 years ago

Oh, that seems bizarre. We should probably return None in this case and raise a warning that says something like "No words found in vectorizer."
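
A minimal sketch of that behaviour, to make the proposal concrete. It is not the MPEDS implementation: `VOCABULARY`, `has_feature_words`, `classify_or_none` and `toy_predict` are hypothetical names, and the small word set stands in for the vocabulary of the real fitted vectorizers.

```python
import re
import warnings

# Hypothetical stand-in for a fitted vectorizer's vocabulary (assumption).
VOCABULARY = {"march", "rally", "strike", "police"}


def has_feature_words(text, vocabulary=VOCABULARY):
    """Return True if at least one token of `text` is in the vocabulary."""
    tokens = re.findall(r"\w+", text.lower())
    return any(token in vocabulary for token in tokens)


def classify_or_none(texts, predict, vocabulary=VOCABULARY):
    """Apply `predict` to each text; return None, with a warning, for any
    text that contains no feature words."""
    results = []
    for text in texts:
        if has_feature_words(text, vocabulary):
            results.append(predict(text))
        else:
            warnings.warn("No words found in vectorizer.")
            results.append(None)
    return results


def toy_predict(text):
    """Toy predictor for illustration only; always returns the same label."""
    return "rally/demonstration"


# Record the warnings so they can be inspected alongside the labels.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    labels = classify_or_none(
        ["", "avocados and grapefruits", "a rally downtown"], toy_predict
    )

print(labels)  # [None, None, 'rally/demonstration']
```

Applying the same check in front of all three classifiers would make issue, form and target treat such strings identically, which is the inconsistency reported above.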