Open alexwalterbos opened 6 years ago
With `data` being what's in 'data.pkl', loaded with cPickle:
>>> data.shape
(75385, 11) # 75385 entries, 11 columns
>>> data.axes
# Not interesting, this is just the number of articles
[Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
25403, 25404, 25405, 25406, 25407, 25408, 25409, 25410, 25411, 25412],
dtype='int64', length=75385),
# Interesting: these are the generated columns. I've added descriptions per column.
Index([
u'Body ID', # just index numbers in a pandas.Series
u'Headline', # all headlines
u'Stance', # stances, one of { 'agree', 'disagree', 'discuss', 'unrelated', NaN }
u'articleBody', # all article bodies
u'target', # float64 values indicating stance; {'unrelated': 3, 'disagree': 1, 'agree': 0, 'discuss': 2}
# The columns below are self-explanatory. Per article, each is a `list` of n-grams joined with '_'
u'Headline_unigram',
u'articleBody_unigram',
u'Headline_bigram',
u'articleBody_bigram',
u'Headline_trigram',
u'articleBody_trigram'
],
dtype='object')]
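To make the derived columns concrete, here is a minimal sketch of how the n-gram lists and the `target` values could be reproduced. It is not Talos' actual preprocessing code: the helper names `ngrams` and `stance_to_target`, the whitespace tokenization, and the sample headline are all assumptions made for illustration. Only the stance-to-number mapping and the '_' join come from the column descriptions above.

```python
# Mapping copied from the description of the 'target' column above.
STANCE_TO_TARGET = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, each run joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def stance_to_target(stance):
    """Map a stance label to a float target; missing or unknown labels give NaN."""
    return float(STANCE_TO_TARGET.get(stance, float("nan")))


# Hypothetical headline, tokenized with a plain whitespace split (an assumption).
tokens = "police find mass graves".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Under these assumptions, `bigrams` holds entries such as `police_find`, which is the shape of value I'd expect to find in `Headline_bigram` and `articleBody_bigram`.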
Talos' solution generates a file called `data.pkl` after preprocessing. That file can be loaded from the file system, so the preprocessing does not have to run every time the analysis runs. Once loaded with cPickle, `data` is a pandas.DataFrame. There are some methods available to analyze the data structure: `data.axes` gives you the labels of the rows and columns, and `data.shape` gives the matrix dimensions. See the provided link for more helper methods.

I will map out the structure of the file here, which will give us an idea of the data structure that we'll use as input for the feature extraction functions.
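The loading step can be sketched roughly as below. This is a Python 3 sketch using the standard `pickle` module rather than the Python 2 `cPickle` mentioned above. The function name `load_preprocessed` and the latin1 fallback are my own assumptions, not part of Talos' code. Unpickling the real `data.pkl` also requires pandas to be installed.

```python
import pickle


def load_preprocessed(path="data.pkl"):
    """Unpickle the preprocessed data from disk and return it."""
    with open(path, "rb") as f:
        try:
            return pickle.load(f)
        except UnicodeDecodeError:
            # Pickles written by Python 2's cPickle can contain byte strings
            # that Python 3 fails to decode as ASCII; retry with latin1.
            f.seek(0)
            return pickle.load(f, encoding="latin1")
```

With the real file, `data = load_preprocessed()` followed by `data.shape` and `data.axes` should give output like the listing above. If the pickle was written by a very different pandas version, `pandas.read_pickle("data.pkl")` may be worth trying instead.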