alexwalterbos / nlp_fake_news

Group repository for the IN4325 NLP project group NLP_Fake_News

Data structure after preprocessing: Feature Extraction input #9

Open alexwalterbos opened 6 years ago

alexwalterbos commented 6 years ago

Talos' solution writes a file called data.pkl after preprocessing. Loading that file from disk means we don't have to rerun the preprocessing every time the analysis runs. It can be loaded as follows:

import cPickle as cp  # Python 2 only; Python 3 merged cPickle into pickle

# Assumes 'data.pkl' is a file in the current working directory
with open('data.pkl', 'rb') as f:  # 'f' avoids shadowing the built-in 'file'
    data = cp.load(f)

data is now a pandas.DataFrame. A few attributes help when analyzing the data structure: data.axes gives the row and column labels, and data.shape gives the dimensions as (rows, columns). See the pandas.DataFrame documentation for more helper methods.
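
For anyone who wants to try these attributes without the real file, here is a minimal sketch on a made-up stand-in frame. The name `toy` and its values are hypothetical; only three of the real column names are reused, and the real data.pkl is of course much larger.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real frame; the values are made up.
toy = pd.DataFrame(
    {
        'Body ID': [0, 1],
        'Headline': ['first headline', 'second headline'],
        'Stance': ['agree', 'unrelated'],
    },
    columns=['Body ID', 'Headline', 'Stance'],  # fix the column order explicitly
)

n_rows, n_cols = toy.shape        # (rows, columns)
row_index, col_index = toy.axes   # axes[0] = row labels, axes[1] = column labels

print('%d rows, %d columns' % (n_rows, n_cols))
print(list(col_index))            # column names as a plain list
print(toy.dtypes)                 # dtype of each column
print(toy.head(1))                # first row only
```

The same calls should work unchanged on data once it has been loaded.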

I will map out the structure of the file here, which will give us an idea of the data structure that we'll use as input for the feature extraction functions.

alexwalterbos commented 6 years ago

With data being what's in 'data.pkl', loaded with cPickle:

>>> data.shape
(75385, 11) # 75385 entries, 11 columns
>>> data.axes
# Not interesting: this is just the row index, one label per entry
[Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            25403, 25404, 25405, 25406, 25407, 25408, 25409, 25410, 25411,
            25412],
           dtype='int64', length=75385),
# Interesting: these are the generated columns. I've added descriptions per column.
 Index([
            u'Body ID', # just index numbers in a pandas.Series
            u'Headline', # all headlines
            u'Stance', # stances, one of { 'agree', 'disagree', 'discuss', 'unrelated', NaN } 
            u'articleBody', # all article bodies
            u'target', # float64 values indicating stance; {'unrelated': 3, 'disagree': 1, 'agree': 0, 'discuss': 2}
            # The columns below are self-descriptive. Per article, each cell is a `list` of n-grams, with the words of each n-gram joined with '_'
            u'Headline_unigram',
            u'articleBody_unigram', 
            u'Headline_bigram',
            u'articleBody_bigram',
            u'Headline_trigram',
            u'articleBody_trigram'
],
      dtype='object')]
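
To make the derived columns concrete, here is a rough sketch of how one row's n-gram lists and target value could be produced. This is not Talos' actual code: the helper `ngrams`, the example headline, and the naive whitespace tokenisation are all assumptions for illustration. Only the label mapping and the '_' joiner are taken from the column descriptions above.

```python
# Label mapping copied from the 'target' column description above.
STANCE_TO_TARGET = {'agree': 0, 'disagree': 1, 'discuss': 2, 'unrelated': 3}


def ngrams(tokens, n, joiner='_'):
    """Return all contiguous n-grams of `tokens`, each joined with `joiner`.

    Hypothetical helper; returns an empty list when there are fewer
    than n tokens.
    """
    return [joiner.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up headline; splitting on whitespace is an assumption, the real
# preprocessing may tokenise, stem, or filter words differently.
tokens = 'police find mass graves'.split()

unigrams = ngrams(tokens, 1)   # analogous to Headline_unigram
bigrams = ngrams(tokens, 2)    # analogous to Headline_bigram
trigrams = ngrams(tokens, 3)   # analogous to Headline_trigram

# 'target' is stored as float64 in the frame, hence the float() here.
target = float(STANCE_TO_TARGET['discuss'])

print(bigrams)
print(trigrams)
print(target)
```

So a feature extraction function can expect, per article, plain Python lists of strings in the n-gram columns and a numeric label in target.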