Data structure after preprocessing: Feature Extraction input

alexwalterbos / nlp_fake_news

Group repository for the IN4325 NLP project group NLP_Fake_News

Talos' solution writes a file called data.pkl after preprocessing. Loading that file from disk means the preprocessing does not have to be rerun every time the analysis runs. It can be loaded as follows:

import cPickle as cp  # Python 2 only; on Python 3 use `import pickle as cp`

# Assumes 'data.pkl' is a file in the current working directory
with open('data.pkl', 'rb') as f:
    data = cp.load(f)

data is now a pandas.DataFrame. A few attributes help with inspecting its structure: data.axes returns the row and column labels, and data.shape gives the dimensions as (rows, columns). See the pandas.DataFrame documentation for more helper methods.
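The load-once idea can also be wrapped in a small load-or-build helper. The following is only a minimal sketch: `preprocess()` is a hypothetical stand-in for the real preprocessing step, `load_or_build` and the temporary cache path are names invented for illustration, and it uses Python 3's `pickle` rather than `cPickle`.

```python
import os
import pickle
import tempfile


def preprocess():
    # Hypothetical stand-in for the real (slow) preprocessing step.
    return {"Headline": ["a b c"], "Stance": ["agree"]}


def load_or_build(path, build):
    """Return the object cached at `path`; build and cache it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = build()
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data


# Illustrative cache location; the real file lives wherever data.pkl is written.
cache_path = os.path.join(tempfile.mkdtemp(), "data.pkl")
first = load_or_build(cache_path, preprocess)   # runs preprocess, writes the cache
second = load_or_build(cache_path, preprocess)  # reads the cache instead
```

The second call never invokes `preprocess`, which is the behaviour we want when the analysis is run repeatedly.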

I will map out the structure of the file here, which will give us an idea of the data structure that we'll use as input for the feature extraction functions.

>>> data.shape
(75385, 11)  # 75385 entries, 11 columns
>>> data.axes
# Not interesting, this is just the number of articles
[Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
             ...
             25403, 25404, 25405, 25406, 25407, 25408, 25409, 25410, 25411, 25412],
            dtype='int64', length=75385),
 # Interesting: these are the generated columns. I've added descriptions per column.
 Index([
     u'Body ID',      # just index numbers in a pandas.Series
     u'Headline',     # all headlines
     u'Stance',       # stances, one of { 'agree', 'disagree', 'discuss', 'unrelated', NaN }
     u'articleBody',  # all article bodies
     u'target',       # float64 values indicating stance; {'unrelated': 3, 'disagree': 1, 'agree': 0, 'discuss': 2}
     # Below columns are self-descriptive. Per article, they are a `list` of n-grams joined with '_'
     u'Headline_unigram',
     u'articleBody_unigram',
     u'Headline_bigram',
     u'articleBody_bigram',
     u'Headline_trigram',
     u'articleBody_trigram'
 ], dtype='object')]
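To make the column descriptions concrete, here is a rough sketch of what a single row might look like for the headline columns. It is an illustration under assumptions, not the actual preprocessing: the whitespace tokenizer, the `build_row` and `ngrams` helpers, and the choice of NaN as `target` for a missing stance are all hypothetical. Only the stance-to-number mapping and the '_'-joined n-gram format come from the structure above.

```python
import math

# Mapping taken from the description of the 'target' column.
STANCE_TO_TARGET = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}


def ngrams(tokens, n):
    """Return all n-grams of `tokens`, each joined with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_row(headline, stance):
    # Assumption: naive lowercase + whitespace tokenization; the real
    # preprocessing may tokenize, stem, or filter differently.
    tokens = headline.lower().split()
    return {
        "Headline": headline,
        "Stance": stance,
        # Assumption: a missing stance maps to NaN in the float 'target' column.
        "target": float(STANCE_TO_TARGET.get(stance, float("nan"))),
        "Headline_unigram": ngrams(tokens, 1),
        "Headline_bigram": ngrams(tokens, 2),
        "Headline_trigram": ngrams(tokens, 3),
    }


row = build_row("Police find mass graves", "discuss")
print(row["Headline_bigram"])   # ['police_find', 'find_mass', 'mass_graves']
print(row["Headline_trigram"])  # ['police_find_mass', 'find_mass_graves']
print(row["target"])            # 2.0
```

A feature extraction function can then take these lists as input directly, for example to count the overlap between a headline's and a body's n-grams.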

Data structure after preprocessing: Feature Extraction input #9