Pin-Jiun / Machine-Learning-MIT


CH4- Evaluating algorithmic and feature choices #8

Open Pin-Jiun opened 1 year ago

Pin-Jiun commented 1 year ago

You will want to modify the evaluation algorithms so that they take a T argument to pass to the learners.

Evaluating algorithmic and feature choices for AUTO data

mpg cylinders   displacement    horsepower  weight  acceleration    model_year  origin  car_name
-1  8   304 193 4732    18.5    70  1   "hi 1200d"
-1  8   307 200 4376    15  70  1   "chevy c20"
-1  8   360 215 4615    14  70  1   "ford f250"
-1  8   318 210 4382    13.5    70  1   "dodge d200"
-1  8   350 180 3664    11  73  1   "oldsmobile omega"
-1  8   400 150 4997    14  73  1   "chevrolet impala"
-1  8   429 208 4633    11  72  1   "mercury marquis"
-1  8   350 160 4456    13.5    72  1   "oldsmobile delta 88 royale"
-1  8   350 180 4499    12.5    73  1   "oldsmobile vista cruiser"
-1  8   383 180 4955    11.5    71  1   "dodge monaco (sw)"
-1  8   400 167 4906    12.5    73  1   "ford country"

We now want to build a classifier for the auto data, with a focus on the numeric data. In the code_for_hw3_part2.py, we have supplied you with the load_auto_data function, which can read the relevant .tsv file. It returns a list of dictionaries, one for each data item.


import csv

def load_auto_data(path_data):
    """
    Returns a list of dicts, one per data item, with keys:
    mpg, cylinders, displacement, horsepower, weight, acceleration,
    model_year, origin, car_name (numeric fields are converted to float).
    """
    numeric_fields = {'mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                      'acceleration', 'model_year', 'origin'}
    data = []
    with open(path_data) as f_data:
        for datum in csv.DictReader(f_data, delimiter='\t'):
            for field in list(datum.keys()):
                if field in numeric_fields and datum[field]:
                    datum[field] = float(datum[field])
            data.append(datum)
    return data

We then specify what feature function to use for each column in the data. The file hw3_part2_main.py has an example that constructs the data and label arrays using raw features for all the columns.

# Returns a list of dictionaries.  Keys are the column names, including mpg.
auto_data_all = hw3.load_auto_data('auto-mpg.tsv')

# The choice of feature processing for each feature, mpg is always raw and
# does not need to be specified.  Other choices are hw3.standard and hw3.one_hot.
# 'name' is not numeric and would need a different encoding.
features = [('cylinders', hw3.raw),
            ('displacement', hw3.raw),
            ('horsepower', hw3.raw),
            ('weight', hw3.raw),
            ('acceleration', hw3.raw),
            ## Drop model_year by default
            ## ('model_year', hw3.raw),
            ('origin', hw3.raw)]

# Construct the standard data and label arrays
auto_data, auto_labels = hw3.auto_data_and_labels(auto_data_all, features)
print('auto data and labels shape', auto_data.shape, auto_labels.shape)

In the features list in hw3_part2_main.py, you will find a list of (feature name, feature function) tuples. There are three options for feature functions: raw, standard, and one_hot.

  1. raw uses the original value;
  2. standard subtracts out the mean value and divides by the standard deviation;
  3. one_hot will one-hot encode the input, as described in the notes.
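
To make these three options concrete, here is a minimal, self-contained sketch of what such feature functions could compute. These are illustrative stand-ins, not the actual hw3.raw, hw3.standard, and hw3.one_hot implementations, whose signatures and output layout may differ:

import numpy as np

def raw_feature(values):
    # Use the original numeric values unchanged.
    return np.array(values, dtype=float)

def standard_feature(values):
    # Subtract the mean and divide by the standard deviation (z-score).
    v = np.array(values, dtype=float)
    return (v - v.mean()) / v.std()

def one_hot_feature(values):
    # Map each distinct value to a unit vector; returns a (k, n) array,
    # where k is the number of distinct values seen in the column.
    entries = sorted(set(values))
    index = {e: i for i, e in enumerate(entries)}
    out = np.zeros((len(entries), len(values)))
    for j, val in enumerate(values):
        out[index[val], j] = 1.0
    return out

# Example: one-hot encoding of a small 'cylinders' column
print(one_hot_feature([8, 4, 6, 8, 4]))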

The function auto_data_and_labels processes the dictionaries and returns data, labels. data has dimension (d, 392), where d is the total number of features specified, and labels has dimension (1, 392). The data in the file is sorted by class, but it will be shuffled when loaded.

We have included staff implementations of perceptron and average perceptron in code_for_hw3_part2.py.

We have also included staff implementations of eval_classifier and xval_learning_alg (in the same code file). You should use these functions to report accuracies.

def eval_classifier(learner, data_train, labels_train, data_test, labels_test):
    # Train on the training split, then report the fraction of test points
    # classified correctly.  score (defined in code_for_hw3_part2.py) counts
    # the correctly classified points.
    th, th0 = learner(data_train, labels_train)
    return score(data_test, labels_test, th, th0)/data_test.shape[1]

def xval_learning_alg(learner, data, labels, k):
    # k-fold cross-validation: shuffle the columns (with a fixed seed so the
    # results are reproducible), split into k folds, and average the test accuracies.
    _, n = data.shape
    idx = list(range(n))
    np.random.seed(0)
    np.random.shuffle(idx)
    data, labels = data[:,idx], labels[:,idx]

    s_data = np.array_split(data, k, axis=1)
    s_labels = np.array_split(labels, k, axis=1)

    score_sum = 0
    for i in range(k):
        data_train = np.concatenate(s_data[:i] + s_data[i+1:], axis=1)
        labels_train = np.concatenate(s_labels[:i] + s_labels[i+1:], axis=1)
        data_test = np.array(s_data[i])
        labels_test = np.array(s_labels[i])
        score_sum += eval_classifier(learner, data_train, labels_train,
                                              data_test, labels_test)
    return score_sum/k
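
For example, to cross-validate the perceptron with a specific T without modifying the staff code, the learner can be wrapped in a lambda (a small usage sketch, calling the functions through the hw3 module as in hw3_part2_main.py and using the auto_data and auto_labels arrays built above):

# Sketch: 10-fold cross-validated accuracy of the perceptron with T = 10.
acc = hw3.xval_learning_alg(lambda d, l: hw3.perceptron(d, l, {'T': 10}),
                            auto_data, auto_labels, 10)
print('perceptron, T=10, 10-fold accuracy:', acc)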

4.1) Making choices

We know of two algorithm classes: perceptron and averaged perceptron (which we implemented in HW 1). We have several parameters that specify the settings for these learning algorithms.

# Perceptron code

# Perceptron algorithm with offset.
# data is dimension d by n
# labels is dimension 1 by n
# T is a positive integer number of steps to run
# positive(x, theta, theta_0), defined in code_for_hw3_part2.py, returns the
# sign of theta.T @ x + theta_0.
def perceptron(data, labels, params = {}, hook = None):
    # if T not in params, default to 5
    T = params.get('T', 5)
    (d, n) = data.shape

    theta = np.zeros((d, 1)); theta_0 = np.zeros((1, 1))
    for t in range(T):
        for i in range(n):
            x = data[:,i:i+1]
            y = labels[:,i:i+1]
            if y * positive(x, theta, theta_0) <= 0.0:
                theta = theta + y * x
                theta_0 = theta_0 + y
                if hook: hook((theta, theta_0))
    return theta, theta_0

def averaged_perceptron(data, labels, params = {}, hook = None):
    # if T not in params, default to 100
    T = params.get('T', 100)
    (d, n) = data.shape

    theta = np.zeros((d, 1)); theta_0 = np.zeros((1, 1))
    theta_sum = theta.copy()
    theta_0_sum = theta_0.copy()
    for t in range(T):
        for i in range(n):
            x = data[:,i:i+1]
            y = labels[:,i:i+1]
            if y * positive(x, theta, theta_0) <= 0.0:
                theta = theta + y * x
                theta_0 = theta_0 + y
                if hook: hook((theta, theta_0))
            theta_sum = theta_sum + theta
            theta_0_sum = theta_0_sum + theta_0
    theta_avg = theta_sum / (T*n)
    theta_0_avg = theta_0_sum / (T*n)
    if hook: hook((theta_avg, theta_0_avg))
    return theta_avg, theta_0_avg

A) Which parameters should we use for the learning algorithm? In the perceptron and averaged perceptron, there is a single parameter, T, the number of iterations.

B) Which features should we use? We have lots of choices here: we can use any subset of the data columns and for each column we have choices of how to compute features.

C) We will use expected accuracy, estimated by 10-fold cross-validation (we have included the definition in the code file), to make these choices of parameters.

We will try two types of algorithms, perceptron and averaged perceptron; three values of T (1, 10, 50); and two feature sets:

[cylinders=raw, displacement=raw, horsepower=raw, weight=raw, acceleration=raw, origin=raw]
[cylinders=one_hot, displacement=standard, horsepower=standard, weight=standard, acceleration=standard, origin=one_hot]

Perform 10-fold cross-validation for all combinations of the two algorithms, three T values, and the two choices of feature sets. It will be worthwhile investing in a piece of code to carry out all of the evaluations, in case you need to do this more than once.

In general, you should shuffle the dataset before evaluating, but for this exercise, please use hw3.xval_learning_alg, which shuffles the dataset for you, so that your results match ours.
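
A minimal driver for all twelve combinations might look like the sketch below. This is only an illustration: it assumes the hw3 helpers shown above and the auto_data_all list loaded earlier, and the feature-set definitions are copied from the list above.

# Sketch: evaluate every (feature set, algorithm, T) combination with 10-fold CV.
algorithms = {'perceptron': hw3.perceptron,
              'averaged perceptron': hw3.averaged_perceptron}
T_values = [1, 10, 50]
feature_sets = {
    1: [('cylinders', hw3.raw), ('displacement', hw3.raw), ('horsepower', hw3.raw),
        ('weight', hw3.raw), ('acceleration', hw3.raw), ('origin', hw3.raw)],
    2: [('cylinders', hw3.one_hot), ('displacement', hw3.standard),
        ('horsepower', hw3.standard), ('weight', hw3.standard),
        ('acceleration', hw3.standard), ('origin', hw3.one_hot)],
}

for fs, features in feature_sets.items():
    data, labels = hw3.auto_data_and_labels(auto_data_all, features)
    for alg_name, alg in algorithms.items():
        for T in T_values:
            # Bind alg and T in the lambda defaults so each iteration keeps its own values.
            learner = lambda d, l, alg=alg, T=T: alg(d, l, {'T': T})
            acc = hw3.xval_learning_alg(learner, data, labels, 10)
            print('feature set %d, %s, T=%d: %.3f' % (fs, alg_name, T, acc))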


Enter accuracies (perceptron, averaged perceptron) for T=1, feature set 1: Solution: (0.653, 0.844)

Enter accuracies (perceptron, averaged perceptron) for T=1, feature set 2: Solution: (0.791, 0.9)

Enter accuracies (perceptron, averaged perceptron) for T=10, feature set 1: Solution: (0.742, 0.837)

Enter accuracies (perceptron, averaged perceptron) for T=10, feature set 2: Solution: (0.806, 0.898)

Enter accuracies (perceptron, averaged perceptron) for T=50, feature set 1: Solution: (0.691, 0.837)

Enter accuracies (perceptron, averaged perceptron) for T=50, feature set 2: Solution: (0.806, 0.901)


Accuracies for feature set 2:

T | 1 | 10 | 50
-- | -- | -- | --
perceptron | 0.791 | 0.806 | 0.806
averaged perceptron | 0.9 | 0.898 | 0.901

Which algorithm class is typically more effective? Solution: Averaged Perceptron

For the better algorithm, which combination of T and feature set would you use? Consider expected accuracy to be of primary importance, but take running time into account for near ties in accuracy.

Solution: '(1, 2) or (10, 2)'

Explanation: (1, 2) is the better choice because it gets nearly as good accuracy as using more iterations, but does far less work. We accept (10, 2) as well because before we did staff revisions to this question, it was better, and we forgot to fix the answer!

Based on the values of the coefficients, which feature has the most impact on the output predictions? Solution: cylinders or weight, depending on how you compute this
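
One hedged way to look at this is to train the chosen classifier on the full data and inspect the learned theta. With feature set 2, the standardized columns are roughly comparable by coefficient magnitude, while a one-hot feature such as cylinders spreads over several columns that have to be considered together. A small sketch, assuming auto_data and auto_labels were built with feature set 2:

# Sketch: inspect the learned coefficients of the averaged perceptron (T = 10).
th, th0 = hw3.averaged_perceptron(auto_data, auto_labels, {'T': 10})
print(th.flatten())   # one coefficient per row (feature) of auto_data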

# Example: 10-fold cross-validated accuracy of the averaged perceptron (default T)
print(hw3.xval_learning_alg(hw3.averaged_perceptron, auto_data, auto_labels, 10))
Pin-Jiun commented 1 year ago

Evaluating algorithmic and feature choices for review data

The load_review_data function can be used to read the .tsv file and return the labels and texts.

def load_review_data(path_data):
    """
    Returns a list of dict with keys:
    * sentiment: +1 or -1 if the review was positive or negative, respectively
    * text: the text of the review
    """
    basic_fields = {'sentiment', 'text'}
    data = []
    with open(path_data) as f_data:
        for datum in csv.DictReader(f_data, delimiter='\t'):
            for field in list(datum.keys()):
                if field not in basic_fields:
                    del datum[field]
            if datum['sentiment']:
                datum['sentiment'] = int(datum['sentiment'])
            data.append(datum)
    return data

After loading, the dicts look like this:

{'sentiment': 1, 'text': 'This licorice is sooooo good. Soft, very flavorful. Hard to find in stores<br />now- this flavor- black cherry- so I would recommend.'}, {'sentiment': -1, 'text': 'Why do all those K Cups taste like instant coffee? Ive tried so many brands as I really like the simplicity of the<br />Keurig coffeemaker. The decaf is even worse. Somebody tell me of a brand that really tastes like brewed coffee.'}

The bag_of_words function takes the raw data and returns a dictionary of unigram words.


def bag_of_words(texts):
    """
    Inputs a list of string reviews
    Returns a dictionary of unique unigrams occurring over the input

    Feel free to change this code as guided by Section 3 (e.g. remove stopwords, add bigrams etc.)
    """
    dictionary = {} # maps word to unique index
    for text in texts:
        word_list = extract_words(text)
        for word in word_list:
            if word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary

The resulting dictionary is an input to extract_bow_feature_vectors which computes a feature matrix of ones and zeros that can be used as the input for the classification algorithms.


def extract_bow_feature_vectors(reviews, dictionary):
    """
    Inputs a list of string reviews
    Inputs the dictionary of words as given by bag_of_words
    Returns the bag-of-words feature matrix representation of the data.
    The returned matrix is of shape (m, n), where m is the total number of
    entries in the dictionary and n is the number of reviews (the feature
    vectors are columns).
    """

    num_reviews = len(reviews)
    feature_matrix = np.zeros([num_reviews, len(dictionary)])

    for i, text in enumerate(reviews):
        word_list = extract_words(text)
        for word in word_list:
            if word in dictionary:
                feature_matrix[i, dictionary[word]] = 1
    # We want the feature vectors as columns
    return feature_matrix.T

# Returns lists of dictionaries.  Keys are the column names, 'sentiment' and 'text'.
# The train data has 10,000 examples
review_data = hw3.load_review_data('reviews.tsv')

# Lists texts of reviews and list of labels (1 or -1)
review_texts, review_label_list = zip(*((sample['text'], sample['sentiment']) for sample in review_data))

# The dictionary of all the words for "bag of words"
dictionary = hw3.bag_of_words(review_texts)

# The standard data arrays for the bag of words
review_bow_data = hw3.extract_bow_feature_vectors(review_texts, dictionary)
review_labels = hw3.rv(review_label_list)
print('review_bow_data and labels shape', review_bow_data.shape, review_labels.shape)
# output: review_bow_data and labels shape (19945, 10000) (1, 10000)

5.1) Making choices

We have two algorithm classes: perceptron and averaged perceptron. We have a couple of choices of parameters to make to completely specify the learning algorithms.

def get_classification_accuracy(data, labels):
    """
    @param data (d,n) array
    @param labels (1,n) array
    """
    return xval_learning_alg(lambda data, labels: perceptron(data, labels, {"T": 50}), data, labels, 10)

Which parameters should we use for the learning algorithm? In the perceptron and averaged perceptron, there is a single parameter, T, the number of iterations.

Perform 10-fold cross-validation for all combinations of the two algorithms and three T values (1, 10, 50). Record the accuracies for each combination (there should be 6 values total).
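
A sketch of that loop, in the same style as the auto-data driver above (it reuses hw3.xval_learning_alg and the review_bow_data / review_labels arrays built earlier; expect it to be slow on the full 10,000-example bag-of-words data):

# Sketch: 10-fold CV accuracies for all six (algorithm, T) combinations on the review data.
for alg_name, alg in [('perceptron', hw3.perceptron),
                      ('averaged perceptron', hw3.averaged_perceptron)]:
    for T in [1, 10, 50]:
        learner = lambda d, l, alg=alg, T=T: alg(d, l, {'T': T})
        acc = hw3.xval_learning_alg(learner, review_bow_data, review_labels, 10)
        print('%s, T=%d: %.3f' % (alg_name, T, acc))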

Which features should we use? We could do variations of bag-of-words, for example, simply indicating whether a word is present or, alternatively, using a count of how many times it is present. We can also remove commonly used words with little information; the code distribution includes a file of those words: stopwords.txt. You're welcome to explore these on your own; we'll use only a binary indicator for all the words.
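
For instance, a stopword-filtering variant of bag_of_words might look like the sketch below. This is an illustration only; it assumes stopwords.txt from the code distribution contains one word per line.

def bag_of_words_no_stopwords(texts, stopwords_path='stopwords.txt'):
    # Same as bag_of_words above, but skip any word that appears in the stopword list.
    with open(stopwords_path) as f:
        stopwords = set(line.strip() for line in f)
    dictionary = {}  # maps word to unique index
    for text in texts:
        for word in extract_words(text):
            if word not in stopwords and word not in dictionary:
                dictionary[word] = len(dictionary)
    return dictionary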

Which algorithm class is typically more effective? Solution: Averaged Perceptron

5.1E) For the better algorithm, which value of T would you use? Consider expected accuracy to be of primary importance, but take running time into account for near ties in accuracy. Solution: 10

# Analyze review data
print(hw3.get_classification_accuracy(review_bow_data, review_labels))

For the better algorithm and best value of T, what is your accuracy? Solution: 0.823


5.2) Analysis

For the best algorithm and best T, construct your best classifier.

5.2A) What are the 10 most positive words in the dictionary, that is, the words that contribute most to a positive prediction? Solution: ['great', 'delicious', 'perfect', 'excellent', 'satisfied', 'yummy', 'easily', 'individually', 'bright', 'skeptical']

5.2B) What are the 10 most negative words in the dictionary, that is, the words that contribute most to a negative prediction? Solution: ['worst', 'awful', 'poor', 'horrible', 'unfortunately', 'formula', 'bland', 'stuck', 'disappointment', 'changed']


# Train the best classifier on all of the review data (this call uses the
# default T of averaged_perceptron; pass {'T': 10} to use the value chosen above).
th, th0 = hw3.averaged_perceptron(review_bow_data, review_labels)

# Indices of the 10 largest coefficients in th, i.e. the most positive words:
# keep every entry strictly greater than the 11th-largest value.
max_10_id = np.where(th > np.sort(th.flatten())[-11])
# print(max_10_id)
# (array([ 348,  740, 1041, 1105, 1810, 2267, 2329, 2413, 3011], dtype=int64), array([0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64))
max_10_id = max_10_id[0]

# Map column indices back to words using the reversed dictionary.
d = hw3.reverse_dict(dictionary)

for i in max_10_id:
    print(d[i])
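
The most negative words (5.2B) can be extracted symmetrically; a minimal sketch reusing th and d from above (argsort is used here instead of the threshold trick, purely for brevity):

# Hypothetical follow-up, not in the original snippet: indices of the 10 most
# negative coefficients, i.e. the words that push a prediction toward -1.
min_10_id = np.argsort(th.flatten())[:10]
for i in min_10_id:
    print(d[i])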