We now want to build a classifier for the auto data, with a focus on the numeric data. In the code_for_hw3_part2.py, we have supplied you with the load_auto_data function, which can read the relevant .tsv file. It returns a list of dictionaries, one for each data item.
import csv

def load_auto_data(path_data):
    """
    Returns a list of dicts, one per row of the .tsv file at path_data.
    Keys are the column names; numeric fields are converted to float.
    """
    numeric_fields = {'mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                      'acceleration', 'model_year', 'origin'}
    data = []
    with open(path_data) as f_data:
        for datum in csv.DictReader(f_data, delimiter='\t'):
            for field in list(datum.keys()):
                if field in numeric_fields and datum[field]:
                    datum[field] = float(datum[field])
            data.append(datum)
    return data
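To see what load_auto_data produces without the actual file, here is a small sketch that runs the same DictReader logic on a two-row in-memory .tsv (the column names and values here are illustrative, not the real dataset):

```python
import csv
import io

# Simulated two-row auto-mpg.tsv; only mpg and cylinders treated as numeric here.
tsv = "mpg\tcylinders\tname\n18.0\t8\tchevy impala\n24.0\t4\tdatsun 510\n"

data = []
for datum in csv.DictReader(io.StringIO(tsv), delimiter='\t'):
    for field in list(datum.keys()):
        if field in {'mpg', 'cylinders'} and datum[field]:
            datum[field] = float(datum[field])
    data.append(datum)

print(data[0])  # {'mpg': 18.0, 'cylinders': 8.0, 'name': 'chevy impala'}
```

Each row becomes one dict, with numeric columns converted to float and everything else (like 'name') left as a string.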
We then specify what feature function to use for each column in the data. The file hw3_part2_main.py has an example that constructs the data and label arrays using raw features for all the columns.
# Returns a list of dictionaries.  Keys are the column names, including mpg.
auto_data_all = hw3.load_auto_data('auto-mpg.tsv')

# The choice of feature processing for each feature; mpg is always raw and
# does not need to be specified.  Other choices are hw3.standard and hw3.one_hot.
# 'name' is not numeric and would need a different encoding.
features = [('cylinders', hw3.raw),
            ('displacement', hw3.raw),
            ('horsepower', hw3.raw),
            ('weight', hw3.raw),
            ('acceleration', hw3.raw),
            ## Drop model_year by default
            ## ('model_year', hw3.raw),
            ('origin', hw3.raw)]

# Construct the standard data and label arrays
auto_data, auto_labels = hw3.auto_data_and_labels(auto_data_all, features)
print('auto data and labels shape', auto_data.shape, auto_labels.shape)
In the list features in hw3_part2_main.py, you will find a list of (feature name, feature function) tuples.
There are three options for feature functions: raw, standard and one_hot.
raw uses the original value;
standard subtracts out the mean value and divides by the standard deviation;
one_hot will one-hot encode the input, as described in the notes.
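As a rough sketch, the three feature functions might look like the following. The exact signatures in code_for_hw3_part2.py may differ (standard and one_hot need precomputed statistics or the list of possible values passed in somehow); this is only meant to pin down the three transformations:

```python
def raw(x):
    # use the original value unchanged
    return [x]

def standard(x, mean, stdev):
    # z-score: subtract out the mean, divide by the standard deviation
    return [(x - mean) / stdev]

def one_hot(x, entries):
    # one vector component per possible value: 1 in x's slot, 0 elsewhere
    vec = [0] * len(entries)
    vec[entries.index(x)] = 1
    return vec

print(one_hot(3, [1, 2, 3, 4]))  # [0, 0, 1, 0]
```

Note that raw and standard each contribute one dimension to the feature vector, while one_hot contributes as many dimensions as there are possible values for that column.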
The function auto_data_and_labels processes the dictionaries and returns data, labels. data has dimension (d, 392), where d is the total number of features specified, and labels has dimension (1, 392).
The data in the file is sorted by class, but it will be shuffled when loaded.
We have included staff implementations of perceptron and averaged perceptron in code_for_hw3_part2.py.
We have also included staff implementations of eval_classifier and xval_learning_alg (in the same code file). You should use these functions to report accuracies.
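For intuition about what these two functions do, here is a hedged sketch of k-fold cross-validation in the style of hw3.xval_learning_alg: split the (d, n) data column-wise into k folds, train on k-1 of them, score on the held-out fold, and average the k accuracies. The names and signatures here are illustrative; use the staff versions for your reported numbers.

```python
import numpy as np

def eval_classifier_sketch(learner, data_train, labels_train, data_test, labels_test):
    # train on the training split, report accuracy on the test split
    th, th0 = learner(data_train, labels_train)
    preds = np.sign(np.dot(th.T, data_test) + th0)
    return np.mean(preds == labels_test)

def xval_sketch(learner, data, labels, k=10):
    # split columns (data points) into k folds and average held-out accuracy
    data_split = np.array_split(data, k, axis=1)
    labels_split = np.array_split(labels, k, axis=1)
    score = 0.0
    for i in range(k):
        d_train = np.concatenate(data_split[:i] + data_split[i+1:], axis=1)
        l_train = np.concatenate(labels_split[:i] + labels_split[i+1:], axis=1)
        score += eval_classifier_sketch(learner, d_train, l_train,
                                        data_split[i], labels_split[i])
    return score / k

# Sanity check with a learner that returns a fixed, already-correct separator
data = np.array([[1., -1., 2., -2., 3., -3., 4., -4., 5., -5.]])
labels = np.sign(data)
constant_learner = lambda d, l: (np.ones((1, 1)), np.zeros((1, 1)))
print(xval_sketch(constant_learner, data, labels, k=5))  # 1.0
```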
We know of two algorithm classes: perceptron and averaged perceptron (which we implemented in HW 1). We have several parameters that specify the settings for these learning algorithms.
import numpy as np

# Helper used by both learners (supplied in code_for_hw3_part2.py):
# returns the sign of the signed distance of x from the separator.
def positive(x, theta, theta_0):
    return np.sign(np.dot(theta.T, x) + theta_0)

# Perceptron algorithm with offset.
# data is dimension d by n
# labels is dimension 1 by n
# T is a positive integer number of steps to run
def perceptron(data, labels, params={}, hook=None):
    # if T not in params, default to 50
    T = params.get('T', 50)
    (d, n) = data.shape
    theta = np.zeros((d, 1)); theta_0 = np.zeros((1, 1))
    for t in range(T):
        for i in range(n):
            x = data[:, i:i+1]
            y = labels[:, i:i+1]
            if y * positive(x, theta, theta_0) <= 0.0:
                theta = theta + y * x
                theta_0 = theta_0 + y
                if hook: hook((theta, theta_0))
    return theta, theta_0
def averaged_perceptron(data, labels, params={}, hook=None):
    # if T not in params, default to 100
    T = params.get('T', 100)
    (d, n) = data.shape
    theta = np.zeros((d, 1)); theta_0 = np.zeros((1, 1))
    theta_sum = theta.copy()
    theta_0_sum = theta_0.copy()
    for t in range(T):
        for i in range(n):
            x = data[:, i:i+1]
            y = labels[:, i:i+1]
            if y * positive(x, theta, theta_0) <= 0.0:
                theta = theta + y * x
                theta_0 = theta_0 + y
                if hook: hook((theta, theta_0))
            # accumulate after every example, mistake or not
            theta_sum = theta_sum + theta
            theta_0_sum = theta_0_sum + theta_0
    theta_avg = theta_sum / (T * n)
    theta_0_avg = theta_0_sum / (T * n)
    if hook: hook((theta_avg, theta_0_avg))
    return theta_avg, theta_0_avg
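Before running on the auto data, it can be reassuring to sanity-check the update rule on a tiny linearly separable set. The learner code is repeated here so the snippet runs on its own; the reported accuracies in this problem come from the auto data, not this toy set.

```python
import numpy as np

def positive(x, theta, theta_0):
    return np.sign(np.dot(theta.T, x) + theta_0)

def perceptron(data, labels, params={}, hook=None):
    T = params.get('T', 50)
    (d, n) = data.shape
    theta = np.zeros((d, 1)); theta_0 = np.zeros((1, 1))
    for t in range(T):
        for i in range(n):
            x = data[:, i:i+1]
            y = labels[:, i:i+1]
            if y * positive(x, theta, theta_0) <= 0.0:
                theta = theta + y * x
                theta_0 = theta_0 + y
    return theta, theta_0

# Four points in 2D, separable by the sign of the first coordinate
data = np.array([[1., 2., -1., -2.],
                 [1., 1., -1., -1.]])
labels = np.array([[1., 1., -1., -1.]])

th, th0 = perceptron(data, labels, {'T': 10})
acc = np.mean(np.sign(np.dot(th.T, data) + th0) == labels)
print(acc)  # 1.0 on this separable set
```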
A) Which parameters should we use for the learning algorithm? In the perceptron and averaged perceptron, there is a single parameter, T, the number of iterations.
B) Which features should we use? We have lots of choices here: we can use any subset of the data columns and for each column we have choices of how to compute features.
C) We will use expected accuracy, estimated by 10-fold cross-validation (we have included the definition in the code file), to make these choices of parameters.
We will try two types of algorithms: perceptron and averaged perceptron.
We will try three values of T: 1, 10, and 50.
We will try two feature sets (feature set 1 and feature set 2).
Perform 10-fold cross-validation for all combinations of the two algorithms, three T values, and the two choices of feature sets. It will be worthwhile investing in a piece of code to carry out all of the evaluations, in case you need to do this more than once.
In general, you should shuffle the dataset before evaluating, but for this exercise, please use hw3.xval_learning_alg, which shuffles the dataset for you, so that your results match ours.
Enter accuracies (perceptron, averaged perceptron) for T=1, feature set 1:
Solution: (0.653, 0.844)
Enter accuracies (perceptron, averaged perceptron) for T=1, feature set 2:
Solution: (0.791, 0.9)
Enter accuracies (perceptron, averaged perceptron) for T=10, feature set 1:
Solution: (0.742, 0.837)
Enter accuracies (perceptron, averaged perceptron) for T=10, feature set 2:
Solution: (0.806, 0.898)
Enter accuracies (perceptron, averaged perceptron) for T=50, feature set 1:
Solution: (0.691, 0.837)
Enter accuracies (perceptron, averaged perceptron) for T=50, feature set 2:
Solution: (0.806, 0.901)
You will want to modify the evaluation algorithms so that they take a T argument to pass to the learners.
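One way to organize that piece of code is a single sweep function. This is only a sketch: it assumes hw3.xval_learning_alg has the signature xval(learner, data, labels, k) and that the two feature sets have already been built with auto_data_and_labels; the lambda wrapper is how a T value can reach each learner through its params dict.

```python
def run_sweep(xval, learners, feature_sets, T_values=(1, 10, 50), k=10):
    """Cross-validate every (learner, T, feature set) combination.

    learners: dict of name -> learner(data, labels, params)
    feature_sets: dict of name -> (data, labels)
    xval(learner, data, labels, k) -> accuracy
    """
    results = {}
    for algo_name, algo in learners.items():
        for T in T_values:
            for fs_name, (data, labels) in feature_sets.items():
                # bind algo and T as default args so each lambda is independent
                learner = lambda d, l, algo=algo, T=T: algo(d, l, {'T': T})
                results[(algo_name, T, fs_name)] = xval(learner, data, labels, k)
    return results
```

With the staff code this would be invoked roughly as run_sweep(hw3.xval_learning_alg, {'perceptron': hw3.perceptron, 'averaged perceptron': hw3.averaged_perceptron}, feature_sets), then printed as a table.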
Summary of the 10-fold cross-validation accuracies on feature set 2:

T                   | 1     | 10    | 50
------------------- | ----- | ----- | -----
perceptron          | 0.791 | 0.806 | 0.806
averaged perceptron | 0.9   | 0.898 | 0.901