When running reconfig.py on large datasets such as JavaGC and VP9 (more than 100,000 samples), an unexpected ValueError is raised. Here is the error trace:
Traceback (most recent call last):
  File "reconfig.py", line 882, in <module>
    reconfig()
  File "reconfig.py", line 856, in reconfig
    predict_on_validation_set()
  File "reconfig.py", line 142, in predict_on_validation_set
    cart_predicted = carts(sub_train_set_rank, dataset_to_test)
  File "reconfig.py", line 58, in carts
    predicted = model.predict(test)
  File "C:\Users\yongfeng\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\tree\tree.py", line 430, in predict
    X = self._validate_X_predict(X, check_input)
  File "C:\Users\yongfeng\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\tree\tree.py", line 402, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 44 and input n_features is 39
Function predict_on_validation_set():
def predict_on_validation_set():
    """
    Note: use the sub_train set to predict on the validation set,
    save the results into "../temp_data/ltr_trainset/${project}/ltr_trainset_XX.csv"
    """
    datafolder = "../raw_data/"
    trainfolder = "../parse_data/sub_train/"
    split_datafolder = "../parse_data/data_split/"
    ...
    sub_train_set_rank_raw = pd.read_csv(sub_train_data[fileindex])
    sub_train_set_rank = read_data(sub_train_set_rank_raw)
    validation = update_data(validation_set)
    dataset_to_test = validation
    print("sub_train:", sub_train_data[fileindex], " validation:", csvfile)
    cart_predicted = carts(sub_train_set_rank, dataset_to_test) ### Line 142
Function carts(train, test):
def carts(train, test):
    """
    Note: use CART to predict performance in test set
    """
    train_independent = [t.decision for t in train]
    train_dependent = [t.objective[-1] for t in train]
    # Drop the last column of the validation frame before predicting
    test = test[test.columns[:-1]]
    model = DecisionTreeRegressor()
    model.fit(train_independent, train_dependent)
    print("sub_train features:", len(train[0].decision))
    # test has already lost its last column above, so count all remaining columns
    print("validation features:", len(test.columns))
    predicted = model.predict(test) ### Line 58
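To narrow down where the 44 vs. 39 mismatch comes from, it may help to compare the row widths of the training and validation features just before model.predict is called. The following is only a minimal sketch: the helpers feature_widths and check_feature_match are hypothetical names that do not exist in reconfig.py, and the sketch assumes both inputs can be passed as plain lists of rows.

```python
def feature_widths(rows):
    """Return the set of distinct row lengths in an iterable of feature rows."""
    return {len(row) for row in rows}


def check_feature_match(train_rows, test_rows):
    """Hypothetical helper: return the shared feature count, or raise ValueError.

    Raises if either side is empty or has rows of differing lengths, or if
    the training and test rows do not have the same number of features.
    """
    train_widths = feature_widths(train_rows)
    test_widths = feature_widths(test_rows)
    if len(train_widths) != 1 or len(test_widths) != 1:
        raise ValueError(
            "empty or ragged input: train widths %s, test widths %s"
            % (sorted(train_widths), sorted(test_widths))
        )
    (n_train,) = train_widths
    (n_test,) = test_widths
    if n_train != n_test:
        raise ValueError(
            "feature count mismatch: train has %d, test has %d" % (n_train, n_test)
        )
    return n_train
```

Inside carts, this could be called as check_feature_match(train_independent, test.values.tolist()) right before model.predict(test), assuming test is a pandas DataFrame as in the excerpt above. If it raises, the sub_train file and the validation file were most likely parsed with different column sets.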