[ValueError]: Number of features of the model must match the input.

When running reconfig.py on large datasets such as JavaGC and VP9 (more than 100,000 samples), an unexpected ValueError is raised. Here is the error trace:

Traceback (most recent call last):
  File "reconfig.py", line 882, in <module>
    reconfig()
  File "reconfig.py", line 856, in reconfig
    predict_on_validation_set()
  File "reconfig.py", line 142, in predict_on_validation_set
    cart_predicted = carts(sub_train_set_rank, dataset_to_test)
  File "reconfig.py", line 58, in carts
    predicted = model.predict(test)
  File "C:\Users\yongfeng\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\tree\tree.py", line 430, in predict
    X = self._validate_X_predict(X, check_input)
  File "C:\Users\yongfeng\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\tree\tree.py", line 402, in _validate_X_predict
    % (self.n_features_, n_features))
ValueError: Number of features of the model must match the input. Model n_features is 44 and input n_features is 39

The function predict_on_validation_set():

def predict_on_validation_set():
    """
    Note: use the sub_train set to predict on the validation set,
          save the results into "../temp_data/ltr_trainset/${project}/ltr_trainset_XX.csv"
    """
    datafolder = "../raw_data/"
    trainfolder = "../parse_data/sub_train/"
    split_datafolder = "../parse_data/data_split/"
    ...
    sub_train_set_rank_raw = pd.read_csv(sub_train_data[fileindex])
    sub_train_set_rank = read_data(sub_train_set_rank_raw)

    validation = update_data(validation_set)
    dataset_to_test = validation
    print("sub_train:", sub_train_data[fileindex], " validation:", csvfile)
    cart_predicted = carts(sub_train_set_rank, dataset_to_test)   ### Line 142
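One way to narrow this down might be to compare the column headers of the sub_train file and the validation file before calling carts(). The sketch below is only an illustration: it assumes both inputs are CSV text with a header row, and the helper name header_diff and the sample column names (f1, f2, f3, perf) are hypothetical, not taken from reconfig.py.

```python
import csv
import io


def header_diff(train_csv_text, validation_csv_text):
    """Return (only_in_train, only_in_validation) column names, in file order.

    Hypothetical helper: it reads only the first (header) row of each CSV text.
    """
    train_cols = next(csv.reader(io.StringIO(train_csv_text)))
    valid_cols = next(csv.reader(io.StringIO(validation_csv_text)))
    train_set, valid_set = set(train_cols), set(valid_cols)
    only_train = [c for c in train_cols if c not in valid_set]
    only_valid = [c for c in valid_cols if c not in train_set]
    return only_train, only_valid


# Made-up headers, only for illustration.
train_text = "f1,f2,f3,perf\n1,0,1,3.5\n"
valid_text = "f1,f3,perf\n1,1,3.2\n"

only_train, only_valid = header_diff(train_text, valid_text)
print("only in sub_train:", only_train)
print("only in validation:", only_valid)
```

If the two lists are empty but the counts still differ, the mismatch is more likely introduced later, for example inside read_data() or update_data().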

The function carts(train, test):

def carts(train, test):
    """
    Note: use CART to predict performance in test set
    """
    train_independent = [t.decision for t in train]
    train_dependent = [t.objective[-1] for t in train]
    test = test[test.columns[:-1]]  # drop the last (objective) column
    model = DecisionTreeRegressor()
    model.fit(train_independent, train_dependent)
    print("sub_train features:", len(train[0].decision))
    # test has already lost its last column above, so count all remaining columns
    print("validation features:", len(test.columns))
    predicted = model.predict(test)       ### Line 58
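A guard placed before model.predict(test) could report the mismatch in terms of the project's own data instead of failing inside scikit-learn. This is a minimal sketch under assumptions: check_feature_count is a hypothetical helper, train_independent is taken to be a list of equal-length feature lists (as built from t.decision above), and test_columns stands for the feature column names of the test frame, e.g. list(test.columns).

```python
def check_feature_count(train_independent, test_columns):
    """Return the shared feature count, or raise a descriptive ValueError.

    Hypothetical helper: compares the width of the training rows with the
    number of feature columns in the test data.
    """
    train_widths = {len(row) for row in train_independent}
    if len(train_widths) != 1:
        raise ValueError(
            "training rows have inconsistent widths: %s" % sorted(train_widths)
        )
    n_train = next(iter(train_widths))
    n_test = len(test_columns)
    if n_train != n_test:
        raise ValueError(
            "sub_train has %d features but validation has %d" % (n_train, n_test)
        )
    return n_train


# Made-up data reproducing the 44 vs 39 mismatch from the traceback.
train_rows = [[0] * 44, [1] * 44]
test_cols = ["c%d" % i for i in range(39)]
try:
    check_feature_count(train_rows, test_cols)
except ValueError as err:
    print(err)
```

Checking the set of row widths first also catches the case where individual training rows differ in length, which would point at read_data() rather than at the validation file.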
