EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Capital letters in target names #259

Closed JJ closed 8 years ago

JJ commented 8 years ago

I'm trying to run this on a regression problem:

tpot -is , -target Real -mode regression -v 2 -o tpot_traffic_pipeline_2.py blended_1021.csv

The input file looks like this:

DoW,Mon,Hour,Detected,Real
Thursday,10,1,1,129

It returns this error:

TPOT settings:
CROSSOVER_RATE  =   0.05
GENERATIONS =   100
INPUT_FILE  =   blended_1021.csv
INPUT_SEPARATOR =   ,
MAX_TIME_MINS   =   None
MUTATION_RATE   =   0.9
NUM_CV_FOLDS    =   3
OUTPUT_FILE =   tpot_traffic_pipeline_2.py
POPULATION_SIZE =   100
RANDOM_STATE    =   None
SCORING_FN  =   mean_squared_error
TARGET_NAME =   Real
TPOT_MODE   =   regression
VERBOSITY   =   2

Traceback (most recent call last):
  File "/usr/local/bin/tpot", line 9, in <module>
    load_entry_point('TPOT==0.6.1', 'console_scripts', 'tpot')()
  File "/usr/local/lib/python2.7/dist-packages/tpot/driver.py", line 182, in main
    raise ValueError('The provided data file does not seem to have a target column. '
ValueError: The provided data file does not seem to have a target column. Please make sure to specify the target column using the -target parameter.

Context of the issue

It works correctly if the target name is written without the initial capital (real):

CROSSOVER_RATE  =   0.05
GENERATIONS =   100
INPUT_FILE  =   blended_1021.csv
INPUT_SEPARATOR =   ,
MAX_TIME_MINS   =   None
MUTATION_RATE   =   0.9
NUM_CV_FOLDS    =   3
OUTPUT_FILE =   tpot_traffic_pipeline_2.py
POPULATION_SIZE =   100
RANDOM_STATE    =   None
SCORING_FN  =   mean_squared_error
TARGET_NAME =   real
TPOT_MODE   =   regression
VERBOSITY   =   2

However, the resulting Python file still refers to "class":

import numpy as np

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBRegressor

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    XGBRegressor(learning_rate=0.7, max_depth=1, min_child_weight=11, n_estimators=500, subsample=0.4)
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

Process to reproduce the issue

It should be reproducible from the command line by giving the target column any name that starts with a capital letter.

Expected result

Target names containing capital letters should be accepted.

Current result

TPOT is unable to use target names that contain a capital letter.

Possible fix

Parameter processing, perhaps? I really have no idea; I'm not very good with Python.

rhiever commented 8 years ago

This is one of those, "Well, it shouldn't be doing that..." bugs. Will look into it soon. Thank you for filing it!

JJ commented 8 years ago

Thanks!


slcott commented 8 years ago

I think this is the problem on line 180 of driver.py:

    input_data = np.recfromcsv(args.INPUT_FILE, delimiter=args.INPUT_SEPARATOR, dtype=np.float64)

And the fix:

    input_data = np.recfromcsv(args.INPUT_FILE, delimiter=args.INPUT_SEPARATOR, dtype=np.float64, case_sensitive=True)

Background: http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html#validating-names

I submitted a PR: https://github.com/rhiever/tpot/pull/264
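
To illustrate the diagnosis above, here is a minimal sketch of how the name handling behaves. It uses np.genfromtxt directly rather than np.recfromcsv, on the assumption that recfromcsv forwards to genfromtxt with case_sensitive='lower' as its default. The in-memory CSV and the column_names helper are hypothetical, and the non-numeric DoW column from the original file is left out so that every field parses as a float.

```python
import io

import numpy as np

# Hypothetical stand-in for blended_1021.csv (numeric columns only).
CSV_TEXT = "Mon,Hour,Detected,Real\n10,1,1,129\n11,2,0,131\n"


def column_names(case_sensitive):
    """Return the field names genfromtxt assigns for a given case_sensitive setting."""
    data = np.genfromtxt(
        io.StringIO(CSV_TEXT),
        delimiter=",",
        names=True,            # read field names from the header row
        dtype=np.float64,
        case_sensitive=case_sensitive,
    )
    return data.dtype.names


# 'lower' is assumed to match the recfromcsv default: header names are lowercased.
lowered = column_names("lower")
# True keeps the header names exactly as written in the file.
preserved = column_names(True)

print(lowered)
print(preserved)
print("Real" in lowered, "Real" in preserved)
```

If the lowercasing assumption holds, a lookup for 'Real' misses in the first case and succeeds in the second, which would be consistent with the ValueError reported above and with the proposed case_sensitive=True fix.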

JJ commented 8 years ago

:+1:

rhiever commented 8 years ago

Should be fixed in the 0.6.3 release (just went out now).

JJ commented 8 years ago

:+1:

