EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TypeError At 5th Generation #248

Closed by hexhead 7 years ago

hexhead commented 7 years ago

Submitted a run of TPOT on a genetics data set of binary predictors and a binary response to our cluster. The data set is 89 subjects/rows by 3925 predictors.

Context of the issue

Command-line version of the TPOT algorithm.

Process to reproduce the issue

tpot [redacted]_maf_ld_filtered.csv -is , -o tpot_exported_pipeline_harkness.py -g 5 -p 20 -cv 5 -s 42 -v 2

Expected result

Classification and solution summary(ies).

Current result

The run stops with a Python TypeError at the 5th generation.

Possible fix

Not sure. Sorry but this data is not available for sharing.

[Screenshot: tpot_typeerror]

rhiever commented 7 years ago

What version of TPOT are you running? Run on the command line:

python -c "import tpot; print('tpot %s' % tpot.__version__)"

My first thought is that this is a data issue. Is the column being predicted discrete or continuous? Are all of the columns in "scikit-learn-compatible" format (i.e., all numerical)?
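A quick pre-flight check along these lines could answer both questions before submitting a run. This is only a minimal sketch using the standard library: the function name `check_csv` and the default target name `class` are hypothetical choices for illustration, not part of TPOT.

```python
import csv
import io


def check_csv(handle, target="class", delimiter=","):
    """Report non-numeric columns and the distinct target values in a CSV.

    `target` is an assumed column name; pass whatever your file uses.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    header = reader.fieldnames or []
    non_numeric = set()
    target_values = set()
    for row in reader:
        for name, value in row.items():
            # DictReader stores surplus fields under the key None.
            label = name if name is not None else "<extra fields>"
            try:
                float(value)
            except (TypeError, ValueError):
                non_numeric.add(label)
        if target in row:
            target_values.add(row[target])
    return {
        "has_target": target in header,
        "non_numeric": sorted(non_numeric),
        "target_values": sorted(target_values),
    }


# Tiny in-memory example; replace with open("your_file.csv", newline="").
sample = io.StringIO("snp1,snp2,class\n0,1,1\n1,NA,0\n")
report = check_csv(sample)
print(report)
```

A non-empty `non_numeric` list points at columns that need recoding, and a small set of `target_values` suggests a discrete response.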

hexhead commented 7 years ago

bwhite@login:~/analysis/TPOT_tests$ python -c "import tpot; print('tpot %s' % tpot.__version__)"
tpot 0.5.2

As I mentioned in the issue description:

Submitted a run of TPOT on a genetics data set of binary predictors and a binary response to our cluster. The data set is 89 subjects/rows by 3925 predictors.

I missed the scikit-learn requirement for the data, but looking through all your examples and scikit-learn docs, I can make the data compatible: http://skll.readthedocs.io/en/latest/run_experiment.html

Thanks!

hexhead commented 7 years ago

Sorry, the predictors are SNPs, so integers. I also want to test RNA-Seq, which is counts/integers, with both discrete and continuous responses.

hexhead commented 7 years ago

MDR-SampleData runs fine, so my data should too. I'll try it again.

bwhite@login:~/analysis/TPOT_tests$ tail -f slurm-2386.out
Number of parallel processes: 32
Running TPOT example...

TPOT settings:
CROSSOVER_RATE = 0.05
GENERATIONS = 5
INPUT_FILE = MDR-SampleData.csv
INPUT_SEPARATOR = ,
MUTATION_RATE = 0.9
NUM_CV_FOLDS = 5
OUTPUT_FILE = tpot_exported_pipeline.py
POPULATION_SIZE = 20
RANDOM_STATE = 42
SCORING_FN = balanced_accuracy
VERBOSITY = 2

Generation 1 - Current best internal CV score: 0.552890116002
Generation 2 - Current best internal CV score: 0.556965576831
Generation 3 - Current best internal CV score: 0.589809609884
Generation 4 - Current best internal CV score: 0.624165243127
Generation 5 - Current best internal CV score: 0.643017539329

Best pipeline: DecisionTreeClassifier(RandomizedPCA(MinMaxScaler(StandardScaler(input_matrix)), 2))

Training accuracy: 1.0
Holdout accuracy: 0.678571428571

hexhead commented 7 years ago

Resolved. The problem was the response column name in the CSV file: "Class" versus "class".
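For anyone hitting the same thing, a small helper can flag a response column whose header differs from the expected name only by case. This is a minimal sketch, not TPOT code: the function name `find_case_mismatch` is hypothetical, and the expected name `class` is an assumption for illustration, since the thread does not say which spelling was the working one.

```python
import csv
import io


def find_case_mismatch(handle, expected="class", delimiter=","):
    """Return header names equal to `expected` only after ignoring
    case and surrounding whitespace."""
    header = next(csv.reader(handle, delimiter=delimiter), [])
    return [
        name for name in header
        if name != expected and name.strip().lower() == expected.lower()
    ]


# Tiny in-memory example; replace with open("your_file.csv", newline="").
sample = io.StringIO("snp1,snp2,Class\n0,1,1\n")
print(find_case_mismatch(sample))
```

An empty list means no header is a near-miss of the expected name; any names returned are candidates for renaming before the run.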

rhiever commented 7 years ago

Happy to hear it!