hexhead closed this issue 7 years ago
What version of TPOT are you running? Run on the command line:
python -c "import tpot; print('tpot %s' % tpot.__version__)"
My first thought is that this is a data issue. Is the column being predicted discrete or continuous? Are all of the columns in "scikit-learn-compatible" format (i.e., all numerical)?
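A quick way to check the "all numerical" question before submitting a job is to scan the CSV for values that do not parse as numbers. The following is only a minimal sketch using the Python standard library; the function name `non_numeric_columns` and the sample data are made up for illustration and are not part of TPOT.

```python
import csv
import io


def non_numeric_columns(csv_text, delimiter=","):
    """Return names of columns holding at least one value float() cannot parse."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    bad = []
    for row in reader:
        for name, value in row.items():
            if name in bad:
                continue  # already reported this column
            try:
                float(value)
            except (TypeError, ValueError):
                # Empty cells, text codes such as "AA", or missing fields end up here.
                bad.append(name)
    return bad


# Hypothetical sample: snp2 contains a text genotype and an empty cell.
sample = "snp1,snp2,class\n0,1,0\n2,AA,1\n1,,0\n"
print(non_numeric_columns(sample))
```

For a file on disk, the same function can be fed `open(path).read()`. An empty result suggests every cell is numeric, though it says nothing about whether the response column is discrete or continuous.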
bwhite@login:~/analysis/TPOT_tests$ python -c "import tpot; print('tpot %s' % tpot.__version__)"
tpot 0.5.2
As I mentioned in the issue description:
Submitted a run of TPOT on a genetics data set of binary predictors and a binary response to our cluster. The data set is 89 subjects/rows by 3925 predictors.
I missed the scikit-learn requirement for the data, but looking through all your examples and scikit-learn docs, I can make the data compatible: http://skll.readthedocs.io/en/latest/run_experiment.html
Thanks!
Sorry, the predictors are SNPs, so they are integers. I also want to test RNA-Seq data, which consists of integer counts, with both discrete and continuous responses.
MDR-SampleData runs fine, so my data should too. I'll try it again.
bwhite@login:~/analysis/TPOT_tests$ tail -f slurm-2386.out
Number of parallel processes: 32
Running TPOT example...

TPOT settings:
CROSSOVER_RATE = 0.05
GENERATIONS = 5
INPUT_FILE = MDR-SampleData.csv
INPUT_SEPARATOR = ,
MUTATION_RATE = 0.9
NUM_CV_FOLDS = 5
OUTPUT_FILE = tpot_exported_pipeline.py
POPULATION_SIZE = 20
RANDOM_STATE = 42
SCORING_FN = balanced_accuracy
VERBOSITY = 2

Generation 1 - Current best internal CV score: 0.552890116002
Generation 2 - Current best internal CV score: 0.556965576831
Generation 3 - Current best internal CV score: 0.589809609884
Generation 4 - Current best internal CV score: 0.624165243127
Generation 5 - Current best internal CV score: 0.643017539329

Best pipeline: DecisionTreeClassifier(RandomizedPCA(MinMaxScaler(StandardScaler(input_matrix)), 2))

Training accuracy: 1.0
Holdout accuracy: 0.678571428571
Resolved. The problem was "Class" versus "class" as the response column name in the CSV file.
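For anyone hitting the same thing, a case mismatch in the header can be caught before submitting a job. The sketch below is an illustration only: the helper `find_header_case_mismatch` is hypothetical, and the `expected` name is an assumption to be set to whatever spelling your TPOT version requires (the thread does not say which of the two it is).

```python
import csv
import io


def find_header_case_mismatch(csv_text, expected, delimiter=","):
    """Return the header that equals `expected` only when case is ignored, else None."""
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if expected in header:
        return None  # exact match, nothing to fix
    for name in header:
        if name.lower() == expected.lower():
            return name  # same word, different capitalization
    return None  # no column resembling `expected` at all


# Hypothetical header with a capitalized response column.
sample = "snp1,snp2,Class\n0,1,0\n"
print(find_header_case_mismatch(sample, expected="class"))
```

If the function returns a name, renaming that column in the CSV header to the expected spelling should be enough.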
Happy to hear it!
Submitted a run of TPOT on a genetics data set of binary predictors and a binary response to our cluster. The data set is 89 subjects/rows by 3925 predictors.
Context of the issue
Command-line version of the TPOT algorithm.
Process to reproduce the issue
tpot [redacted]_maf_ld_filtered.csv -is , -o tpot_exported_pipeline_harkness.py -g 5 -p 20 -cv 5 -s 42 -v 2
Expected result
Classification and solution summary(ies).
Current result
Python trapped error message.
Possible fix
Not sure. Sorry but this data is not available for sharing.
[Screenshot: TypeError]