ClimbsRocks / auto_ml

[UNMAINTAINED] Automated machine learning for analytics & production
http://auto-ml.readthedocs.io
MIT License

Problem with scoring #150

Closed mglowacki100 closed 7 years ago

mglowacki100 commented 7 years ago

I tried to change the scoring using the method you described in #148:

import pandas as pd
from sklearn.model_selection import train_test_split
from auto_ml import Predictor

d=10
print("file: "+str(d)+"\n")
train = pd.read_csv('input2/numerai_training_data_'+str(d)+'.csv')

train, test = train_test_split(train, test_size=0.25, random_state=42)

column_descriptions = {
    'target': 'output'
}

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=column_descriptions)
ml_predictor.train(train, compute_power=2, scoring='log_loss')
test_score = ml_predictor.score(test, test.target)
print(test_score)

But I get the following error:

>>> runfile('/home/mglowacki/Desktop/NAI7/automlNAI2.py', wdir='/home/mglowacki/Desktop/NAI7')
file: 10

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 81, in execfile
    builtins.execfile(filename, *where)
  File "/home/mglowacki/Desktop/NAI7/automlNAI2.py", line 16, in <module>
    ml_predictor.train(train, compute_power=2, scoring='log_loss')
TypeError: train() got an unexpected keyword argument 'scoring'

Without scoring, it works fine.

As input I use data from numer.ai. I looked at http://auto-ml.readthedocs.io/en/latest/api_docs_for_geeks.html but didn't find any information on how to change the scoring metric, and scoring is not listed in the API docs. On the other hand, scoring does appear in the signature of train in the source:

def train(self, raw_training_data, user_input_func=None, optimize_entire_pipeline=False, optimize_final_model=None, write_gs_param_results_to_file=True, perform_feature_selection=None, verbose=True, X_test=None, y_test=None, print_training_summary_to_viewer=True, ml_for_analytics=True, only_analytics=False, compute_power=3, take_log_of_y=None, model_names=None, perform_feature_scaling=True, ensembler=None, calibrate_final_model=False, _include_original_X=False, _scorer=None, scoring=None):
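A quick way to confirm this kind of mismatch between the source on GitHub and the installed release is to inspect the signature of the function that is actually installed. The following is a minimal sketch, assuming Python 3 (on the Python 2.7 shown in the traceback, inspect.getargspec plays a similar role). The helper accepts_keyword and the two stand-in functions are hypothetical and not part of auto_ml:

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    for param in inspect.signature(func).parameters.values():
        # A **kwargs catch-all accepts any keyword.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        # Otherwise the name must be an explicit, keyword-capable parameter.
        if param.name == name and param.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        ):
            return True
    return False


# Hypothetical stand-ins for an older and a newer release of a train() method.
def train_old(raw_training_data, compute_power=3):
    pass


def train_new(raw_training_data, compute_power=3, scoring=None):
    pass


old_ok = accepts_keyword(train_old, "scoring")
new_ok = accepts_keyword(train_new, "scoring")
print(old_ok, new_ok)
```

With auto_ml installed, the equivalent check would be accepts_keyword(Predictor.train, 'scoring').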

ClimbsRocks commented 7 years ago

@mglowacki100 : i probably haven't updated the pypi release yet. you're working on the bleeding edge here :) i'll update that momentarily, then i'd love to hear your feedback. sorry for mentioning it in the other thread before it was included on pip!

ClimbsRocks commented 7 years ago

@mglowacki100 the release with ml_predictor.train(blah_blah, scoring='log_loss') is out on pip! let me know what (if anything) you encounter
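For context on what scoring='log_loss' measures: log loss is the mean negative log-likelihood of the true labels under the predicted probabilities, so lower is better. Below is a minimal pure-Python sketch for the binary case. It is not auto_ml's or scikit-learn's implementation; the function name and the clipping constant eps are assumptions made for illustration:

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels given P(label == 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that log(0) can never occur on an overconfident prediction.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [0, 1, 1, 0]
uninformed = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])
confident = binary_log_loss(labels, [0.1, 0.9, 0.8, 0.2])
print(uninformed, confident)
```

Predicting 0.5 for every row gives ln 2 (about 0.693), which is a useful baseline: a model has learned something only if its log loss is below that.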

ClimbsRocks commented 7 years ago

also, i was working on a quick example for numer.ai last night. here's what i had so far:

from auto_ml import Predictor

import datetime
import dill
import os
import pandas as pd
from sklearn.model_selection import train_test_split

df_train = pd.read_csv(os.path.join('numerai_datasets', 'numerai_training_data.csv'))
# Split out 10% of our data to calibrate our probability predictions on
df_train, df_calibrate = train_test_split(df_train, test_size=0.1)

col_descs = {
    'target': 'output'
}

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=col_descs)

ml_predictor.train(df_train, optimize_final_model=False, perform_feature_selection=False, perform_feature_scaling=False, X_test=df_calibrate, y_test=df_calibrate.target, calibrate_final_model=True, scoring='log_loss')

file_name = ml_predictor.save('numerai_model_' + str(datetime.datetime.now()))

with open(file_name, 'rb') as read_file:
    trained_model = dill.load(read_file)

df_tournament = pd.read_csv(os.path.join('numerai_datasets', 'numerai_tournament_data.csv'))

predictions = trained_model.predict_proba(df_tournament)

print(predictions)

obviously, this assumes you're already in the directory where you've downloaded the "numerai_datasets" folder. i think the major thing that's left is to format the output the way numer.ai expects it, and then save to csv. but i'd love to hear any improvements you have on this script!

it's also using calibrate_final_model as a param in .train(), which is another new feature that's on the bleeding edge, and as such, undocumented at the moment. but if you pass in X_test, y_test, and calibrate_final_model=True, it should run some calibration on the probability predictions. not sure how useful that is on the numer.ai dataset, but hey, that's why we still have humans in the loop!
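The remaining step mentioned here, shaping the predictions and saving them to CSV, could look roughly like the sketch below. It rests on stated assumptions: the column names id and probability are placeholders that this thread does not confirm (check the format numer.ai actually expects), and each prediction row is assumed to be a [P(class 0), P(class 1)] pair. The function name write_submission is hypothetical. It uses only the standard library so that it runs on its own:

```python
import csv
import io


def write_submission(out_file, ids, probabilities,
                     id_col="id", prob_col="probability"):
    """Write one row per id with the predicted probability of class 1.

    id_col and prob_col are placeholder column names (assumptions).
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow([id_col, prob_col])
    for row_id, probs in zip(ids, probabilities):
        # probs is assumed to be [P(class 0), P(class 1)].
        writer.writerow([row_id, probs[1]])


# Made-up example data standing in for tournament ids and predict_proba output.
example_ids = [101, 102, 103]
example_predictions = [[0.6, 0.4], [0.25, 0.75], [0.5, 0.5]]

buffer = io.StringIO()
write_submission(buffer, example_ids, example_predictions)
submission_text = buffer.getvalue()
print(submission_text)
```

In the script above, the ids would come from whichever column of df_tournament identifies each row, and the probabilities from trained_model.predict_proba(df_tournament). Passing a real file handle (on Python 3, opened with newline='') instead of the StringIO buffer writes the CSV to disk.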

mglowacki100 commented 7 years ago

Thanks a lot! It seems to work fine :+1:

During the auto_ml pip update, I got one error on the last line. I don't know if this is meaningful:

mglowacki@mglowacki:~$ sudo -H pip install auto_ml --upgrade
[sudo] password for mglowacki: 
Collecting auto_ml
  Downloading auto_ml-1.9.1-py2.py3-none-any.whl (44kB)
    100% |████████████████████████████████| 51kB 435kB/s 
Collecting scikit-learn (from auto_ml)
  Downloading scikit_learn-0.18.1-cp27-cp27mu-manylinux1_x86_64.whl (11.6MB)
    100% |████████████████████████████████| 11.7MB 149kB/s 
Collecting scipy (from auto_ml)
  Downloading scipy-0.18.1-cp27-cp27mu-manylinux1_x86_64.whl (40.3MB)
    100% |████████████████████████████████| 40.3MB 47kB/s 
Collecting pandas (from auto_ml)
  Downloading pandas-0.19.1-cp27-cp27mu-manylinux1_x86_64.whl (16.7MB)
    100% |████████████████████████████████| 16.7MB 109kB/s 
Collecting python-dateutil (from auto_ml)
  Downloading python_dateutil-2.6.0-py2.py3-none-any.whl (194kB)
    100% |████████████████████████████████| 194kB 788kB/s 
Requirement already up-to-date: pathos in /usr/local/lib/python2.7/dist-packages (from auto_ml)
Collecting pytz>=2011k (from pandas->auto_ml)
  Downloading pytz-2016.10-py2.py3-none-any.whl (483kB)
    100% |████████████████████████████████| 491kB 256kB/s 
Requirement already up-to-date: numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas->auto_ml)
Requirement already up-to-date: six>=1.5 in /usr/lib/python2.7/dist-packages (from python-dateutil->auto_ml)
Requirement already up-to-date: multiprocess>=0.70.4 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Requirement already up-to-date: pox>=0.2.2 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Requirement already up-to-date: ppft>=1.6.4.5 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Requirement already up-to-date: dill>=0.2.5 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Installing collected packages: scikit-learn, scipy, pytz, python-dateutil, pandas, auto-ml
  Found existing installation: scikit-learn 0.18
    Uninstalling scikit-learn-0.18:
      Successfully uninstalled scikit-learn-0.18
  Found existing installation: scipy 0.18.0rc2
    Uninstalling scipy-0.18.0rc2:
      Successfully uninstalled scipy-0.18.0rc2
  Found existing installation: pytz 2016.6.1
    Uninstalling pytz-2016.6.1:
      Successfully uninstalled pytz-2016.6.1
  Found existing installation: python-dateutil 2.5.3
    Uninstalling python-dateutil-2.5.3:
      Successfully uninstalled python-dateutil-2.5.3
  Found existing installation: pandas 0.18.1
    Uninstalling pandas-0.18.1:
      Successfully uninstalled pandas-0.18.1
  Found existing installation: auto-ml 1.9
    Uninstalling auto-ml-1.9:
      Successfully uninstalled auto-ml-1.9
Successfully installed auto-ml-1.9.1 pandas-0.19.1 python-dateutil-2.6.0 pytz-2016.10 scikit-learn-0.18.1 scipy-0.18.1
Traceback (most recent call last):
  File "/usr/local/bin/pip", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/pip/__init__.py", line 233, in main
    return command.main(cmd_args)
  File "/usr/local/lib/python2.7/dist-packages/pip/basecommand.py", line 252, in main
    pip_version_check(session)
  File "/usr/local/lib/python2.7/dist-packages/pip/utils/outdated.py", line 102, in pip_version_check
    installed_version = get_installed_version("pip")
  File "/usr/local/lib/python2.7/dist-packages/pip/utils/__init__.py", line 838, in get_installed_version
    working_set = pkg_resources.WorkingSet()
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 644, in __init__
    self.add_entry(entry)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 700, in add_entry
    for dist in find_distributions(entry, True):
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1949, in find_eggs_in_zip
    if metadata.has_metadata('PKG-INFO'):
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1463, in has_metadata
    return self.egg_info and self._has(self._fn(self.egg_info, name))
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1823, in _has
    return zip_path in self.zipinfo or zip_path in self._index()
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1703, in zipinfo
    return self._zip_manifests.load(self.loader.archive)
  File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1643, in load
    mtime = os.stat(path).st_mtime
OSError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/pytz-2016.6.1-py2.7.egg'

ClimbsRocks commented 7 years ago

huh, that's a weird error. as long as it all works correctly, we'll archive that for now. thanks for sending it over, though. i always like having more data available when trying to debug things!
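One plausible reading of that traceback, offered as a guess and not a confirmed diagnosis: pip's post-install version check builds a pkg_resources.WorkingSet, which walks the entries on sys.path, and one entry still pointed at the pytz-2016.6.1 egg that the upgrade had just removed. A small sketch for listing such dangling entries follows; the helper name missing_path_entries is hypothetical:

```python
import os
import sys


def missing_path_entries(paths):
    """Return the non-empty path entries that no longer exist on disk."""
    return [p for p in paths if p and not os.path.exists(p)]


# Entries on the current interpreter's import path that point nowhere.
stale = missing_path_entries(sys.path)
for entry in stale:
    print("missing:", entry)

# Self-contained demonstration with one real and one made-up path.
fake_egg = os.path.join(os.getcwd(), "no-such-dir-abc123.egg")
demo = missing_path_entries([os.getcwd(), fake_egg, ""])
print(demo)
```

Running this in a fresh interpreter after the upgrade would show whether the entry persists (for example through a leftover line in an easy-install.pth file) or was only stale inside that one pip process.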

Keep filing any other issues you run into, even if they're just usability issues. Or if you have ideas to update the docs with, I'd love a PR or two to make things more obvious.

mglowacki100 commented 7 years ago

I think they have good examples: https://rhiever.github.io/tpot/examples/MNIST_Example/ For sure, if I encounter a problem I'll report it.