Status: Closed (mglowacki100 closed this 7 years ago)
@mglowacki100 : i probably haven't updated the pypi release yet. you're working on the bleeding edge here :) i'll update that momentarily, then i'd love to hear your feedback. sorry for mentioning it in the other thread before it was included on pip!
@mglowacki100 the release with ml_predictor.train(blah_blah, scoring='log_loss') is out on pip! let me know what (if anything) you encounter
also, i was working on a quick example for numer.ai last night. here's what i had so far:
import datetime
import os

import dill
import pandas as pd
from sklearn.model_selection import train_test_split

from auto_ml import Predictor

df_train = pd.read_csv(os.path.join('numerai_datasets', 'numerai_training_data.csv'))
# Split out 10% of our data to calibrate our probability predictions on
df_train, df_calibrate = train_test_split(df_train, test_size=0.1)

# Tell auto_ml which column holds the value we want to predict
col_descs = {
    'target': 'output'
}

ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=col_descs)
ml_predictor.train(df_train, optimize_final_model=False, perform_feature_selection=False, perform_feature_scaling=False, X_test=df_calibrate, y_test=df_calibrate.target, calibrate_final_model=True, scoring='log_loss')

# Save the trained pipeline, then load it back with dill
file_name = ml_predictor.save('numerai_model_' + str(datetime.datetime.now()))
with open(file_name, 'rb') as read_file:
    trained_model = dill.load(read_file)

# Predict class probabilities for the tournament rows
df_tournament = pd.read_csv(os.path.join('numerai_datasets', 'numerai_tournament_data.csv'))
predictions = trained_model.predict_proba(df_tournament)
print(predictions)
obviously, this assumes you're already in the directory where you've downloaded the "numerai_datasets" folder. i think the major thing that's left is to format the output the way numer.ai expects it, and then save to csv. but i'd love to hear any improvements you have on this script!
it's also using calibrate_final_model as a param in .train(), which is another new feature that's on the bleeding edge, and as such, undocumented at the moment. but if you pass in X_test, y_test, and calibrate_final_model=True, it should run some calibration on the probability predictions. not sure how useful that is on the numer.ai dataset, but hey, that's why we still have humans in the loop!
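For anyone following along, here is a rough sketch of the two remaining ideas from the script: calibrating probabilities on a held-out split and shaping the output for a CSV. It does not show what auto_ml does internally. It uses a plain scikit-learn classifier on synthetic data as a stand-in, applies a Platt-style sigmoid fit as one possible calibration method, and the helper names (platt_calibrate, build_submission) and the column names t_id and probability are assumptions made for illustration, so check them against whatever numer.ai currently expects.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def platt_calibrate(raw_probs, y_true):
    """Fit a 1-D logistic regression that maps raw probabilities to calibrated ones."""
    calibrator = LogisticRegression()
    calibrator.fit(np.asarray(raw_probs).reshape(-1, 1), y_true)
    return calibrator


def build_submission(ids, probs, id_col='t_id', prob_col='probability'):
    """Pair each row id with its positive-class probability (column names are assumed)."""
    frame = pd.DataFrame({id_col: np.asarray(ids), prob_col: np.asarray(probs)})
    return frame[[id_col, prob_col]]


# Synthetic stand-in for the numer.ai training and tournament data
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_calibrate, X_tournament, y_calibrate, _ = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Train on one split, then fit the calibrator on rows the model has not seen
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
calibrator = platt_calibrate(model.predict_proba(X_calibrate)[:, 1], y_calibrate)

# Calibrate the positive-class probabilities for the "tournament" rows
raw_tournament = model.predict_proba(X_tournament)[:, 1]
calibrated = calibrator.predict_proba(raw_tournament.reshape(-1, 1))[:, 1]

# Hypothetical row ids; real data would supply its own id column
submission = build_submission(np.arange(len(calibrated)), calibrated)

# With no path, to_csv returns the CSV text; pass a file name to write to disk
csv_text = submission.to_csv(index=False)
```

Keeping the calibration rows separate from the training rows is the same idea as the 10% df_calibrate split in the script above: the calibrator should only see predictions on data the model was not fitted on.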
Thanks a lot! It seems to work fine :+1:
During the auto_ml pip upgrade, I got one error at the very end. I don't know if this is meaningful:
mglowacki@mglowacki:~$ sudo -H pip install auto_ml --upgrade
[sudo] password for mglowacki:
Collecting auto_ml
Downloading auto_ml-1.9.1-py2.py3-none-any.whl (44kB)
100% |████████████████████████████████| 51kB 435kB/s
Collecting scikit-learn (from auto_ml)
Downloading scikit_learn-0.18.1-cp27-cp27mu-manylinux1_x86_64.whl (11.6MB)
100% |████████████████████████████████| 11.7MB 149kB/s
Collecting scipy (from auto_ml)
Downloading scipy-0.18.1-cp27-cp27mu-manylinux1_x86_64.whl (40.3MB)
100% |████████████████████████████████| 40.3MB 47kB/s
Collecting pandas (from auto_ml)
Downloading pandas-0.19.1-cp27-cp27mu-manylinux1_x86_64.whl (16.7MB)
100% |████████████████████████████████| 16.7MB 109kB/s
Collecting python-dateutil (from auto_ml)
Downloading python_dateutil-2.6.0-py2.py3-none-any.whl (194kB)
100% |████████████████████████████████| 194kB 788kB/s
Requirement already up-to-date: pathos in /usr/local/lib/python2.7/dist-packages (from auto_ml)
Collecting pytz>=2011k (from pandas->auto_ml)
Downloading pytz-2016.10-py2.py3-none-any.whl (483kB)
100% |████████████████████████████████| 491kB 256kB/s
Requirement already up-to-date: numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas->auto_ml)
Requirement already up-to-date: six>=1.5 in /usr/lib/python2.7/dist-packages (from python-dateutil->auto_ml)
Requirement already up-to-date: multiprocess>=0.70.4 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Requirement already up-to-date: pox>=0.2.2 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Requirement already up-to-date: ppft>=1.6.4.5 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Requirement already up-to-date: dill>=0.2.5 in /usr/local/lib/python2.7/dist-packages (from pathos->auto_ml)
Installing collected packages: scikit-learn, scipy, pytz, python-dateutil, pandas, auto-ml
Found existing installation: scikit-learn 0.18
Uninstalling scikit-learn-0.18:
Successfully uninstalled scikit-learn-0.18
Found existing installation: scipy 0.18.0rc2
Uninstalling scipy-0.18.0rc2:
Successfully uninstalled scipy-0.18.0rc2
Found existing installation: pytz 2016.6.1
Uninstalling pytz-2016.6.1:
Successfully uninstalled pytz-2016.6.1
Found existing installation: python-dateutil 2.5.3
Uninstalling python-dateutil-2.5.3:
Successfully uninstalled python-dateutil-2.5.3
Found existing installation: pandas 0.18.1
Uninstalling pandas-0.18.1:
Successfully uninstalled pandas-0.18.1
Found existing installation: auto-ml 1.9
Uninstalling auto-ml-1.9:
Successfully uninstalled auto-ml-1.9
Successfully installed auto-ml-1.9.1 pandas-0.19.1 python-dateutil-2.6.0 pytz-2016.10 scikit-learn-0.18.1 scipy-0.18.1
Traceback (most recent call last):
File "/usr/local/bin/pip", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/dist-packages/pip/__init__.py", line 233, in main
return command.main(cmd_args)
File "/usr/local/lib/python2.7/dist-packages/pip/basecommand.py", line 252, in main
pip_version_check(session)
File "/usr/local/lib/python2.7/dist-packages/pip/utils/outdated.py", line 102, in pip_version_check
installed_version = get_installed_version("pip")
File "/usr/local/lib/python2.7/dist-packages/pip/utils/__init__.py", line 838, in get_installed_version
working_set = pkg_resources.WorkingSet()
File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 644, in __init__
self.add_entry(entry)
File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 700, in add_entry
for dist in find_distributions(entry, True):
File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1949, in find_eggs_in_zip
if metadata.has_metadata('PKG-INFO'):
File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1463, in has_metadata
return self.egg_info and self._has(self._fn(self.egg_info, name))
File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1823, in _has
return zip_path in self.zipinfo or zip_path in self._index()
File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1703, in zipinfo
return self._zip_manifests.load(self.loader.archive)
File "/usr/local/lib/python2.7/dist-packages/pip/_vendor/pkg_resources/__init__.py", line 1643, in load
mtime = os.stat(path).st_mtime
OSError: [Errno 2] No such file or directory: '/usr/local/lib/python2.7/dist-packages/pytz-2016.6.1-py2.7.egg'
huh, that's a weird error. as long as it all works correctly, we'll archive that for now. thanks for sending over though- i always like having more data available when trying to debug things!
Keep filing any other issues you run into, even if they're just usability issues. Or if you have ideas to update the docs with, I'd love a PR or two to make things more obvious.
I think TPOT has good examples: https://rhiever.github.io/tpot/examples/MNIST_Example/ For sure, if I encounter a problem I'll report it.
I've tried to change the scoring using the method you describe in #148:
But I get the following error:
Without scoring, it works fine.
As input I use data from numer.ai. I've looked at http://auto-ml.readthedocs.io/en/latest/api_docs_for_geeks.html but I didn't find any info on how to change the scoring/metric, and there is no scoring parameter in the API docs. On the other hand, there is a scoring argument in the signature of train:
def train(self, raw_training_data, user_input_func=None, optimize_entire_pipeline=False, optimize_final_model=None, write_gs_param_results_to_file=True, perform_feature_selection=None, verbose=True, X_test=None, y_test=None, print_training_summary_to_viewer=True, ml_for_analytics=True, only_analytics=False, compute_power=3, take_log_of_y=None, model_names=None, perform_feature_scaling=True, ensembler=None, calibrate_final_model=False, _include_original_X=False, _scorer=None, scoring=None):
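For context on the metric being requested, here is a minimal sketch of binary log loss computed by hand with NumPy. It only illustrates what a 'log_loss' score measures; it is not taken from auto_ml, and the function name binary_log_loss and the clipping constant eps are assumptions made for this example.

```python
import numpy as np


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for predictions of exactly 0 or 1
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


labels = [1, 0, 1]
confident_right = binary_log_loss(labels, [0.9, 0.1, 0.8])
uninformative = binary_log_loss(labels, [0.5, 0.5, 0.5])  # equals ln(2), about 0.693
confident_wrong = binary_log_loss(labels, [0.1, 0.9, 0.2])

print(confident_right, uninformative, confident_wrong)
```

Lower is better: predicting 0.5 everywhere scores ln(2), confident correct predictions score below that, and confident wrong predictions are penalised heavily, which is why calibrated probabilities matter for this metric.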