Closed akodate closed 8 years ago
thanks for filing! i do know that issue, and thought i'd squashed it already. give me a minute to check on that, and fix it if it's still bugging.
yep, found the bug. i wasn't thorough enough when squashing it last time. i'll have a fix out shortly.
Thanks, I'm looking forward to being able to try out auto_ml.
just pushed the fix! i'll probably publish to pypi later tonight, along with a slew of other updates i'm making. but you can pull down from github for now.
thanks again for filing this bug!
i'm honestly surprised the test_script runs at all. i haven't touched that in months, and the project's gone through some pretty significant performance tuning and new feature additions since the initial launch.
if you're up for it, i'd love a copy of your code to use as a quick example to show people how it's used! it shouldn't take nearly as many lines of code as test_script.py makes it seem.
For my initial attempt, I was just using the example you provided in the documentation (rather than the code in test_script.py):
from auto_ml import Predictor
col_desc_dictionary = {'target': 'output'}
ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions=col_desc_dictionary)
ml_predictor.train(df_dict)
ml_predictor.predict(tournament_dict)
Thanks to the build you just pushed, the code works for me now (although I'm encountering trouble with other features like predict_proba; you may see me raise another issue).
If you'd like to see my equivalent for test_script.py, I'd be happy to provide it to you. It should be quite short, as you predicted.
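[Editor's note: for context on the predict_proba call mentioned above, scikit-learn classifiers return one column of probabilities per class, and auto_ml presumably follows the same convention; that is an assumption here, not confirmed by the thread. A minimal sketch of that shape using plain scikit-learn:]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny toy problem: one numeric feature, binary target
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# one row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X)
print(proba.shape)  # (4, 2)
```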
ooh- i'd love to hear the issue you're running into with predict_proba! and to see your example script. thanks for the feedback so far.
My example script looks a bit like this:
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was deprecated and later removed
from auto_ml import Predictor
df = pd.read_csv('your/path/to/numerai_training_data.csv')
training_data, testing_data = train_test_split(df, test_size=.2)
ml_predictor = Predictor(type_of_estimator='classifier', column_descriptions={'target': 'output'})
ml_predictor.train(training_data)
X_test = testing_data.drop('target', axis=1)
y_test = testing_data['target']
print(ml_predictor.score(X_test, y_test))
I didn't even know auto_ml could take dataframes until I saw test_script.py—that seems to be a (very useful) undocumented feature.
Unfortunately I only get a score of about -0.25 on the current numerai training data (even when I change the test set size), but if you have any suggestions I'd be very interested.
@akodate: ah, crap- i hadn't realized that most people are probably interpreting the score for classifiers as accuracy, rather than the brier-score-loss.
i've been wanting to improve our scoring logging for a little while now, but this gives me all the impetus i need. i'll try to get that clarification pushed this weekend!
the numer.ai dataset is a bunch of stocks, so every percentage point of accuracy above 50% is a huge bump. but that said, you should be seeing some improvement above 50%. again though, the default scoring metric reported on is brier-score-loss, not accuracy.
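[Editor's note: assuming the reported −0.25 is the brier score loss with its sign flipped (the maintainer says the default metric is brier-score-loss, and scikit-learn's scoring convention negates losses so that greater is better), the magnitude is telling: a model that always predicts a probability of 0.5 incurs a loss of exactly 0.25, so a score near −0.25 suggests near-coin-flip probabilities. A small illustration with scikit-learn's metric:]

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# brier score loss = mean squared gap between the predicted probability and the 0/1 outcome
y_true = np.array([0, 1, 1, 0, 1, 0])

# a model that always predicts 0.5 incurs a loss of exactly 0.25 ...
coin_flip = np.full(6, 0.5)
loss_coin = brier_score_loss(y_true, coin_flip)
print(loss_coin)  # 0.25

# ... while more confident, mostly-correct probabilities score much lower (lower is better)
confident = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.3])
loss_confident = brier_score_loss(y_true, confident)
print(loss_confident)  # ~0.047
```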
I'm trying to run your "Getting Started" example on the numerai training data and getting the following error:
Are you familiar with this type of issue?