jim-schwoebel opened this issue 4 years ago
Hello @jim-schwoebel
The problem is that the dataset that you provided is missing the problem
folder within it.
We also realized that the README pointed at 2 example datasets that were never included in the repository, so I just added them in PR #21.
Can you use them as an example to format yours and try again?
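For reference, the layout roughly follows the D3M seed dataset convention. Sketched here from 185_baseball as an orientation aid (not the authoritative spec, so the exact extra files may vary):

185_baseball/
├── 185_baseball_dataset/
│   ├── datasetDoc.json
│   └── tables/
│       └── learningData.csv
└── 185_baseball_problem/
    ├── problemDoc.json
    └── dataSplits.csv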
Absolutely - thanks for getting back so quickly. I'll let you know how it goes.
That was the main reason I was lost, really - the docs were missing there. I think I now have a much better idea of how the schema needs to be structured. I really like the work your lab has done here - it looks like an excellent way to represent multiple dataset types.
Ok I ran into another problem - just running the default example:
jim@DESKTOP-MBFTMVI:/mnt/c/users/jimsc/desktop/autobazaar$ abz search 185_baseball -c10,20,30 -b10
Using TensorFlow backend.
20200424162051101747 - Processing Datasets: ['185_baseball']
################################
#### Searching 185_baseball ####
################################
2020-04-24 12:20:51,108 - 408 - ERROR - search - Problem type not supported single_table/classification/multiClass
Dataset 185_baseball failed on step SEARCH with error UnsupportedProblem - single_table/classification/multiClass
Traceback (most recent call last):
File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/__main__.py", line 226, in _score_dataset
args.checkpoints, args.splits, args.db, args.tuner_type, args.test_id
File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/__main__.py", line 89, in _search_pipeline
return searcher.search(d3mds, template, budget=budget, checkpoints=checkpoints)
File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/search.py", line 442, in search
self._setup_search(d3mds, budget, checkpoints, template_name)
File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/search.py", line 405, in _setup_search
self.template_dict = self._get_template(template_name)
File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/search.py", line 258, in _get_template
raise UnsupportedProblem(problem_type)
autobazaar.search.UnsupportedProblem: single_table/classification/multiClass
pipeline score rank cv_score metric data_modality task_type task_subtype elapsed iterations load_time trivial_time cv_time error step
dataset
185_baseball NaN None None None f1Macro single_table classification multi_class 0.00767 None None None None UnsupportedProblem - single_table/classificati... SEARCH
Here is my current list of dependencies (pip3 list):
absl-py 0.9.0
adanet 0.8.0
aiohttp 3.6.2
alphapy 2.4.2
aniso8601 8.0.0
appnope 0.1.0
arrow 0.15.5
asn1crypto 0.24.0
astor 0.8.1
async-timeout 3.0.1
asyncio 3.4.3
atm 0.2.2
attrs 19.3.0
audioread 2.1.8
autobazaar 0.2.0
autogbt 0.0.1
autogluon 0.0.6
autokaggle 0.1.0
autokeras 1.0.0
Automat 0.6.0
autoPyTorch 0.0.2
backcall 0.1.0
baytune 0.2.5
bcrypt 3.1.7
beautifulsoup4 4.8.2
beautifultable 0.8.0
bleach 3.1.4
blinker 1.4
blis 0.4.1
bokeh 2.0.0
boto3 1.9.253
botocore 1.12.253
cachetools 4.0.0
catalogue 1.0.0
catboost 0.22
category-encoders 2.1.0
certifi 2019.11.28
cffi 1.14.0
chardet 3.0.4
click 7.1.1
cliff 3.1.0
cloud-init 19.4
cloudpickle 1.3.0
cmake 3.16.3
cmd2 0.8.9
colorama 0.4.3
colorlog 4.0.2
command-not-found 0.3
configobj 5.0.6
ConfigSpace 0.4.10
constantly 15.1.0
conv 0.2
coverage 4.5.4
cryptography 2.9
cvopt 0.4.3
cycler 0.10.0
cymem 2.0.3
Cython 0.29.15
dask 2.6.0
dataclasses 0.7
dcase-util 0.2.11
deap 1.3.1
decorator 4.4.2
defusedxml 0.6.0
deprecation 2.0.7
distributed 2.6.0
distro-info 0.18ubuntu0.18.04.1
docutils 0.15.2
EasyProcess 0.2.10
empyrical 0.5.3
en-core-web-sm 2.2.5
entrypoint2 0.2
entrypoints 0.3
enum34 1.1.10
eyeD3 0.9.4
fasteners 0.15
featuretools 0.11.0
ffmpeg-normalize 1.15.8
ffmpy 0.2.2
filelock 3.0.12
filetype 1.0.6
Flask 1.1.2
Flask-RESTful 0.3.8
Flask-Restless 0.17.0
flask-restless-swagger-2 0.0.3
Flask-SQLAlchemy 2.4.1
fsspec 0.7.3
funcsigs 1.0.2
funcy 1.14
future 0.18.2
fuzzywuzzy 0.18.0
gama 20.1.0
gast 0.2.2
gensim 3.8.1
gitdb 0.6.4
gitdb2 2.0.6
GitPython 3.0.2
gluoncv 0.6.0
gluonnlp 0.8.1
google 2.0.3
google-api-core 1.16.0
google-auth 1.11.3
google-auth-oauthlib 0.4.1
google-cloud-core 1.3.0
google-cloud-storage 1.26.0
google-pasta 0.2.0
google-resumable-media 0.5.0
googleapis-common-protos 1.51.0
GPy 1.9.9
GPyOpt 1.2.6
graphviz 0.8.4
grpcio 1.27.2
h5py 2.10.0
HeapDict 1.0.1
hpbandster 0.7.4
hpsklearn 0.0.3 /mnt/c/users/jimsc/desktop/allie/training/helpers/hyperopt-sklearn
httplib2 0.9.2
hyperlink 17.3.1
hyperopt 0.2.3
idna 2.9
idna-ssl 1.1.0
iexfinance 0.4.3
imageio 2.8.0
imageio-ffmpeg 0.4.1
imbalanced-learn 0.6.2
imblearn 0.0
importlib-metadata 1.5.0
incremental 16.10.1
ipykernel 5.2.1
ipython 7.13.0
ipython-genutils 0.2.0
ipywidgets 7.5.1
iso639 0.1.4
itsdangerous 1.1.0
jedi 0.16.0
Jinja2 2.11.1
jmespath 0.9.5
joblib 0.14.1
jsonpatch 1.16
jsonpointer 1.10
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.3
jupyter-console 6.1.0
jupyter-core 4.6.3
kafka-python 2.0.1
Keras 2.3.1
Keras-Applications 1.0.8
keras-compressor 0.0.1
Keras-Preprocessing 1.1.0
keras-squeezenet 0.4
keras-tuner 1.0.1
keyring 10.6.0
keyrings.alt 3.0
kiwisolver 1.1.0
kneed 0.6.0
langdetect 1.0.8
language-selector 0.1
liac-arff 2.4.0
librosa 0.6.2
lightfm 1.15
lightgbm 2.3.1
llvmlite 0.31.0
lockfile 0.12.2
ludwig 0.2.2.3
lxml 4.2.4
Markdown 3.2.1
markovify 0.8.0
MarkupSafe 1.1.1
matplotlib 3.0.3
mimerender 0.6.0
mistune 0.8.4
mit-d3m 0.2.1
mlblocks 0.3.4
mlbox 0.8.4
mlprimitives 0.2.4
mock 3.0.5
monotonic 1.5
more-itertools 8.2.0
MouseInfo 0.1.2
moviepy 1.0.1
msgpack 1.0.0
multidict 4.7.5
murmurhash 1.0.2
mxnet 1.6.0
natsort 7.0.1
nbconvert 5.6.1
nbformat 5.0.6
netifaces 0.10.4
networkx 2.2
neuraxle 0.4.0
nltk 3.4.5
nose 1.3.7
notebook 6.0.3
numba 0.48.0
numexpr 2.7.1
numpy 1.16.5
oauthlib 3.1.0
opencv-contrib-python 3.4.2.16
opencv-python 3.4.2.16
openml 0.10.2
opt-einsum 3.2.0
optuna 0.7.0
packaging 20.3
pafy 0.5.5
PAM 0.4.2
pandas 0.24.2
pandas-datareader 0.8.1
pandocfilters 1.4.2
paramiko 2.7.1
paramz 0.9.5
parso 0.6.2
patsy 0.5.1
pbr 5.4.5
PeakUtils 1.3.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 7.0.0
pip 20.0.2
plac 1.1.3
plotly 4.6.0
pluggy 0.13.1
pocketsphinx 0.1.15
portalocker 1.7.0
praat-parselmouth 0.3.3
preshed 3.0.2
prettytable 0.7.2
proglog 0.1.9
prometheus-client 0.7.1
prompt-toolkit 3.0.3
protobuf 3.11.3
psutil 5.7.0
ptyprocess 0.6.0
py 1.8.1
py-spy 0.3.3
pyaml 20.4.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
PyAudio 0.2.11
pyAudioAnalysis 0.2.5
PyAutoGUI 0.9.48
pycairo 1.16.2
pycparser 2.20
pycrypto 2.6.1
pydot-ng 2.0.0
pydub 0.23.1
pyfolio 0.9.2
PyGetWindow 0.0.8
Pygments 2.6.1
pygobject 3.26.1
PyJWT 1.5.3
pymongo 3.10.1
PyMsgBox 1.0.7
PyMySQL 0.9.3
PyNaCl 1.3.0
pynisher 0.5.0
pyOpenSSL 17.5.0
pyparsing 2.4.6
pyperclip 1.7.0
PyQt5 5.10.1
PyRect 0.1.4
Pyro4 4.79
pyrsistent 0.15.7
pyscreenshot 1.0
PyScreeze 0.1.26
pyserial 3.4
pytesseract 0.3.3
pytest 5.4.1
python-apt 1.6.5+ubuntu0.2
python-daemon 2.2.4
python-dateutil 2.8.0
python-debian 0.1.32
python-Levenshtein 0.12.0
python-louvain 0.13
python-magic 0.4.15
python-mimeparse 1.6.0
python-speech-features 0.6
python3-xlib 0.15
pytube 9.6.0
PyTweening 1.0.3
pytz 2019.3
PyWavelets 1.1.1
pyworld 0.2.8
pyxattr 0.6.0
pyxdg 0.25
PyYAML 5.3.1
pyzmq 19.0.0
qtconsole 4.7.3
QtPy 1.9.0
ray 0.8.2
readchar 2.0.1
redis 3.4.1
rednose 1.3.0
requests 2.23.0
requests-oauthlib 1.3.0
requests-unixsocket 0.1.5
resampy 0.2.2
retrying 1.3.3
rsa 4.0
ruptures 1.0.3
s3fs 0.4.2
s3transfer 0.2.1
safe-transformer 0.0.5
scikit-hyperband 0.0.1
scikit-image 0.14.5
scikit-learn 0.20.4
scikit-optimize 0.7.4
scikit-video 1.1.11
scipy 1.3.3
seaborn 0.10.0
SecretStorage 2.3.1
Send2Trash 1.5.0
serpent 1.30.2
service-identity 16.0.0
setproctitle 1.1.10
setuptools 46.1.3
simplejson 3.17.0
sip 4.19.8
six 1.14.0
sklearn 0.0
smart-open 1.10.0
smmap 3.0.2
smmap2 3.0.1
sortedcontainers 2.1.0
sounddevice 0.3.15
SoundFile 0.10.3.post1
soupsieve 2.0
sox 1.3.7
spacy 2.2.4
SpeechRecognition 3.8.1
SQLAlchemy 1.3.16
srsly 1.0.2
ssh-import-id 5.7
statsmodels 0.11.1
stevedore 1.32.0
stopit 1.1.2
subprocess32 3.5.4
systemd-python 234
tables 3.5.2
tabulate 0.8.7
tblib 1.6.0
tensorboard 1.15.0
tensorboard-logger 0.1.0
tensorflow 1.15.2
tensorflow-estimator 1.15.1
termcolor 1.1.0
terminado 0.8.3
terminaltables 3.1.0
termstyle 0.1.11
testpath 0.4.4
textblob 0.15.3
tf-slim 1.0
thinc 7.4.0
toolz 0.10.0
torch 1.5.0
torchvision 0.6.0
tornado 6.0.4
TPOT 0.11.1
tqdm 4.43.0
traitlets 4.3.3
Twisted 17.9.0
typing 3.7.4.1
typing-extensions 3.7.4.1
ufw 0.36
unattended-upgrades 0.1
Unidecode 1.1.1
update-checker 0.16
urllib3 1.25.8
uuid 1.30
validators 0.14.2
wasabi 0.6.0
Wave 0.0.2
wcwidth 0.1.9
webencodings 0.5.1
webrtcvad 2.0.10
Werkzeug 1.0.0
wget 3.2
wheel 0.30.0
widgetsnbextension 3.5.1
wrapt 1.12.1
xgboost 0.90
xlrd 1.2.0
XlsxWriter 1.2.8
xmltodict 0.12.0
yarl 1.4.2
yellowbrick 1.1
youtube-dl 2018.3.14
zict 2.0.0
zipp 3.1.0
zope.interface 4.3.2
I also tried on my Mac (in a virtual environment) and got the same error.
For this build, I started with the original requirements:
jimschwoebel@Jims-MBP autobazaar % virtualenv env
jimschwoebel@Jims-MBP autobazaar % source env/bin/activate
(env) jimschwoebel@Jims-MBP autobazaar % pip3 install autobazaar
I then got this error:
(env) jimschwoebel@Jims-MBP autobazaar % abz list
Traceback (most recent call last):
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/bin/abz", line 5, in <module>
from autobazaar.__main__ import main
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/__init__.py", line 16, in <module>
import git
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/git/__init__.py", line 38, in <module>
from git.exc import * # @NoMove @IgnorePep8
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/git/exc.py", line 9, in <module>
from git.compat import UnicodeMixin, safe_decode, string_types
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/git/compat.py", line 16, in <module>
from gitdb.utils.compat import (
ModuleNotFoundError: No module named 'gitdb.utils.compat'
It looked like a versioning issue with gitdb, so I downgraded it:
pip3 install gitdb==0.6.4
Datasets can now be found:
(env) jimschwoebel@Jims-MBP autobazaar % abz list
data_modality task_type task_subtype metric size_human train_samples
dataset
185_baseball single_table classification multi_class f1Macro 140K 1073
196_autoMpg single_table regression univariate meanSquaredError 24K 298
However, the error still arises:
(env) jimschwoebel@Jims-MBP autobazaar % abz search 185_baseball -c10,20,30 -b10
Using TensorFlow backend.
20200424165201388474 - Processing Datasets: ['185_baseball']
################################
#### Searching 185_baseball ####
################################
2020-04-24 12:52:01,399 - 5746 - ERROR - search - Problem type not supported single_table/classification/multiClass
Dataset 185_baseball failed on step SEARCH with error UnsupportedProblem - single_table/classification/multiClass
Traceback (most recent call last):
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/__main__.py", line 226, in _score_dataset
args.checkpoints, args.splits, args.db, args.tuner_type, args.test_id
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/__main__.py", line 89, in _search_pipeline
return searcher.search(d3mds, template, budget=budget, checkpoints=checkpoints)
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/search.py", line 442, in search
self._setup_search(d3mds, budget, checkpoints, template_name)
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/search.py", line 405, in _setup_search
self.template_dict = self._get_template(template_name)
File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/search.py", line 258, in _get_template
raise UnsupportedProblem(problem_type)
autobazaar.search.UnsupportedProblem: single_table/classification/multiClass
pipeline score rank cv_score metric data_modality task_type task_subtype elapsed iterations load_time trivial_time cv_time error step
dataset
185_baseball NaN None None None f1Macro single_table classification multi_class 0.005461 None None None None UnsupportedProblem - single_table/classificati... SEARCH
Thanks for the detailed report, @jim-schwoebel!
I figured out what the problem is. Would you mind trying to install from the repo itself instead of using the PyPI autobazaar version?
Inside the root of the repository, you can execute make install-develop and it will install the local version.
This should work without issues.
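Roughly, the steps would be (repository URL omitted here; clone the AutoBazaar GitHub repository and use a fresh virtualenv as before):

git clone <AutoBazaar repository URL>
cd AutoBazaar
virtualenv env
source env/bin/activate
make install-develop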
I'm also preparing a new release to PyPI that will fix the current error.
Awesome - I'll go ahead and do this now and let you know
Ok cool - I recloned the repo, set up a virtual environment, ran make install-develop, and ran the test datasets; everything seems to be working. Thanks for helping out here.
(env) jimschwoebel@Jims-MBP autobazaar % abz search 185_baseball -c10,20,30 -b10
Using TensorFlow backend.
20200424170301363738 - Processing Datasets: ['185_baseball']
################################
#### Searching 185_baseball ####
################################
2020-04-24 13:06:17,496 - 27303 - WARNING - search - Stop Time already passed. Stopping Search!
################################
#### Executing 185_baseball ####
################################
Executing best pipeline ABPipeline({
"primitives": [
"mlprimitives.custom.preprocessing.ClassEncoder",
"mlprimitives.custom.feature_extraction.CategoricalEncoder",
"sklearn.impute.SimpleImputer",
"sklearn.preprocessing.RobustScaler",
"xgboost.XGBClassifier",
"mlprimitives.custom.preprocessing.ClassDecoder"
],
"init_params": {},
"input_names": {},
"output_names": {},
"hyperparameters": {
"mlprimitives.custom.preprocessing.ClassEncoder#1": {},
"mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
"keep": false,
"copy": true,
"features": "auto",
"max_unique_ratio": 0,
"max_labels": 0
},
"sklearn.impute.SimpleImputer#1": {
"missing_values": NaN,
"fill_value": null,
"verbose": false,
"copy": true,
"strategy": "mean"
},
"sklearn.preprocessing.RobustScaler#1": {
"quantile_range": [
25.0,
75.0
],
"copy": true,
"with_centering": true,
"with_scaling": true
},
"xgboost.XGBClassifier#1": {
"n_jobs": -1,
"n_estimators": 300,
"max_depth": 3,
"learning_rate": 0.1,
"gamma": 0,
"min_child_weight": 1
},
"mlprimitives.custom.preprocessing.ClassDecoder#1": {}
},
"tunable_hyperparameters": {
"mlprimitives.custom.preprocessing.ClassEncoder#1": {},
"mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
"max_labels": {
"type": "int",
"default": 0,
"range": [
0,
100
]
}
},
"sklearn.impute.SimpleImputer#1": {
"strategy": {
"type": "str",
"default": "mean",
"values": [
"mean",
"median",
"most_frequent",
"constant"
]
}
},
"sklearn.preprocessing.RobustScaler#1": {
"with_centering": {
"description": "If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory",
"type": "bool",
"default": true
},
"with_scaling": {
"description": "If True, scale the data to interquartile range",
"type": "bool",
"default": true
}
},
"xgboost.XGBClassifier#1": {
"n_estimators": {
"type": "int",
"default": 100,
"range": [
10,
1000
]
},
"max_depth": {
"type": "int",
"default": 3,
"range": [
3,
10
]
},
"learning_rate": {
"type": "float",
"default": 0.1,
"range": [
0,
1
]
},
"gamma": {
"type": "float",
"default": 0,
"range": [
0,
1
]
},
"min_child_weight": {
"type": "int",
"default": 1,
"range": [
1,
10
]
}
},
"mlprimitives.custom.preprocessing.ClassDecoder#1": {}
},
"outputs": {
"default": [
{
"name": "y",
"type": "ndarray",
"variable": "mlprimitives.custom.preprocessing.ClassDecoder#1.y"
}
]
},
"id": "47fe3473-908e-463e-8956-c1ead391a44a",
"name": "single_table/classification/default",
"template": null,
"loader": {
"data_modality": "single_table",
"task_type": "classification"
},
"score": 0.6325421243549755,
"rank": 0.3674578756453524,
"metric": "f1Macro"
})
##############################
#### Scoring 185_baseball ####
##############################
Score: 0.7003230687441212
predictions targets
count 267.000000 267.000000
mean 0.086142 0.146067
std 0.373052 0.480066
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 0.000000
max 2.000000 2.000000
pipeline score rank cv_score metric data_modality task_type task_subtype elapsed iterations load_time trivial_time cv_time error step
dataset
185_baseball 47fe3473-908e-463e-8956-c1ead391a44a 0.700323 0.367458 0.632542 f1Macro single_table classification multi_class 154.047028 1.0 0.017557 0.083713 143.929863 None None
The list of dependencies is attached below in case anyone needs it (output as requirements.txt): requirements.txt
Great! I'm glad it helped!
I'll leave this open until we make the new release and this is fixed in the PyPI version.
So I finally got all this to work locally, and transformed the data to enable model training with any arbitrary dataset that I've created.
I'm running into some trouble pickling the models and making predictions. Are the params and pickle files ready to make predictions?
I have attached the input and output folders here to give you more context; I figure this may come up for others as well.
(terminal output below from training session):
20200424233839634675 - Processing Datasets: ['Battlecry_Cashregister_standard_features_btb_classification']
###############################################################################
#### Searching Battlecry_Cashregister_standard_features_btb_classification ####
###############################################################################
###############################################################################
#### Executing Battlecry_Cashregister_standard_features_btb_classification ####
###############################################################################
Executing best pipeline ABPipeline({
"primitives": [
"mlprimitives.custom.preprocessing.ClassEncoder",
"mlprimitives.custom.feature_extraction.CategoricalEncoder",
"sklearn.impute.SimpleImputer",
"sklearn.preprocessing.RobustScaler",
"xgboost.XGBClassifier",
"mlprimitives.custom.preprocessing.ClassDecoder"
],
"init_params": {},
"input_names": {},
"output_names": {},
"hyperparameters": {
"mlprimitives.custom.preprocessing.ClassEncoder#1": {},
"mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
"keep": false,
"copy": true,
"features": "auto",
"max_unique_ratio": 0,
"max_labels": 1
},
"sklearn.impute.SimpleImputer#1": {
"missing_values": NaN,
"fill_value": null,
"verbose": false,
"copy": true,
"strategy": "most_frequent"
},
"sklearn.preprocessing.RobustScaler#1": {
"quantile_range": [
25.0,
75.0
],
"copy": true,
"with_centering": false,
"with_scaling": false
},
"xgboost.XGBClassifier#1": {
"n_jobs": -1,
"n_estimators": 301,
"max_depth": 5,
"learning_rate": 0.3170186161309039,
"gamma": 0.4698212882025645,
"min_child_weight": 3
},
"mlprimitives.custom.preprocessing.ClassDecoder#1": {}
},
"tunable_hyperparameters": {
"mlprimitives.custom.preprocessing.ClassEncoder#1": {},
"mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
"max_labels": {
"type": "int",
"default": 0,
"range": [
0,
100
]
}
},
"sklearn.impute.SimpleImputer#1": {
"strategy": {
"type": "str",
"default": "mean",
"values": [
"mean",
"median",
"most_frequent",
"constant"
]
}
},
"sklearn.preprocessing.RobustScaler#1": {
"with_centering": {
"description": "If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory",
"type": "bool",
"default": true
},
"with_scaling": {
"description": "If True, scale the data to interquartile range",
"type": "bool",
"default": true
}
},
"xgboost.XGBClassifier#1": {
"n_estimators": {
"type": "int",
"default": 100,
"range": [
10,
1000
]
},
"max_depth": {
"type": "int",
"default": 3,
"range": [
3,
10
]
},
"learning_rate": {
"type": "float",
"default": 0.1,
"range": [
0,
1
]
},
"gamma": {
"type": "float",
"default": 0,
"range": [
0,
1
]
},
"min_child_weight": {
"type": "int",
"default": 1,
"range": [
1,
10
]
}
},
"mlprimitives.custom.preprocessing.ClassDecoder#1": {}
},
"outputs": {
"default": [
{
"name": "y",
"type": "ndarray",
"variable": "mlprimitives.custom.preprocessing.ClassDecoder#1.y"
}
]
},
"id": "b4387b9f-a9d6-4f24-8b39-3558fe0c116c",
"name": "single_table/classification/default",
"template": null,
"loader": {
"data_modality": "single_table",
"task_type": "classification"
},
"score": 0.9888888888888889,
"rank": 0.011111111111134577,
"metric": "accuracy"
})
#############################################################################
#### Scoring Battlecry_Cashregister_standard_features_btb_classification ####
#############################################################################
Score: 0.9565217391304348
predictions targets
count 23.000000 23.000000
mean 0.347826 0.304348
std 0.486985 0.470472
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 1.000000 1.000000
max 1.000000 1.000000
pipeline score rank cv_score metric data_modality task_type task_subtype elapsed iterations load_time trivial_time cv_time error step
dataset
Battlecry_Cashregister_standard_features_btb_cl... b4387b9f-a9d6-4f24-8b39-3558fe0c116c 0.956522 0.011111 0.988889 accuracy single_table classification multi_class 15.01244 10.0 0.030958 0.049371 14.901585 None None
Here is the zipped model file and the JSON of tunable parameters: 033e915d-da76-4601-9e5f-93978863f825.zip
When I load the model with something like:
import os
import pickle

# in the OUTPUT folder, os.listdir() shows:
# ['b4387b9f-a9d6-4f24-8b39-3558fe0c116c.json', 'b4387b9f-a9d6-4f24-8b39-3558fe0c116c.pkl']
picklefile = 'b4387b9f-a9d6-4f24-8b39-3558fe0c116c.pkl'
model = pickle.load(open(picklefile, 'rb'))
model.predict(X)
I get this error:
File "/home/jim/.local/lib/python3.6/site-packages/mit_d3m/loaders.py", line 389, in load
X, y = d3mds.get_data()
AttributeError: 'numpy.ndarray' object has no attribute 'get_data'
Perhaps I'm not understanding how to load models using the schema, or something is going on with the directory structure?
Hi @jim-schwoebel
The problem is that the predict
method of the dumped AutoBazaar Pipeline does not expect the raw data as input, but rather a D3MDS object. This is because this method is currently mainly used during the validation step, with the validation data being passed as a D3MDS object:
In [1]: import pickle
In [2]: model = pickle.load(open('output/18d11627-47b6-4762-bcb3-8e6b4d632a5b.pkl', 'rb'))
In [3]: model.predict?
Signature: model.predict(d3mds)
Docstring: Get predictions for the given D3MDS.
File: ~/Projects/MIT/AutoBazaar/autobazaar/pipeline.py
Type: method
However, you can still access the predict method of the underlying MLBlocks pipeline through the pipeline attribute:
In [4]: model.pipeline.predict?
Signature: model.pipeline.predict(X=None, output_='default', start_=None, **kwargs)
Docstring:
Produce predictions using the blocks of this pipeline.
Sequentially call the ``produce`` method of each block, capturing the
outputs before calling the next one.
During the whole process a context dictionary is built, where both the
passed arguments and the captured outputs of the ``produce`` methods
are stored, and from which the arguments for the next ``produce`` calls
will be taken.
Args:
X:
Data which the pipeline will use to make predictions.
output_ (str or int or list or None):
Output specification, as required by ``get_outputs``. If not specified
the ``default`` output will be returned.
start_ (str or int or None):
Block index or block name to start processing from. The
value can either be an integer, which will be interpreted as a block index,
or the name of a block, including the counter number at the end.
If given, the execution of the pipeline will start on the specified block,
and all the blocks before that one will be skipped.
**kwargs:
Any additional keyword arguments will be directly added
to the context dictionary and available for the blocks.
Returns:
object or tuple:
* If a single output is requested, it is returned alone.
* If multiple outputs have been requested, a tuple is returned.
File: ~/.virtualenvs/AutoBazaar/lib/python3.6/site-packages/mlblocks/mlpipeline.py
Type: method
So, when it comes to making predictions, you have two options: pass a D3MDS object to model.predict (as is done internally during validation), or pass the raw data directly to model.pipeline.predict.
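A minimal sketch of the second option (the pickle filename and CSV path here are just illustrative, and X is assumed to contain the same feature columns the pipeline was fitted on):

import pickle

import pandas as pd

with open('output/b4387b9f-a9d6-4f24-8b39-3558fe0c116c.pkl', 'rb') as f:
    model = pickle.load(f)

# raw feature table with the columns used during training
X = pd.read_csv('tables/learningData.csv')
predictions = model.pipeline.predict(X)
print(predictions[:10])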
Okay awesome - that makes a bit more sense. I'll try this with the new docs ^^ and let you know if I have any further issues.
Hi @jim-schwoebel, I am actually wondering how you filled in the datasetDoc.json when you have more than 200 attribute columns. I've tried to load my own dataset, however it didn't work for me. Are there any particular files missing?
Pasting some custom code I wrote below that may be useful if you are formatting your own datasets for this ML framework. Note that you must specify whether the problem is classification or regression, along with its metrics, using the D3M schema format:
import json

import pandas as pd


def create_dataset_json(foldername, trainingcsv):
    # create the datasetDoc.json template necessary for the featurization
    dataset_name = foldername
    dataset_id = "%s_dataset" % (foldername)
    columns = list()
    colnames = list(pd.read_csv(trainingcsv))
    for i in range(len(colnames)):
        if colnames[i] != 'class_':
            columns.append({"colIndex": i,
                            "colName": colnames[i],
                            "colType": "real",
                            "role": ["attribute"]})
        else:
            # the target column; remember its index for problemDoc.json
            columns.append({"colIndex": i,
                            "colName": 'class_',
                            "colType": "real",
                            "role": ["suggestedTarget"]})
            i1 = i
    data = {
        "about": {
            "datasetID": dataset_id,
            "datasetName": dataset_name,
            "humanSubjectsResearch": False,
            "license": "CC",
            "datasetSchemaVersion": "3.0",
            "redacted": False
        },
        "dataResources": [
            {
                "resID": "0",
                "resPath": 'tables/learningData.csv',
                "resType": "table",
                "resFormat": ["text/csv"],
                "isCollection": False,
                "columns": columns,
            }
        ]
    }
    filename = 'datasetDoc.json'
    with open(filename, 'w') as jsonfile:
        json.dump(data, jsonfile)
    return dataset_id, filename, i1
def create_problem_json(mtype, folder, i1):
    # mtype: 'c' for classification, 'r' for regression
    if mtype == 'c':
        data = {
            "about": {
                "problemID": "%s_problem" % (folder),
                "problemName": "%s_problem" % (folder),
                "problemDescription": "not applicable",
                "taskType": "classification",
                "taskSubType": "multiClass",
                "problemVersion": "1.0",
                "problemSchemaVersion": "3.0"
            },
            "inputs": {
                "data": [
                    {
                        # must match the datasetID written in datasetDoc.json
                        "datasetID": "%s_dataset" % (folder),
                        "targets": [
                            {
                                "targetIndex": 0,
                                "resID": "0",
                                "colIndex": i1,
                                "colName": 'class_',
                            }
                        ]
                    }
                ],
                "dataSplits": {
                    "method": "holdOut",
                    "testSize": 0.2,
                    "stratified": True,
                    "numRepeats": 0,
                    "randomSeed": 42,
                    "splitsFile": "dataSplits.csv"
                },
                "performanceMetrics": [
                    {
                        "metric": "accuracy"
                    }
                ]
            },
            "expectedOutputs": {
                "predictionsFile": "predictions.csv"
            }
        }
    elif mtype == 'r':
        data = {
            "about": {
                "problemID": "%s_problem" % (folder),
                "problemName": "%s_problem" % (folder),
                "problemDescription": "not applicable",
                "taskType": "regression",
                "taskSubType": "univariate",
                "problemVersion": "1.0",
                "problemSchemaVersion": "3.0"
            },
            "inputs": {
                "data": [
                    {
                        "datasetID": "%s_dataset" % (folder),
                        "targets": [
                            {
                                "targetIndex": 0,
                                "resID": "0",
                                "colIndex": i1,
                                "colName": "class_"
                            }
                        ]
                    }
                ],
                "dataSplits": {
                    "method": "holdOut",
                    "testSize": 0.2,
                    "stratified": True,
                    "numRepeats": 0,
                    "randomSeed": 42,
                    "splitsFile": "dataSplits.csv"
                },
                "performanceMetrics": [
                    {
                        "metric": "meanSquaredError"
                    }
                ]
            },
            "expectedOutputs": {
                "predictionsFile": "predictions.csv"
            }
        }
    with open('problemDoc.json', 'w') as jsonfile:
        json.dump(data, jsonfile)
Feel free to use this if it helps you format the datasetDoc.json and problemDoc.json for a table of numeric features; a quick usage sketch is below.
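For example, assuming a folder mydataset/ containing tables/learningData.csv with real-valued feature columns plus a class_ target column (the names here are hypothetical), the helpers above could be called roughly like this; the generated datasetDoc.json and problemDoc.json then need to be placed into the corresponding _dataset and _problem folders:

import os

foldername = 'mydataset'
trainingcsv = os.path.join(foldername, 'tables', 'learningData.csv')

# write datasetDoc.json and remember the target column index
dataset_id, filename, i1 = create_dataset_json(foldername, trainingcsv)

# write problemDoc.json ('c' for classification, 'r' for regression)
create_problem_json('c', foldername, i1)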
@jim-schwoebel thank you so much, I'll try it and let you know :)
@jim-schwoebel @MariumAZ I put together another approach to formatting a CSV file in the D3M format, with subdirectories for splits, that may be useful: https://gist.github.com/micahjsmith/95f5a7e3ef514660123aad1039d04a6d
Hello,
Thanks for making this repository.
I have attached a dataset I've been trying to load into AutoBazaar. I think I formatted everything according to the schema; however, for some reason I can't get the CLI interface to recognize it.
3d90baf0-53b9-44a0-9dc7-438b7951aec5.zip
Output of python -c 'import platform;print(platform.platform())': 'Linux-4.4.0-17763-Microsoft-x86_64-with-Ubuntu-18.04-bionic'

d90baf0-53b9-44a0-9dc7-438b7951aec5$ abz list
No matching datasets found
Any ideas?