Trouble loading datasets with schema #20

Open jim-schwoebel opened 4 years ago

jim-schwoebel commented 4 years ago


Thanks for making this repository.

I have attached a dataset I've been trying to load into AutoBazaar. I think I formatted everything according to the schema; however, for some reason I can't get the CLI interface to recognize it.

d90baf0-53b9-44a0-9dc7-438b7951aec5$ abz list
No matching datasets found

Any ideas?

csala commented 4 years ago

Hello @jim-schwoebel

The problem is that the dataset that you provided is missing the problem folder within it.

We also realized that the README pointed at 2 example datasets that were never included in the repository, so I just added them in PR #21.

Can you use them as an example to format yours and try again?
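For anyone hitting the same `No matching datasets found` message, a quick pre-check of the dataset folder is sketched below. It assumes the usual D3M layout, where a dataset `NAME` contains `NAME_dataset/datasetDoc.json` and `NAME_problem/problemDoc.json`. Those paths and the helper name `missing_d3m_parts` are assumptions for illustration, not something AutoBazaar exposes, so compare against the example datasets from the PR.

```python
import os


def missing_d3m_parts(dataset_root):
    """Return the expected files that are absent from a D3M-style dataset folder.

    Assumed layout (hypothetical helper, not part of AutoBazaar):
        <name>/<name>_dataset/datasetDoc.json
        <name>/<name>_problem/problemDoc.json
    """
    name = os.path.basename(os.path.normpath(dataset_root))
    expected = [
        os.path.join(name + "_dataset", "datasetDoc.json"),
        os.path.join(name + "_problem", "problemDoc.json"),
    ]
    # Keep only the relative paths that do not exist on disk
    return [
        relpath for relpath in expected
        if not os.path.isfile(os.path.join(dataset_root, relpath))
    ]
```

An empty list means both documents were found; a dataset without its problem folder would report the `_problem` entry.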

jim-schwoebel commented 4 years ago

Absolutely - thanks for getting back so quickly. I'll let you know how it goes.

jim-schwoebel commented 4 years ago

That was the main reason I was lost really - the docs were missing there. I think I have a much better idea on how the schema needs to be structured. I really like the work your lab has done here - looks like an excellent way to represent multiple dataset types, etc.

jim-schwoebel commented 4 years ago

Ok I ran into another problem - just running the default example:

jim@DESKTOP-MBFTMVI:/mnt/c/users/jimsc/desktop/autobazaar$ abz search 185_baseball -c10,20,30 -b10
Using TensorFlow backend.
20200424162051101747 - Processing Datasets: ['185_baseball']
#### Searching 185_baseball ####
2020-04-24 12:20:51,108 - 408 - ERROR - search - Problem type not supported single_table/classification/multiClass
Dataset 185_baseball failed on step SEARCH with error UnsupportedProblem - single_table/classification/multiClass
Traceback (most recent call last):
  File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/", line 226, in _score_dataset
    args.checkpoints, args.splits, args.db, args.tuner_type, args.test_id
  File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/", line 89, in _search_pipeline
    return, template, budget=budget, checkpoints=checkpoints)
  File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/", line 442, in search
    self._setup_search(d3mds, budget, checkpoints, template_name)
  File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/", line 405, in _setup_search
    self.template_dict = self._get_template(template_name)
  File "/home/jim/.local/lib/python3.6/site-packages/autobazaar/", line 258, in _get_template
    raise UnsupportedProblem(problem_type) single_table/classification/multiClass
              pipeline score  rank cv_score   metric data_modality       task_type task_subtype  elapsed iterations load_time trivial_time cv_time                                              error    step
185_baseball       NaN  None  None     None  f1Macro  single_table  classification  multi_class  0.00767       None      None         None    None  UnsupportedProblem - single_table/classificati...  SEARCH

Here is my current list of dependencies (pip3 list):

jim-schwoebel commented 4 years ago

I also tried on my mac computer (in virtual environment) and have the same error.

For this build, I started with the original requirements:

jimschwoebel@Jims-MBP autobazaar % virtualenv env  
jimschwoebel@Jims-MBP autobazaar % source env/bin/activate  
(env) jimschwoebel@Jims-MBP autobazaar % pip3 install autobazaar 

I then got this error:

(env) jimschwoebel@Jims-MBP autobazaar % abz list                               
Traceback (most recent call last):
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/bin/abz", line 5, in <module>
    from autobazaar.__main__ import main
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/", line 16, in <module>
    import git
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/git/", line 38, in <module>
    from git.exc import *                       # @NoMove @IgnorePep8
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/git/", line 9, in <module>
    from git.compat import UnicodeMixin, safe_decode, string_types
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/git/", line 16, in <module>
    from gitdb.utils.compat import (
ModuleNotFoundError: No module named 'gitdb.utils.compat'

It looked like a versioning issue with gitdb, so I downgraded it:

pip3 install gitdb==0.6.4
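A small check like the one below can confirm whether the import that failed in the traceback resolves inside the active virtual environment. The helper name `can_import` is made up for illustration; it only wraps the standard `importlib.import_module` call.

```python
import importlib


def can_import(module_name):
    """Return True if module_name can be imported in the current environment."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError
        return False
    return True
```

Calling `can_import("gitdb.utils.compat")` before and after the downgrade shows whether the missing module is the only thing blocking `abz list`.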

Datasets can now be found:

(env) jimschwoebel@Jims-MBP autobazaar % abz list
             data_modality       task_type task_subtype            metric size_human  train_samples
185_baseball  single_table  classification  multi_class           f1Macro       140K           1073
196_autoMpg   single_table      regression   univariate  meanSquaredError        24K            298

However, the error still arises:

(env) jimschwoebel@Jims-MBP autobazaar % abz search 185_baseball -c10,20,30 -b10
Using TensorFlow backend.
20200424165201388474 - Processing Datasets: ['185_baseball']
#### Searching 185_baseball ####
2020-04-24 12:52:01,399 - 5746 - ERROR - search - Problem type not supported single_table/classification/multiClass
Dataset 185_baseball failed on step SEARCH with error UnsupportedProblem - single_table/classification/multiClass
Traceback (most recent call last):
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/", line 226, in _score_dataset
    args.checkpoints, args.splits, args.db, args.tuner_type, args.test_id
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/", line 89, in _search_pipeline
    return, template, budget=budget, checkpoints=checkpoints)
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/", line 442, in search
    self._setup_search(d3mds, budget, checkpoints, template_name)
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/", line 405, in _setup_search
    self.template_dict = self._get_template(template_name)
  File "/Users/jimschwoebel/Desktop/AutoBazaar/env/lib/python3.7/site-packages/autobazaar/", line 258, in _get_template
    raise UnsupportedProblem(problem_type) single_table/classification/multiClass
              pipeline score  rank cv_score   metric data_modality       task_type task_subtype   elapsed iterations load_time trivial_time cv_time                                              error    step
185_baseball       NaN  None  None     None  f1Macro  single_table  classification  multi_class  0.005461       None      None         None    None  UnsupportedProblem - single_table/classificati...  SEARCH

csala commented 4 years ago

Thanks for the detailed report @jim-schwoebel!

I figured out what the problem is. Would you mind trying to install from the repo itself instead of using the pypi autobazaar version?

Inside the root of the repository, you can execute make install-develop and it will install the local version.

This should work without issues.

I'm also preparing a new release to PyPI that will fix the current error.

jim-schwoebel commented 4 years ago

Awesome - I'll go ahead and do this now and let you know

jim-schwoebel commented 4 years ago

Ok cool - I recloned the repo, set up a virtual environment with (make install-develop) and ran the test datasets and everything seems to be working. Thanks for helping out here

(env) jimschwoebel@Jims-MBP autobazaar % abz search 185_baseball -c10,20,30 -b10
Using TensorFlow backend.
20200424170301363738 - Processing Datasets: ['185_baseball']
#### Searching 185_baseball ####

2020-04-24 13:06:17,496 - 27303 - WARNING - search - Stop Time already passed. Stopping Search!
#### Executing 185_baseball ####
Executing best pipeline ABPipeline({
    "primitives": [
    "init_params": {},
    "input_names": {},
    "output_names": {},
    "hyperparameters": {
        "mlprimitives.custom.preprocessing.ClassEncoder#1": {},
        "mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
            "keep": false,
            "copy": true,
            "features": "auto",
            "max_unique_ratio": 0,
            "max_labels": 0
        "sklearn.impute.SimpleImputer#1": {
            "missing_values": NaN,
            "fill_value": null,
            "verbose": false,
            "copy": true,
            "strategy": "mean"
        "sklearn.preprocessing.RobustScaler#1": {
            "quantile_range": [
            "copy": true,
            "with_centering": true,
            "with_scaling": true
        "xgboost.XGBClassifier#1": {
            "n_jobs": -1,
            "n_estimators": 300,
            "max_depth": 3,
            "learning_rate": 0.1,
            "gamma": 0,
            "min_child_weight": 1
        "mlprimitives.custom.preprocessing.ClassDecoder#1": {}
    "tunable_hyperparameters": {
        "mlprimitives.custom.preprocessing.ClassEncoder#1": {},
        "mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
            "max_labels": {
                "type": "int",
                "default": 0,
                "range": [
        "sklearn.impute.SimpleImputer#1": {
            "strategy": {
                "type": "str",
                "default": "mean",
                "values": [
        "sklearn.preprocessing.RobustScaler#1": {
            "with_centering": {
                "description": "If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory",
                "type": "bool",
                "default": true
            "with_scaling": {
                "description": "If True, scale the data to interquartile range",
                "type": "bool",
                "default": true
        "xgboost.XGBClassifier#1": {
            "n_estimators": {
                "type": "int",
                "default": 100,
                "range": [
            "max_depth": {
                "type": "int",
                "default": 3,
                "range": [
            "learning_rate": {
                "type": "float",
                "default": 0.1,
                "range": [
            "gamma": {
                "type": "float",
                "default": 0,
                "range": [
            "min_child_weight": {
                "type": "int",
                "default": 1,
                "range": [
        "mlprimitives.custom.preprocessing.ClassDecoder#1": {}
    "outputs": {
        "default": [
                "name": "y",
                "type": "ndarray",
                "variable": "mlprimitives.custom.preprocessing.ClassDecoder#1.y"
    "id": "47fe3473-908e-463e-8956-c1ead391a44a",
    "name": "single_table/classification/default",
    "template": null,
    "loader": {
        "data_modality": "single_table",
        "task_type": "classification"
    "score": 0.6325421243549755,
    "rank": 0.3674578756453524,
    "metric": "f1Macro"
#### Scoring 185_baseball ####
Score: 0.7003230687441212
       predictions     targets
count   267.000000  267.000000
mean      0.086142    0.146067
std       0.373052    0.480066
min       0.000000    0.000000
25%       0.000000    0.000000
50%       0.000000    0.000000
75%       0.000000    0.000000
max       2.000000    2.000000
                                          pipeline     score      rank  cv_score   metric data_modality       task_type task_subtype     elapsed  iterations  load_time  trivial_time     cv_time error  step
185_baseball  47fe3473-908e-463e-8956-c1ead391a44a  0.700323  0.367458  0.632542  f1Macro  single_table  classification  multi_class  154.047028         1.0   0.017557      0.083713  143.929863  None  None

The list of dependencies is below in case anyone needs it (output as requirements.txt):

requirements.txt

csala commented 4 years ago

Great! I'm glad it helped!

I'll leave this open until we make the new release and this is fixed in the PyPI version.

jim-schwoebel commented 4 years ago

So I finally got all this to work locally, and transformed the data to enable model training with any arbitrary dataset that I've created.

I'm running into some trouble pickling the models and making predictions. Are the params and pickle files ready to make predictions?

I have attached the input and output folders here locally to give you more context.

I figure this may come up again from others

(terminal output below from training session):

20200424233839634675 - Processing Datasets: ['Battlecry_Cashregister_standard_features_btb_classification']
#### Searching Battlecry_Cashregister_standard_features_btb_classification ####
#### Executing Battlecry_Cashregister_standard_features_btb_classification ####
Executing best pipeline ABPipeline({
    "primitives": [
    "init_params": {},
    "input_names": {},
    "output_names": {},
    "hyperparameters": {
        "mlprimitives.custom.preprocessing.ClassEncoder#1": {},
        "mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
            "keep": false,
            "copy": true,
            "features": "auto",
            "max_unique_ratio": 0,
            "max_labels": 1
        "sklearn.impute.SimpleImputer#1": {
            "missing_values": NaN,
            "fill_value": null,
            "verbose": false,
            "copy": true,
            "strategy": "most_frequent"
        "sklearn.preprocessing.RobustScaler#1": {
            "quantile_range": [
            "copy": true,
            "with_centering": false,
            "with_scaling": false
        "xgboost.XGBClassifier#1": {
            "n_jobs": -1,
            "n_estimators": 301,
            "max_depth": 5,
            "learning_rate": 0.3170186161309039,
            "gamma": 0.4698212882025645,
            "min_child_weight": 3
        "mlprimitives.custom.preprocessing.ClassDecoder#1": {}
    "tunable_hyperparameters": {
        "mlprimitives.custom.preprocessing.ClassEncoder#1": {},
        "mlprimitives.custom.feature_extraction.CategoricalEncoder#1": {
            "max_labels": {
                "type": "int",
                "default": 0,
                "range": [
        "sklearn.impute.SimpleImputer#1": {
            "strategy": {
                "type": "str",
                "default": "mean",
                "values": [
        "sklearn.preprocessing.RobustScaler#1": {
            "with_centering": {
                "description": "If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory",
                "type": "bool",
                "default": true
            "with_scaling": {
                "description": "If True, scale the data to interquartile range",
                "type": "bool",
                "default": true
        "xgboost.XGBClassifier#1": {
            "n_estimators": {
                "type": "int",
                "default": 100,
                "range": [
            "max_depth": {
                "type": "int",
                "default": 3,
                "range": [
            "learning_rate": {
                "type": "float",
                "default": 0.1,
                "range": [
            "gamma": {
                "type": "float",
                "default": 0,
                "range": [
            "min_child_weight": {
                "type": "int",
                "default": 1,
                "range": [
        "mlprimitives.custom.preprocessing.ClassDecoder#1": {}
    "outputs": {
        "default": [
                "name": "y",
                "type": "ndarray",
                "variable": "mlprimitives.custom.preprocessing.ClassDecoder#1.y"
    "id": "b4387b9f-a9d6-4f24-8b39-3558fe0c116c",
    "name": "single_table/classification/default",
    "template": null,
    "loader": {
        "data_modality": "single_table",
        "task_type": "classification"
    "score": 0.9888888888888889,
    "rank": 0.011111111111134577,
    "metric": "accuracy"
#### Scoring Battlecry_Cashregister_standard_features_btb_classification ####
Score: 0.9565217391304348
       predictions    targets
count    23.000000  23.000000
mean      0.347826   0.304348
std       0.486985   0.470472
min       0.000000   0.000000
25%       0.000000   0.000000
50%       0.000000   0.000000
75%       1.000000   1.000000
max       1.000000   1.000000
                                                                                pipeline     score      rank  cv_score    metric data_modality       task_type task_subtype   elapsed  iterations  load_time  trivial_time    cv_time error  step
Battlecry_Cashregister_standard_features_btb_cl...  b4387b9f-a9d6-4f24-8b39-3558fe0c116c  0.956522  0.011111  0.988889  accuracy  single_table  classification  multi_class  15.01244        10.0   0.030958      0.049371  14.901585  None  None

jim-schwoebel commented 4 years ago

Here is the .zipped model file and .JSON tune-able parameters.

When I load the model with something like:

import os
import pickle

# in OUTPUT folder
# ---------------------
# listdir = os.listdir()
# --> ['b4387b9f-a9d6-4f24-8b39-3558fe0c116c.json', 'b4387b9f-a9d6-4f24-8b39-3558fe0c116c.pkl']

# the .pkl file listed above
picklefile = 'b4387b9f-a9d6-4f24-8b39-3558fe0c116c.pkl'
with open(picklefile, 'rb') as f:
    model = pickle.load(f)

--> I get error:
  File "/home/jim/.local/lib/python3.6/site-packages/mit_d3m/", line 389, in load
    X, y = d3mds.get_data()
AttributeError: 'numpy.ndarray' object has no attribute 'get_data'

Perhaps I'm not understanding everything in how to load models using the schema - or something with the directory structure is going on?

csala commented 4 years ago

Hi @jim-schwoebel

The problem is that the predict method of the dumped AutoBazaar Pipeline does not expect the raw data as input, but rather a D3MDS object. This is because this method is currently mainly used during the validation step, with the validation data being passed as a D3MDS object:

In [1]: import pickle                                                                                                                 

In [2]: model = pickle.load(open('output/18d11627-47b6-4762-bcb3-8e6b4d632a5b.pkl', 'rb'))                                            

In [3]: model.predict?                                                                                                                
Signature: model.predict(d3mds)
Docstring: Get predictions for the given D3MDS.
File:      ~/Projects/MIT/AutoBazaar/autobazaar/
Type:      method

However, you can still reach the predict method of the underlying MLBlocks pipeline through the pipeline attribute:

In [4]: model.pipeline.predict?                                                                                                       
Signature: model.pipeline.predict(X=None, output_='default', start_=None, **kwargs)
Produce predictions using the blocks of this pipeline.

Sequentially call the ``produce`` method of each block, capturing the
outputs before calling the next one.

During the whole process a context dictionary is built, where both the
passed arguments and the captured outputs of the ``produce`` methods
are stored, and from which the arguments for the next ``produce`` calls
will be taken.

        Data which the pipeline will use to make predictions.

    output_ (str or int or list or None):
        Output specification, as required by ``get_outputs``. If not specified
        the ``default`` output will be returned.

    start_ (str or int or None):
        Block index or block name to start processing from. The
        value can either be an integer, which will be interpreted as a block index,
        or the name of a block, including the conter number at the end.
        If given, the execution of the pipeline will start on the specified block,
        and all the blocks before that one will be skipped.

        Any additional keyword arguments will be directly added
        to the context dictionary and available for the blocks.

    object or tuple:
        * If a single output is requested, it is returned alone.
        * If multiple outputs have been requested, a tuple is returned.
File:      ~/.virtualenvs/AutoBazaar/lib/python3.6/site-packages/mlblocks/
Type:      method

So, when it comes to making predictions you have two options:

  1. Craft a D3MDS object with the data that you want to make predictions on (practical for testing and validation, not practical for final application use cases)
  2. Pass your data directly to the model.pipeline.predict
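Option 2 might look roughly like the sketch below. It relies only on what is shown above: the pickled object has a `pipeline` attribute whose `predict` accepts `X`. The function names, and the idea that `X` is the same kind of feature table used for training, are assumptions. Loading the pickle also requires AutoBazaar and its dependencies to be importable.

```python
import pickle


def load_model(pickle_path):
    """Load the pipeline object that AutoBazaar dumped to the output folder."""
    with open(pickle_path, "rb") as pickle_file:
        return pickle.load(pickle_file)


def predict_raw(model, X):
    """Predict on raw data via the underlying MLBlocks pipeline.

    model.predict expects a D3MDS object, so this goes through
    model.pipeline.predict instead.
    """
    return model.pipeline.predict(X)
```

Usage would be along the lines of `predict_raw(load_model(path_to_pkl), X)`, where `path_to_pkl` points at the `.pkl` file in your own output folder.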
jim-schwoebel commented 4 years ago

Okay awesome - that makes a bit more sense. I'll try this with the new docs ^^ and let you know if I have any further issues.

MariumAZ commented 4 years ago

Hi @jim-schwoebel, I am wondering how you filled in the datasetDoc.json when you have more than 200 attribute columns. I've tried to upload my own dataset, but it didn't work for me. Are there any particular files missing? (attached image: tree_view)

jim-schwoebel commented 4 years ago

Pasting some custom code I wrote below that may be useful if you are formatting your own datasets for this ML framework. Note that you must specify whether the problem is classification or regression with some metrics using the D3M Schema Format:

def create_dataset_json(foldername, trainingcsv):

    # create the template .JSON file necessary for the featurization


    for i in range(len(colnames)):
        if colnames[i] != 'class_':
            # every non-target column is a numeric attribute
            columns.append({"colIndex": i,
                        "colName": colnames[i],
                        "colType": "real",
                        "role": ["attribute"]})
        else:
            # the 'class_' column is the prediction target
            columns.append({"colIndex": i,
                        "colName": 'class_',
                        "colType": "real",
                        "role": ["suggestedTarget"]})

      "datasetID": dataset_id,
      "humanSubjectsResearch": False,
          "resID": "0",
          "resPath": 'tables/learningData.csv',
          "resType": "table",
          "resFormat": ["text/csv"],
          "isCollection": False,


    return dataset_id, filename, i1

def create_problem_json(mtype, folder, i1):

    if mtype == 'c':
        data = {
          "about": {
            "problemID": "%s_problem"%(folder),
            "problemName": "%s_problem"%(folder),
            "problemDescription": "not applicable",
            "taskType": "classification",
            "taskSubType": "multiClass",
            "problemVersion": "1.0",
            "problemSchemaVersion": "3.0"
          },
          "inputs": {
            "data": [
              {
                "datasetID": "%s"%(folder),
                "targets": [
                  {
                    "targetIndex": 0,
                    "resID": "0",
                    "colIndex": i1,
                    "colName": 'class_',
                  }
                ]
              }
            ],
            "dataSplits": {
              "method": "holdOut",
              "testSize": 0.2,
              "stratified": True,
              "numRepeats": 0,
              "randomSeed": 42,
              "splitsFile": "dataSplits.csv"
            },
            "performanceMetrics": [
              {
                "metric": "accuracy"
              }
            ]
          },
          "expectedOutputs": {
            "predictionsFile": "predictions.csv"
          }
        }

    elif mtype == 'r':
        data={"about": {
                "problemID": "%s_problem"%(folder),
                "problemName": "%s_problem"%(folder),
                "problemDescription": "not applicable",
                "taskType": "regression",
                "taskSubType": "univariate",
                "problemVersion": "1.0",
                "problemSchemaVersion": "3.0"
              },
              "inputs": {
                "data": [
                  {
                    "datasetID": "%s_dataset"%(folder),
                    "targets": [
                      {
                        "targetIndex": 0,
                        "resID": "0",
                        "colIndex": i1,
                        "colName": "class_"
                      }
                    ]
                  }
                ],
                "dataSplits": {
                  "method": "holdOut",
                  "testSize": 0.2,
                  "stratified": True,
                  "numRepeats": 0,
                  "randomSeed": 42,
                  "splitsFile": "dataSplits.csv"
                },
                "performanceMetrics": [
                  {
                    "metric": "meanSquaredError"
                  }
                ]
              },
              "expectedOutputs": {
                "predictionsFile": "predictions.csv"
              }
            }


Feel free to use this if it helps you with formatting the datasetDoc.json and problemDoc.json for a numerical array.
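For tables with many attribute columns, the column entries can be generated from the CSV header rather than typed by hand. The sketch below mirrors the loop in the code above. The helper names `build_columns` and `read_header`, the default target name, and treating every column as `real` are assumptions to adapt to your own data.

```python
import csv


def build_columns(colnames, target="class_", coltype="real"):
    """Build datasetDoc-style column entries from a list of column names."""
    columns = []
    for index, name in enumerate(colnames):
        # The target column gets the suggestedTarget role, all others are attributes
        role = "suggestedTarget" if name == target else "attribute"
        columns.append({
            "colIndex": index,
            "colName": name,
            "colType": coltype,
            "role": [role],
        })
    return columns


def read_header(csv_path):
    """Return the first row of a CSV file as a list of column names."""
    with open(csv_path, newline="") as csv_file:
        return next(csv.reader(csv_file))
```

Something like `build_columns(read_header("tables/learningData.csv"))` would then produce one entry per column, however many there are.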

MariumAZ commented 4 years ago

@jim-schwoebel thank you so much , I'll try it and let you know :)

micahjsmith commented 4 years ago

@jim-schwoebel @MariumAZ I made another approach to formatting a CSV file in the D3M format with subdirectories for splits that may be useful: