byu-dml / d3m-dynamic-neural-architecture


Support D3M DB Meta Datasets #205

Open epeters3 opened 4 years ago

epeters3 commented 4 years ago

The Issue

We have altered dna/database_to_json.py to now pull its data from a local mirror of the D3M Metalearning Database. The data in that database has more variety, in terms of pipeline DAGs and the primitives used, than the data this repo has been working with. Using D3M data raises a few issues for this repo:

- When one-hot encoding primitives, how should we treat primitives that exist in a validation or test dataset but not in a training dataset?
- How should we treat pipeline DAG structures that are seen in a validation or test dataset but not in the training set?
- How should we treat pipelines that exist in a validation or test dataset but not in the training set? The probabilistic_matrix_factorization model expects all pipeline IDs seen in the test dataset to also be in the training dataset. A simple solution here would be to map all unseen pipelines to a single unknown pipeline.

@bjschoenfeld, would you be willing to share your thoughts on these bullet points?

Progress Made

I consider this repo able to fully support D3M data when all these models can train and score successfully on the data (using the dna evaluate command):

bjschoenfeld commented 4 years ago

When one-hot encoding primitives, how should we treat primitives that exist in a validation or test dataset but not in a training dataset?

This one is tricky. From an ML perspective, learning the behavior of a primitive is difficult. Making a prediction about unseen algorithms is beyond impossible. Thus any splitting should be stratified so that all primitives are represented in the meta-training data. From a software perspective, we should be able to handle unseen primitives without crashing.

In the past, I have pushed for stratified splits so we don't have to handle unseen primitives. @epeters3 Are there other considerations I should be thinking about?
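
On the software side, a hedged sketch of one way to avoid crashing on unseen primitives, using a plain dict-based encoder rather than this repo's actual encoding code: reserve an explicit "unknown" column that any primitive missing from the meta-training vocabulary folds into.

from typing import Dict, List

import numpy as np


def build_primitive_index(training_primitives: List[str]) -> Dict[str, int]:
    # One column per primitive seen during meta-training, plus a reserved
    # column for anything unseen at validation/test time.
    index = {name: i for i, name in enumerate(sorted(set(training_primitives)))}
    index["<UNKNOWN_PRIMITIVE>"] = len(index)
    return index


def one_hot_encode(primitive_name: str, index: Dict[str, int]) -> np.ndarray:
    # Fold unseen primitive names into the unknown column instead of raising.
    vector = np.zeros(len(index), dtype=np.float32)
    vector[index.get(primitive_name, index["<UNKNOWN_PRIMITIVE>"])] = 1.0
    return vector


# Illustrative primitive names, not actual database contents.
index = build_primitive_index(["primitives.imputer", "primitives.random_forest"])
print(one_hot_encode("primitives.gradient_boosting", index))

Whether an "unknown" vector is actually useful to the models is a separate question; this only keeps the code from failing.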

How should we treat pipeline DAG structures that are seen in a validation or test dataset but not in the training set?

Which models cannot handle new DAG structures, besides PMF (see below)? All models should be able to do this. I think there was some effort to encode the DAG structure for linear regression. Is that happening?

How should we treat pipelines that exist in a validation or test dataset but not in the training set? The probabilistic_matrix_factorization model expects all pipeline IDs seen in the test dataset to also be in the training dataset. A simple solution here would again be to map all unseen pipelines to a single unknown pipeline.

Probabilistic Matrix Factorization has not been fully supported. This type of model solves a slightly different problem. We can just leave this one out.

epeters3 commented 4 years ago

Thanks for the comments Brandon.

Making a prediction about unseen algorithms is beyond impossible. Thus any splitting should be stratified so that all primitives are represented in the meta-training data.

There is a perhaps orthogonal argument here: in production, the incoming data cannot always be controlled, and sometimes there is benefit in handling unseen values. But I agree about the stratification for these experiments; I think that's a good solution, and I will look into doing it.

Which models cannot handle new DAG structures, besides PMF (see below)? All models should be able to do this.

I am getting errors when running the daglstm_regression model:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 1126, in <module>
    main(sys.argv)
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 1122, in main
    handler(arguments, parser)
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 1065, in handler
    evaluate_handler(arguments)
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 349, in evaluate_handler
    handle_evaluate(model_config, arguments)
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 315, in handle_evaluate
    model_output_dir=model_output_dir, plot_dir=plot_dir
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 243, in evaluate
    test_data, model, model_config, verbose=verbose, model_output_dir=model_output_dir
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/problems.py", line 62, in predict
    predictions = model_predict_method(data, verbose=verbose, **model_predict_config)
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/models/base_models.py", line 260, in predict_regression
    predictions, targets = self._predict_epoch(data_loader, self._model, verbose=verbose)
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/models/base_models.py", line 193, in _predict_epoch
    for x_batch, y_batch in data_loader:
  File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/data.py", line 552, in _iter
    group_structure = self.pipeline_structures[group]
KeyError: 'inputs.0012334674910811512313'
epeters3 commented 4 years ago

I think there was some effort to encode the DAG structure for linear regression. Is that happening?

I don't know. Do you mean the linear regression baseline, or all the models that do the regression task of estimating pipeline performance directly?

bjschoenfeld commented 4 years ago

I am getting errors when running the daglstm_regression model:

That's right; I think Erik added some functionality to encode the DAG structure, and that KeyError, I believe, is coming from the encoding mechanism he created. Given the very issue you are seeing, I don't think we should be encoding the structure.

Do you mean the linear regression baseline, or all the models that do the regression task

I mean the linear regression baseline.

epeters3 commented 4 years ago

Concerning changing the data splitting feature to support stratifying by primitives: I noticed the meta datasets are already being split by dataset, i.e. all pipeline runs for a given dataset are kept together in the same meta dataset. So if there are certain primitives that are only used in a single dataset, that dataset would necessarily have to be in the training meta dataset.

bjschoenfeld commented 4 years ago

if there are certain primitives that are only used in a single dataset, that dataset would necessarily have to be in the training meta dataset.

Great edge case! Is this hypothetical, or do we have actual cases in the database? I imagine this would happen with niche tasks and corresponding niche primitives to support them, but hopefully there are at least two examples of such tasks.

epeters3 commented 4 years ago

I haven't checked yet; I'll look into it. First I need to think of a clean way to both group by dataset and make sure all primitives are in the training meta dataset.

bjschoenfeld commented 4 years ago

I need to think of a clean way to both group by dataset and make sure all primitives are in the training meta dataset

I think I implemented the splitting from scratch, but sklearn provides grouped splitting methods. Maybe explore those to see if there is something that fits our needs.
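
For reference, a minimal sketch of sklearn's grouped splitting (GroupShuffleSplit here), where the group label is the dataset each pipeline run was executed on; the toy arrays below are illustrative, not real metadata.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

pipeline_run_ids = np.arange(10)                         # stand-ins for pipeline run records
dataset_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])   # group label: which dataset each run used

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(pipeline_run_ids, groups=dataset_ids))

# Every dataset's runs land entirely in train or entirely in test.
print(dataset_ids[train_idx], dataset_ids[test_idx])

GroupKFold and LeaveOneGroupOut follow the same groups= interface.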

epeters3 commented 4 years ago

It looks like they only have support for standard stratification (honoring the original distribution of some label in both resulting datasets) and for splitting by a single group label. I can think of a way to post-process a training and test dataset by swapping groups until all primitives are found in the training dataset, but the split wouldn't be random anymore.

epeters3 commented 4 years ago

I believe having unseen values in your test dataset is fairly common. It seems like ensuring the training set contains all primitives would give our training split more useful information than a random split would have. In a way, we are giving ourselves a more favorable split than we should. Am I right? Or is having all primitives in the training set kosher?

bjschoenfeld commented 4 years ago

Is there no support for a split that is both grouped and stratified? Maybe this is a little too specific...

It seems like ensuring the training set contains all primitives would give our training split more useful information than a random split would have.

Having novel primitives in the test set would demonstrate how well we can predict the performance of pipelines containing primitives we have never seen before. In the simplest (extreme) case, this is like trying to train a metamodel on decision tree scores and then trying to estimate support vector machine scores. We have no hope of being able to do this.

Creating a random split is also a way of simulating the independent, identically distributed assumption. If we have novel primitives in the test set, we will be trying to estimate the performance of pipelines that are distributed differently than those in the training data. While this may be interesting empirically, I don't yet see any theoretical reason to do it.

I see that this is proving to be difficult from an engineering perspective. I think I am more inclined to catch the novel primitive errors than trying to support novel primitives. What do you think of having models raise a custom error (e.g. UnknownPrimitiveError) and then recording that this error occurred in lieu of a score? Then when analyzing the data, we can report numbers on this type of failure.
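
A hedged sketch of that failure-recording pattern, with toy pipelines and a placeholder scoring function standing in for the real models (none of these names come from this repo):

class UnknownPrimitiveError(Exception):
    """Raised when a test pipeline uses a primitive unseen during meta-training."""


def predict_score(primitives):
    # Placeholder for the trained metamodel's prediction call.
    return 0.5


known_primitives = {"imputer", "random_forest"}
test_pipelines = [
    {"id": "a", "primitives": ["imputer", "random_forest"]},
    {"id": "b", "primitives": ["imputer", "svm"]},  # "svm" was never seen in training
]

results = []
for pipeline in test_pipelines:
    try:
        novel = [p for p in pipeline["primitives"] if p not in known_primitives]
        if novel:
            raise UnknownPrimitiveError(f"unseen primitives: {novel}")
        results.append({"pipeline_id": pipeline["id"], "score": predict_score(pipeline["primitives"])})
    except UnknownPrimitiveError as error:
        # Record the failure in lieu of a score so it can be counted in the analysis.
        results.append({"pipeline_id": pipeline["id"], "error": str(error)})

print(results)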

bjschoenfeld commented 4 years ago

Ideally, we would use leave-one-group-out cross-validation. In the case of niche tasks/primitives, we might still have novel primitives, as you noted above. In this case, our metamodels would not be able to give a confident estimate of performance. That said, maybe a poor estimate is better than no estimate, in which case we would need to implement the "unknown primitive" category as you have proposed.
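
A minimal sketch of leave-one-dataset-out cross-validation with sklearn's LeaveOneGroupOut, plus a check for whether the held-out dataset introduces novel primitives (the per-run primitive lists are toy data):

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

run_primitives = [
    ["imputer", "random_forest"],
    ["imputer", "random_forest"],
    ["imputer", "svm"],            # "svm" only ever appears on dataset 1
    ["imputer", "random_forest"],
]
dataset_ids = np.array([0, 0, 1, 2])

for train_idx, test_idx in LeaveOneGroupOut().split(run_primitives, groups=dataset_ids):
    train_prims = {p for i in train_idx for p in run_primitives[i]}
    test_prims = {p for i in test_idx for p in run_primitives[i]}
    print("held-out dataset:", dataset_ids[test_idx][0],
          "novel primitives:", test_prims - train_prims or "none")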

epeters3 commented 4 years ago

I like the IID discussion, thank you. I agree now that it is ok to include all the primitives in the training set.

It doesn't look like sklearn has something that would make this easy for us. I can think of ways to ensure the training dataset has all the primitives, but no clear way to also keep the split random so the distributions we care about stay close to IID. One thing I could try: after the train/test split is made, randomly swap pipeline run groups between train and test, checking the "all primitives present" condition after each swap and stopping once all primitives are present in the training set. Since a single swap may not satisfy the condition, multiple iterations of swaps may be needed.
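
A hedged sketch of that swap idea, with toy dataset groups and primitive sets standing in for real pipeline run metadata (none of these names come from this repo):

import random


def all_primitives_covered(train_groups, primitives_by_group):
    # True when the training groups collectively contain every primitive.
    train_primitives = {p for g in train_groups for p in primitives_by_group[g]}
    all_primitives = {p for prims in primitives_by_group.values() for p in prims}
    return train_primitives == all_primitives


def swap_until_covered(train_groups, test_groups, primitives_by_group, max_swaps=1000, seed=0):
    rng = random.Random(seed)
    train, test = list(train_groups), list(test_groups)
    for _ in range(max_swaps):
        if all_primitives_covered(train, primitives_by_group):
            return train, test
        # Swap a random training group with a random test group and re-check.
        i, j = rng.randrange(len(train)), rng.randrange(len(test))
        train[i], test[j] = test[j], train[i]
    raise RuntimeError("could not cover all primitives within max_swaps")


# Toy example: dataset "d3" is the only group that uses "svm".
primitives_by_group = {
    "d1": {"imputer", "random_forest"},
    "d2": {"imputer", "random_forest"},
    "d3": {"imputer", "svm"},
    "d4": {"imputer"},
}
print(swap_until_covered(["d1", "d2"], ["d3", "d4"], primitives_by_group))

The repeated random swaps trade away some of the randomness of the original split, as noted above, but they do preserve the grouping by dataset.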

epeters3 commented 4 years ago

Per our team meeting last week, we've decided to go with the approach of ensuring the training set always has all the primitives present in it.