epeters3 opened 4 years ago
When one-hot encoding primitives, how should we treat primitives that exist in a validation or test dataset but not in a training dataset?
This one is tricky. From an ML perspective, learning the behavior of a primitive is difficult. Making a prediction about unseen algorithms is beyond impossible. Thus any splitting should be stratified so that all primitives are represented in the meta-training data. From a software perspective, we should be able to handle unseen primitives without crashing.
In the past, I have pushed for stratified splits so we don't have to handle unseen primitives. @epeters3 Are there other considerations I should be thinking about?
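For concreteness, a minimal sketch of the "unknown primitive" fallback for one-hot encoding could look something like the following (function and variable names are hypothetical, not the repo's actual encoder):

import numpy as np

def fit_primitive_vocabulary(training_pipelines):
    # Collect the primitive names seen during meta-training; one extra slot is reserved for unseen primitives.
    primitives = sorted({p for pipeline in training_pipelines for p in pipeline})
    return {name: index for index, name in enumerate(primitives)}

def one_hot_primitive(name, vocabulary):
    # Primitives not in the training vocabulary all map to the final "unknown" slot instead of crashing.
    vector = np.zeros(len(vocabulary) + 1)
    vector[vocabulary.get(name, len(vocabulary))] = 1.0
    return vector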
How should we treat pipeline DAG structures that are seen in a validation or test dataset but not in the training set?
Which models cannot handle new DAG structures, besides PMF (see below)? All models should be able to do this. I think there was some effort to encode the DAG structure for linear regression. Is that happening?
How should we treat pipelines that exist in a validation or test dataset but not in the training set? The probabilistic_matrix_factorization model expects all pipeline IDs seen in the test dataset to also be in the training dataset. A simple solution here would again be to map all unseen pipelines to a single unknown pipeline.
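As a rough illustration of that mapping idea (names are hypothetical, not the model's actual interface), the remapping could happen before pipeline IDs are handed to the model:

UNKNOWN_PIPELINE_ID = '<unknown>'

def remap_pipeline_ids(pipeline_ids, training_pipeline_ids):
    # Any pipeline ID not seen during training collapses to a single shared "unknown" ID.
    known = set(training_pipeline_ids)
    return [pid if pid in known else UNKNOWN_PIPELINE_ID for pid in pipeline_ids]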
Probabilistic Matrix Factorization has not been fully supported. This type of model solves a slightly different problem. We can just leave this one out.
Thanks for the comments, Brandon.
Making a prediction about unseen algorithms is beyond impossible. Thus any splitting should be stratified so that all primitives are represented in the meta-training data.
There is a perhaps orthogonal argument to this. In production, the incoming data cannot always be controlled, and sometimes there is benefit in handling unseen values. But I agree about the stratification for these experiments; I think that's a good solution that I will look into doing.
Which models cannot handle new DAG structures, besides PMF (see below)? All models should be able to do this.
I am getting errors when running the daglstm_regression model:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 1126, in <module>
main(sys.argv)
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 1122, in main
handler(arguments, parser)
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 1065, in handler
evaluate_handler(arguments)
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 349, in evaluate_handler
handle_evaluate(model_config, arguments)
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 315, in handle_evaluate
model_output_dir=model_output_dir, plot_dir=plot_dir
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/__main__.py", line 243, in evaluate
test_data, model, model_config, verbose=verbose, model_output_dir=model_output_dir
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/problems.py", line 62, in predict
predictions = model_predict_method(data, verbose=verbose, **model_predict_config)
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/models/base_models.py", line 260, in predict_regression
predictions, targets = self._predict_epoch(data_loader, self._model, verbose=verbose)
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/models/base_models.py", line 193, in _predict_epoch
for x_batch, y_batch in data_loader:
File "/users/grads/epeter92/code/byu-dml/d3m-dynamic-neural-architecture/dna/data.py", line 552, in _iter
group_structure = self.pipeline_structures[group]
KeyError: 'inputs.0012334674910811512313'
I think there was some effort to encode the DAG structure for linear regression. Is that happening?
I don't know. Do you mean the linear regression baseline, or all the models that do the regression task of estimating pipeline performance directly?
I am getting errors when running the daglstm_regression model:
That's right. I think Erik added some functionality to encode the DAG structure, and I think that KeyError is coming from the encoding mechanism he created. For the very issue you are seeing, I don't think we should be encoding the structure.
Do you mean the linear regression baseline, or all the models that do the regression task
I mean the linear regression baseline.
Concerning changing the data splitting feature to support stratifying by primitives, I noticed the meta datasets are already being split by dataset, i.e. all pipeline runs for a certain dataset are kept together in their meta dataset. So, if there are certain primitives that are only used in a single dataset, that dataset would necessarily have to be in the training meta dataset.
if there are certain primitives that are only used in a single dataset, that dataset would necessarily have to be in the training meta dataset.
Great edge case! Is this hypothetical or do we have actual cases in the database? I imagine this would happen with niche tasks and corresponding niche primitives to support the task, but hopefully, there are at least 2 examples of such tasks.
I haven't checked to see, I'll look into it. First I need to think of a clean way to both group by dataset and make sure all primitives are in the training meta dataset.
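If it helps, a quick way to check could be something like this (a sketch assuming each pipeline run record exposes its dataset ID and its list of primitive names; the field names are hypothetical):

from collections import defaultdict

def single_dataset_primitives(pipeline_runs):
    # Map each primitive to the set of datasets it appears with, then keep those seen in only one dataset.
    datasets_by_primitive = defaultdict(set)
    for run in pipeline_runs:
        for primitive in run['primitives']:
            datasets_by_primitive[primitive].add(run['dataset_id'])
    return [p for p, datasets in datasets_by_primitive.items() if len(datasets) == 1]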
I need to think of a clean way to both group by dataset and make sure all primitives are in the training meta dataset
I think I implemented the splitting from scratch, but sklearn provides grouped splitting methods. Maybe explore their splitting methods to see if there is something that fits our needs.
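For example, sklearn's GroupShuffleSplit keeps all rows of a group (for us, all pipeline runs on a dataset) on one side of the split, though on its own it does not guarantee every primitive lands in the training half. A sketch with toy stand-in data:

from sklearn.model_selection import GroupShuffleSplit
import numpy as np

# Toy stand-ins: 8 pipeline runs across 4 datasets (real features/scores come from the meta dataset).
X = np.arange(16).reshape(8, 2)
y = np.linspace(0.1, 0.8, 8)
dataset_ids = np.array(['ds1', 'ds1', 'ds2', 'ds2', 'ds3', 'ds3', 'ds4', 'ds4'])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=dataset_ids))
# All runs from a given dataset end up entirely in train_idx or entirely in test_idx.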
It looks like they only have support for standard stratification (honoring the original distribution of some label in both resulting datasets), and for splitting by a single group label. I can think of a way to post process a training and test dataset, to swap groups until all primitives are found in the training dataset, but the split wouldn't be random anymore.
I believe having unseen values in your test dataset is fairly common. It seems like ensuring the training set has all primitives present would be ensuring our training split contains more useful information than it would under a random split. In a way we are giving ourselves a more favorable split than we should. Am I right? Or is having all primitives in the training set kosher?
Is there no support for grouped and stratified? Maybe this is a little too specific...
It seems like ensuring the training set has all primitives present would be ensuring our training split contains more useful information than it would under a random split.
Having novel primitives in the test set would demonstrate how well we can predict the performance of pipelines containing primitives we have never seen before. In the simplest (extreme) case, this is like trying to train a metamodel on decision tree scores and then trying to estimate support vector machine scores. We have no hope of being able to do this.
Creating a random split is also a way of simulating the independent, identically distributed assumption. If we have novel primitives in the test set, we will be trying to estimate the performance of pipelines that are distributed differently than those in the training data. While this may be interesting empirically, I don't yet see any theoretical reason to do it.
I see that this is proving to be difficult from an engineering perspective. I think I am more inclined to catch the novel primitive errors than trying to support novel primitives. What do you think of having models raise a custom error (e.g. UnknownPrimitiveError) and then recording that this error occurred in lieu of a score? Then when analyzing the data, we can report numbers on this type of failure.
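For what it's worth, a minimal sketch of that idea might look like this (the error name comes from the comment above; the wrapper and its arguments are hypothetical):

class UnknownPrimitiveError(Exception):
    # Raised when a pipeline contains a primitive the model never saw during training.
    pass

def score_or_record_failure(score_fn, test_data, failure_log):
    # Call the model's scoring function; on an unknown-primitive failure, log it and return None instead of a score.
    try:
        return score_fn(test_data)
    except UnknownPrimitiveError as error:
        failure_log.append(str(error))
        return None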
Ideally, we would use leave one group out cross-validation. In the case of niche tasks/primitives, we might still have novel primitives, as you noted above. In this case, our metamodels would not be able to give a confident estimate of performance. That said, maybe a poor estimate is better than no estimate, in which case we would need to implement the "unknown primitive" category as you have proposed.
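sklearn exposes this splitter directly; a small sketch, again with toy stand-ins for the meta dataset's features, scores, and per-run dataset IDs:

from sklearn.model_selection import LeaveOneGroupOut
import numpy as np

X = np.arange(12).reshape(6, 2)
y = np.array([0.5, 0.3, 0.8, 0.1, 0.6, 0.9])
dataset_ids = np.array(['ds1', 'ds1', 'ds2', 'ds2', 'ds3', 'ds3'])

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=dataset_ids):
    # Each fold holds out every pipeline run belonging to exactly one dataset.
    pass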
I like the IID discussion, thank you. I agree now that it is ok to include all the primitives in the training set.
It doesn't look like sklearn has something that would make this easy for us. I can think of ways to ensure the training data set has all the primitives, but no clear way to ensure the distributions we care about stay IID at the same time. One thing I could try: after the train/test split is made, randomly swap pipeline run groups between train and test, checking the "all primitives present" condition after each swap and stopping once all primitives are present in the training set. Since a single swap may not satisfy the condition, multiple iterations of swaps could be needed.
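Roughly, the post-processing could look something like this (a sketch assuming we can look up the set of primitives used by each group of pipeline runs; it only moves groups from test to train, whereas a true swap would also move one group back to preserve split sizes):

def ensure_primitive_coverage(train_groups, test_groups, primitives_by_group):
    # Move whole pipeline-run groups from test to train until the training groups cover every primitive.
    # primitives_by_group maps each group (dataset) to the set of primitive names used in its runs.
    train, test = list(train_groups), list(test_groups)
    all_primitives = set().union(*primitives_by_group.values())
    while True:
        covered = set().union(*(primitives_by_group[g] for g in train))
        missing = all_primitives - covered
        if not missing:
            return train, test
        # Pick any test group that supplies at least one missing primitive and move it into train.
        donor = next(g for g in test if primitives_by_group[g] & missing)
        test.remove(donor)
        train.append(donor)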
Per our team meeting last week, we've decided to go with the approach of ensuring the training set always has all the primitives present in it.
The Issue
We have altered dna/database_to_json.py to now pull its data from a local mirror of the D3M Metalearning Database. The data in that database has a little more variety than the data this repo has been working with, in terms of pipeline DAGs and primitives used. Using D3M data raises a couple of issues for this repo; among them, the probabilistic_matrix_factorization model expects all pipeline IDs seen in the test dataset to also be in the training dataset. A simple solution here would again be to map all unseen pipelines to a single unknown pipeline.
@bjschoenfeld, would you be willing to share your thoughts on these bullet points?
Progress Made
I consider this repo able to fully support D3M data when all these models can train and score successfully on the data (using the dna evaluate command): probabilistic_matrix_factorization