Currently the Dataset class is an internal utility for the sklearn module. The idea is to make this class public so that it becomes a utility for creating multi-table datasets.
Questions/Ideas
This feature would ease many tasks:
Splitting a dataset in train/test
Sorting a whole dataset
Building core API parameters (notably additional_data_tables)
Simplifying tutorials and samples with a method such as get_dataset_sample("Accidents", type="pandas")
Main design element: A builder pattern.
Add mutator methods to construct the dataset from an empty one (Dataset())
(Partially) Implemented in prototype:
Classes
PandasDataset
FileDataset
add_table(self, name, source, key=None)
key is mandatory for multi-table datasets
source differs in each Dataset subclass
train_test_split (implemented in PandasDataset only)
sort: sorts the tables of the dataset by their keys (implemented in FileDataset only)
create_khiops_dictionary_domain
create_additional_data_table_param
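On a multi-table dataset, train_test_split cannot split each table independently: the secondary records have to follow their root record. The sketch below illustrates that idea on plain lists of dicts. It is not the prototype's code: the helper name split_multi_table, its parameters, and the sample tables are all hypothetical, and PandasDataset would presumably do the same with DataFrame operations.

```python
import random


def split_multi_table(root_rows, secondary_tables, key, test_size=0.25, seed=0):
    """Split root records, then route secondary records by key value.

    Hypothetical helper working on lists of dicts; not the prototype's code.
    """
    # Shuffle a copy of the root records so the caller's list is left untouched.
    shuffled = list(root_rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    test_root, train_root = shuffled[:n_test], shuffled[n_test:]

    def restrict(root_part):
        # Keep only the secondary records whose key appears in this root part.
        keys = {row[key] for row in root_part}
        return {
            name: [row for row in rows if row[key] in keys]
            for name, rows in secondary_tables.items()
        }

    return (train_root, restrict(train_root)), (test_root, restrict(test_root))


# Made-up sample data: 8 accidents with 2 vehicles each.
accidents = [{"AccidentId": i} for i in range(1, 9)]
vehicles = [
    {"AccidentId": i, "VehicleId": f"{i}-{j}"} for i in range(1, 9) for j in range(2)
]

(train_root, train_tables), (test_root, test_tables) = split_multi_table(
    accidents, {"Vehicles": vehicles}, key="AccidentId"
)
```

Splitting on the root table first keeps every accident together with all of its vehicles on the same side of the split.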
add_relation(self, parent_table_name, child_table_name, one_to_one=False)
remove_table(self, name)
remove_relation(self, parent_table_name, child_table_name)
check(self)
add_external_relation(self, parent_table_name, key, another_dataset)
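The mutators listed above could fit together roughly as follows. This is a minimal sketch of the builder idea, not the proposed implementation: the class name DatasetSketch, its internal dictionaries, the file names, and the choice to return self from each mutator (to allow chaining) are assumptions made for illustration. Only the method names and signatures come from the lists above.

```python
class DatasetSketch:
    """Illustrative stand-in for the proposed public Dataset class."""

    def __init__(self):
        self.tables = {}     # table name -> (source, key)
        self.relations = {}  # (parent name, child name) -> one_to_one flag

    def add_table(self, name, source, key=None):
        self.tables[name] = (source, key)
        return self  # returning self is an assumption that enables chaining

    def add_relation(self, parent_table_name, child_table_name, one_to_one=False):
        self.relations[(parent_table_name, child_table_name)] = one_to_one
        return self

    def remove_relation(self, parent_table_name, child_table_name):
        del self.relations[(parent_table_name, child_table_name)]
        return self

    def remove_table(self, name):
        del self.tables[name]
        # Drop the relations that mention the removed table.
        self.relations = {
            pair: flag for pair, flag in self.relations.items() if name not in pair
        }
        return self

    def check(self):
        """Raise ValueError if the dataset is inconsistent."""
        if len(self.tables) > 1:
            for name, (_, key) in self.tables.items():
                if key is None:
                    raise ValueError(f"table '{name}' needs a key in a multi-table dataset")
        for parent, child in self.relations:
            for name in (parent, child):
                if name not in self.tables:
                    raise ValueError(f"relation refers to unknown table '{name}'")


# Build a two-table dataset starting from an empty one (hypothetical file names).
ds = (
    DatasetSketch()
    .add_table("Accidents", "Accidents.txt", key="AccidentId")
    .add_table("Vehicles", "Vehicles.txt", key=["AccidentId", "VehicleId"])
    .add_relation("Accidents", "Vehicles")
)
ds.check()
```

Here check is called explicitly once the dataset is built. Calling it inside every mutator instead would reject intermediate states, for example a relation added before its child table.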
Design questions:
Should check be called at each mutator call?
Or should the user check the consistency before using it?
FileDataset:
train_predictor_ds(ds, target_variable_name, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)
deploy_model_ds(model_kdic, ds, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)