Expose Dataset API

Description

Currently, the Dataset class is an internal utility of the sklearn module. The idea is to make this class public so that it can serve as a utility for creating multi-table datasets.

Questions/Ideas

This feature would ease many tasks:

- Splitting a dataset into train and test sets
- Sorting a whole dataset
- Building core API parameters (notably additional_data_tables)
- Simplifying tutorials and samples with a method such as get_dataset_sample("Accidents", type="pandas")
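
To illustrate why splitting is easier with a dataset-level object, here is a minimal sketch of a key-consistent train/test split over several tables. It is not the prototype's train_test_split: the helper name split_multi_table, the plain-dict row representation and the AccidentId / Vehicles names are assumptions made for the example, and it uses only the standard library instead of pandas.

```python
import random


def split_multi_table(main_rows, secondary_tables, key, test_ratio=0.25, seed=0):
    """Split a main table and its secondary tables consistently by key.

    Hypothetical helper: rows are plain dicts and every table carries `key`.
    """
    # Draw the test keys from the main table only.
    keys = sorted({row[key] for row in main_rows})
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_test = int(round(len(keys) * test_ratio))
    test_keys = set(keys[:n_test])

    def select(rows, in_test):
        # Keep the rows whose key falls on the requested side of the split.
        return [row for row in rows if (row[key] in test_keys) == in_test]

    train = {"main": select(main_rows, False)}
    test = {"main": select(main_rows, True)}
    # Secondary rows follow the side chosen for their parent key.
    for name, rows in secondary_tables.items():
        train[name] = select(rows, False)
        test[name] = select(rows, True)
    return train, test
```

The point of the sketch is that the split is decided once, on the main table keys, and every secondary table follows it, so no child record ends up on the other side from its parent.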

Main design element: A builder pattern.

- Add mutator methods to construct the dataset from an empty one (Dataset())
- (Partially) implemented in the prototype:
  - Classes:
    - PandasDataset
    - FileDataset
  - add_table(self, name, source, key=None)
    - key is mandatory for multi-table datasets
    - source differs in each Dataset subclass
  - train_test_split (implemented in PandasDataset only)
  - sort: sorts the tables of the dataset by their keys (implemented in FileDataset only)
  - create_khiops_dictionary_domain
  - create_additional_data_table_param
  - add_relation(self, parent_table_name, child_table_name, one_to_one=False)
- Not in the prototype:
  - remove_table(self, name)
    - Removes all relations associated with the table
  - remove_relation(self, parent_table_name, child_table_name)
  - check(self)
    - Raises warnings and exceptions
      - Errors:
        - Non-existent table names
        - No main table set in multi-table datasets
        - No key set in multi-table datasets
      - Warnings:
        - Dangling tables
  - add_external_relation(self, parent_table_name, key, another_dataset)
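
The mutator list above could be sketched roughly as follows. This is only an illustration of the builder idea, not the prototype code. The method signatures come from the list, but the internals are assumptions: how tables and relations are stored, the rule that the first table added becomes the main one, and the reading of a "dangling" table as one that is neither the main table nor part of any relation.

```python
import warnings


class Dataset:
    """Hypothetical builder-style dataset; not the khiops-python implementation."""

    def __init__(self):
        self.tables = {}       # name -> (source, key)
        self.relations = []    # (parent_table_name, child_table_name, one_to_one)
        self.main_table = None

    def add_table(self, name, source, key=None):
        if self.main_table is None:
            self.main_table = name  # assumption: first table added is the main one
        self.tables[name] = (source, key)
        return self  # returning self lets the mutator calls be chained

    def add_relation(self, parent_table_name, child_table_name, one_to_one=False):
        self.relations.append((parent_table_name, child_table_name, one_to_one))
        return self

    def remove_relation(self, parent_table_name, child_table_name):
        pair = (parent_table_name, child_table_name)
        self.relations = [r for r in self.relations if r[:2] != pair]
        return self

    def remove_table(self, name):
        del self.tables[name]
        # Removes all relations associated with the table.
        self.relations = [r for r in self.relations if name not in r[:2]]
        if self.main_table == name:
            self.main_table = None
        return self

    def check(self):
        # Errors: relations that point to non-existent tables.
        for parent, child, _ in self.relations:
            for name in (parent, child):
                if name not in self.tables:
                    raise ValueError(f"Non-existent table name: {name!r}")
        if len(self.tables) > 1:
            # Errors specific to multi-table datasets.
            if self.main_table is None:
                raise ValueError("No main table set in multi-table dataset")
            for name, (_, key) in self.tables.items():
                if key is None:
                    raise ValueError(f"No key set for table {name!r}")
            # Warnings: tables that no relation reaches.
            related = {n for r in self.relations for n in r[:2]}
            for name in self.tables:
                if name != self.main_table and name not in related:
                    warnings.warn(f"Dangling table: {name!r}")
        return self
```

With this shape, a two-table dataset would be built as Dataset().add_table("Accidents", "accidents.csv", key="AccidentId").add_table("Vehicles", "vehicles.csv", key="AccidentId").add_relation("Accidents", "Vehicles").check(), where the file names and the key are placeholders.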

Design questions:

- Immediate consistency checks:
  - That is, should check be called at each mutator call?
    - I'm inclined toward this option, since the target audience is not only developers
  - Or should the user check the consistency before using the dataset?
- Should we accept mono-table datasets?
  - This adds many edge cases
- What about helper functions using the FileDataset?
  - train_predictor_ds(ds, target_variable_name, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)
  - deploy_model_ds(model_kdic, ds, output_dir, <kwargs without additional_data_tables, header_line, field_separator>)
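
For the first design question (immediate versus deferred checks), the two policies can be compared with a small sketch. Everything here is hypothetical: the class CheckedDataset, its immediate flag and the simplified add_table signature are made up for the illustration, and the only check performed is the "non-existent table names" error.

```python
class CheckedDataset:
    """Hypothetical sketch of the two checking policies; not khiops-python code."""

    def __init__(self, immediate=True):
        self.immediate = immediate
        self.tables = {}      # name -> key
        self.relations = []   # (parent_table_name, child_table_name)

    def check(self):
        # Only one rule here: relations must point to existing tables.
        for parent, child in self.relations:
            for name in (parent, child):
                if name not in self.tables:
                    raise ValueError(f"Non-existent table name: {name!r}")

    def add_table(self, name, key=None):
        self.tables[name] = key
        if self.immediate:
            self.check()
        return self

    def add_relation(self, parent_table_name, child_table_name):
        self.relations.append((parent_table_name, child_table_name))
        if self.immediate:
            try:
                self.check()
            except ValueError:
                # Roll back so a rejected call leaves the dataset unchanged.
                self.relations.pop()
                raise
        return self
```

In immediate mode the error is raised at the call that caused it, which is friendlier for non-developers, but the tables must be added before the relations that use them. In deferred mode any construction order works, and the user is responsible for calling check before using the dataset.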

KhiopsML / khiops-python

Expose Dataset API #158
