KhiopsML / khiops-python

The Python library of the Khiops AutoML suite
https://khiops.org
BSD 3-Clause Clear License
8 stars 1 forks source link

detect_format is not ignored when headerline and separator are provided #208

Open sgouache opened 3 months ago

sgouache commented 3 months ago

Description

In many API functions, the detect_format flag is available. It defaults to true, so users providing headerline and separator are sometimes unaware that is is set. Problem: when both detect_format and headerline + separator are set, the detection is run even if its results are ignored. For some datatables, the detection produces confusing warnings which causes users to wonder if the processing was done with the right separator. Example output: warning : "/usr/local/lib/python3.10/dist-packages/runner/khiops_runner.py:465: UserWarning: Khiops ended correctly but there were minor issues: Warnings in log: Line 2: warning : Database format detector s3://bkt-guporionis-backend-dev-ow-idh-fe/orionis/deployment/caller/2024/06/06/0800-1200/flows/*.csv : 1 other field separator is possible: '-' runner.run(args)".

Questions/Ideas

folmos-at-orange commented 3 months ago

I'm looking into it. @sgouache could you please run the task with trace=True (or --verbose in khrun). It is the dev version that you are using?.

tramora commented 2 months ago

A few general remarks (based on the latest 10.2.2.4 khiops-python release) regarding the so called format spec preprocessing...

Currently when you call the python api here are the results (they are consistent with the current docstring documentation) :

Input Output to khiops Notes
no input defect_format=True, header_line=True, field_separator='' Room for improvement here (for human readers) : set header_line to False as detect_format will take precedence
header_line=True defect_format=False, header_line=True, field_separator=''
field_separator=',' defect_format=False, header_line=True, field_separator=',' Room for improvement here, header_line should be left False as it does not correlate with the modified field
detect_format=True defect_format=True, header_line=False, field_separator=''
detect_format=True, field_separator=',' defect_format=False, header_line=False, field_separator=',' cf python api vs protobuf api discussion below

The suggested corrections are not mandatory as we rely eventually on khiops to produce a correct behavior (DetectFormat if set takes always precedence over the 2 other parameters).

python api vs protobuf api discussion

This matrix does not totally confirm your findings @sgouache especially on the mixted case. Input (detect_format=True, field_separator=',') produces a (defect_format=True, field_separator=',') for you.

It is because we follow 2 different execution flows depending on the context and you follow the 2nd one.


In order to keep having the same behaviour on both execution paths, we need to always use the same checking function (`_preprocess_task_arguments`). It will prevent that kind of surprises (we have already faced one for a "grouping and discretization methods for a non supervised training")

A simple fix in `khiops_api_base::khrun` would be :

```python
def write_api_call_scenario(api_call_pb, writer):
    """Writes the scenario for KhiopsApiCall protobuf spec

    .. note:
        It doesn't write the exit statement.

    Parameters
    ----------
    api_call_pb : KhiopsApiCall
        The protobuf class instance for the api call.
    writer : `.KhiopsOutputWriter`
        The output writer to which write the task scenario.
    """
    assert type(api_call_pb).__name__ == "KhiopsApiCall"
    api_call, api_call_args = _create_runner_api_call_args(api_call_pb)
    # PATCH >
    from khiops.core.api import _preprocess_task_arguments
    _preprocess_task_arguments(api_call_args)
    # < PATCH
    api_call.write_execution_scenario(writer, api_call_args)

Naturally @folmos-at-orange and @popescu-v need to confirm or correct all of this.

popescu-v commented 1 month ago

AFAIU, we need to factor out the input parameter checks and preprocessing from the _run_task function, into a public function that can be used directly in khrun.