Open sgouache opened 3 months ago
I'm looking into it. @sgouache could you please run the task with trace=True
(or --verbose
in khrun
). It is the dev
version that you are using?.
A few general remarks (based on the latest 10.2.2.4 khiops-python release) regarding the so called format spec preprocessing...
Currently when you call the python api here are the results (they are consistent with the current docstring documentation) :
Input | Output to khiops | Notes |
---|---|---|
no input | defect_format=True, header_line=True, field_separator='' | Room for improvement here (for human readers) : set header_line to False as detect_format will take precedence |
header_line=True | defect_format=False, header_line=True, field_separator='' | |
field_separator=',' | defect_format=False, header_line=True, field_separator=',' | Room for improvement here, header_line should be left False as it does not correlate with the modified field |
detect_format=True | defect_format=True, header_line=False, field_separator='' | |
detect_format=True, field_separator=',' | defect_format=False, header_line=False, field_separator=',' | cf python api vs protobuf api discussion below |
The suggested corrections are not mandatory as we rely eventually on khiops to produce a correct behavior (DetectFormat
if set takes always precedence over the 2 other parameters).
This matrix does not totally confirm your findings @sgouache especially on the mixted case. Input (detect_format=True, field_separator=',') produces a (defect_format=True, field_separator=',') for you.
It is because we follow 2 different execution flows depending on the context and you follow the 2nd one.
pure api python
ex. train_predictor function ...
-> _preprocess_task_arguments ...
-> _preprocess_format_spec ...
...
-> runner::_write_task_scenario_file (for khiops)
protobuf api (used in the python khiops runner context)
ex. train_predictor protobuf task ...
-> khiops_api_base::khrun::write_scenario
-> loop on khiops_api_base::khrun::write_api_call_scenario
<BIG GAP here regarding the tasks args sanitization, on rely on the default values and the scenario template>
-> khiops_api_base::khrun::write_execution_scenario
In order to keep having the same behaviour on both execution paths, we need to always use the same checking function (`_preprocess_task_arguments`). It will prevent that kind of surprises (we have already faced one for a "grouping and discretization methods for a non supervised training")
A simple fix in `khiops_api_base::khrun` would be :
```python
def write_api_call_scenario(api_call_pb, writer):
"""Writes the scenario for KhiopsApiCall protobuf spec
.. note:
It doesn't write the exit statement.
Parameters
----------
api_call_pb : KhiopsApiCall
The protobuf class instance for the api call.
writer : `.KhiopsOutputWriter`
The output writer to which write the task scenario.
"""
assert type(api_call_pb).__name__ == "KhiopsApiCall"
api_call, api_call_args = _create_runner_api_call_args(api_call_pb)
# PATCH >
from khiops.core.api import _preprocess_task_arguments
_preprocess_task_arguments(api_call_args)
# < PATCH
api_call.write_execution_scenario(writer, api_call_args)
Naturally @folmos-at-orange and @popescu-v need to confirm or correct all of this.
AFAIU, we need to factor out the input parameter checks and preprocessing from the _run_task
function, into a public function that can be used directly in khrun
.
Description
In many API functions, the detect_format flag is available. It defaults to true, so users providing headerline and separator are sometimes unaware that is is set. Problem: when both detect_format and headerline + separator are set, the detection is run even if its results are ignored. For some datatables, the detection produces confusing warnings which causes users to wonder if the processing was done with the right separator. Example output: warning : "/usr/local/lib/python3.10/dist-packages/runner/khiops_runner.py:465: UserWarning: Khiops ended correctly but there were minor issues: Warnings in log: Line 2: warning : Database format detector s3://bkt-guporionis-backend-dev-ow-idh-fe/orionis/deployment/caller/2024/06/06/0800-1200/flows/*.csv : 1 other field separator is possible: '-' runner.run(args)".
Questions/Ideas