crate / crate-clients-tools

Clients, tools, and integrations for CrateDB.
https://crate.io/docs/clients/
Apache License 2.0

Problem running MindsDB with CrateDB #49

Open amotl opened 1 year ago

amotl commented 1 year ago

Hi there,

@hammerhead recently evaluated the "Forecasting Quarterly House Sales with MindsDB" tutorial with CrateDB and this Python driver (thanks!), so I would like to report on the outcome.

With kind regards, Andreas.

Report

An error is thrown while computing the model, and the model is set to failed. The same example works well when using PostgreSQL.

INFO:type_infer-6520:Analyzing a sample of 5
INFO:type_infer-6520:from a total population of 5, this is equivalent to 100.0% of your data.
INFO:type_infer-6520:Infering type for: saledate
INFO:type_infer-6520:Column saledate has data type categorical
INFO:type_infer-6520:Infering type for: ma
INFO:type_infer-6520:Column ma has data type binary
INFO:type_infer-6520:Infering type for: type
INFO:type_infer-6520:Column type has data type binary
INFO:type_infer-6520:Infering type for: bedrooms
INFO:type_infer-6520:Column bedrooms has data type binary
WARNING:type_infer-6520:Column saledate is an identifier of type "UUID"
WARNING:type_infer-6520:Column bedrooms is an identifier of type "No Information"
INFO:dataprep_ml-6520:Starting statistical analysis
INFO:dataprep_ml-6520:Dropping features: ['saledate']
Traceback (most recent call last):
  File "/path/to/mindsdb/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'saledate'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/libs/ml_exec_base.py", line 128, in learn_process
    ml_handler.create(target, df=training_data_df, args=problem_definition)
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/lightwood_handler.py", line 67, in create
    run_learn(
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/utilities/functions.py", line 58, in wrapper
    return func(*args, **kwargs)
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/functions.py", line 147, in run_learn
    run_generate(df, predictor_id, args)
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/utilities/functions.py", line 58, in wrapper
    return func(*args, **kwargs)
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/functions.py", line 58, in run_generate
    json_ai = lightwood.json_ai_from_problem(df, problem_definition)
  File "/path/to/mindsdb/lib/python3.8/site-packages/lightwood/api/high_level.py", line 74, in json_ai_from_problem
    stats = statistical_analysis(
  File "/path/to/mindsdb/lib/python3.8/site-packages/dataprep_ml/insights.py", line 61, in statistical_analysis
    df = cleaner(data, dtypes, args.get('pct_invalid', 0),
  File "/path/to/mindsdb/lib/python3.8/site-packages/dataprep_ml/cleaners.py", line 57, in cleaner
    data = clean_timeseries(data, timeseries_settings)
  File "/path/to/mindsdb/lib/python3.8/site-packages/dataprep_ml/cleaners.py", line 397, in clean_timeseries
    if pd.isna(row[tss['order_by']]):
  File "/path/to/mindsdb/lib/python3.8/site-packages/pandas/core/series.py", line 942, in __getitem__
    return self._get_value(key)
  File "/path/to/mindsdb/lib/python3.8/site-packages/pandas/core/series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "/path/to/mindsdb/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'saledate'

amotl commented 1 year ago

Disclaimer: I was just reading the traceback and quickly poked a bit at the code of MindsDB, nothing else yet.

State of the onion

Initially, in 2021, support for CrateDB was added in the form of a datasource module ^3 with https://github.com/mindsdb/datasources/pull/58.

Then, in 2022, another module supporting CrateDB was added with https://github.com/mindsdb/mindsdb/pull/2682, which is described and documented as the "CrateDB Integration" ^1, and implemented as a handler module ^handler-cratedb.

Traceback

Looking at the traceback, I can't see that this handler module is actually used. What is striking is that the lightwood_handler ^handler-lightwood is used instead.

Traceback (most recent call last):
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/libs/ml_exec_base.py", line 128, in learn_process
    ml_handler.create(target, df=training_data_df, args=problem_definition)
  File "/path/to/mindsdb/lib/python3.8/site-packages/mindsdb/integrations/handlers/lightwood_handler/lightwood_handler/lightwood_handler.py", line 67, in create
    run_learn(

Other than this, I am not sure whether the type inference subsystem is being misled here, and whether that actually causes the error.

INFO:type_infer-6520:Infering type for: saledate
INFO:type_infer-6520:Column saledate has data type categorical
INFO:type_infer-6520:Infering type for: ma
INFO:type_infer-6520:Column ma has data type binary
INFO:type_infer-6520:Infering type for: type
INFO:type_infer-6520:Column type has data type binary
INFO:type_infer-6520:Infering type for: bedrooms
INFO:type_infer-6520:Column bedrooms has data type binary
WARNING:type_infer-6520:Column saledate is an identifier of type "UUID"
WARNING:type_infer-6520:Column bedrooms is an identifier of type "No Information"
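
For illustration, the failure mode visible in the log and traceback can be sketched with plain pandas. This is only a minimal sketch under assumptions, not the actual MindsDB or dataprep_ml code: the sample values and the helper `has_missing_order_value` are made up, and the `drop` call merely stands in for the "Dropping features: ['saledate']" step. It shows that a row-wise lookup of the order-by column raises `KeyError: 'saledate'` once that column has been removed.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's training data; the values are made up.
df = pd.DataFrame({
    "saledate": ["2007-03-31", "2007-06-30", "2007-09-30"],
    "ma": [441854, 441854, 441854],
    "type": ["house", "house", "house"],
    "bedrooms": [2, 2, 2],
})

order_by = "saledate"

# Stand-in for the "Dropping features: ['saledate']" step seen in the log.
cleaned = df.drop(columns=["saledate"])


def has_missing_order_value(frame, column):
    """Row-wise check, similar in shape to `pd.isna(row[tss['order_by']])`."""
    for _, row in frame.iterrows():
        if pd.isna(row[column]):
            return True
    return False


try:
    has_missing_order_value(cleaned, order_by)
    error = None
except KeyError as exc:
    # The order-by column no longer exists in each row's index.
    error = exc

print(repr(error))
```

Running the same check against the unmodified `df` completes without an error, which suggests looking at why the column was dropped rather than at the lookup itself.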

hammerhead commented 1 year ago

I've raised it as an issue in the MindsDB repository.

Based on earlier feedback from the MindsDB team, the KeyError: 'saledate' is thrown when the data set is too small. After increasing the size of the data set, the error changes, as posted in the linked issue.
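
One way the size of the data set could matter is sketched below. This is purely an assumption for illustration: the function `looks_like_identifier` and its threshold are hypothetical and are not taken from type_infer. The idea is that in a sample of only 5 rows every date is distinct, so a uniqueness-based check cannot tell a date column apart from an identifier column, while a larger sample with repeated dates no longer looks like one.

```python
import pandas as pd


def looks_like_identifier(values, threshold=0.95):
    """Hypothetical heuristic: (nearly) all values distinct -> identifier-like."""
    series = pd.Series(values)
    return bool(series.nunique() / len(series) >= threshold)


# Five rows, each with a distinct date, as in the 5-row sample from the log.
small_sample = [
    "2007-03-31", "2007-06-30", "2007-09-30", "2007-12-31", "2008-03-31",
]

# The same dates repeated, e.g. once per combination of other columns.
larger_sample = small_sample * 4

print(looks_like_identifier(small_sample))
print(looks_like_identifier(larger_sample))
```

The sketch only addresses the identifier warning. It does not explain the different error that appears with a larger data set.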

amotl commented 6 months ago

Hi. The upstream issue has been closed as completed, but without any further information. Shall we re-evaluate the situation?

hammerhead commented 6 months ago

Hi,

I checked the commit history since the last time I looked at it, and I don't see any changes. The closing of the upstream issue probably has to be interpreted as "won't fix", since there are also no linked pull requests or anything else.

amotl commented 6 months ago

Thanks. Can you re-open the issue, so we can discuss it? I would like to use your insights to eventually submit a patch, if applicable. However, I haven't approached the topic yet, and have only tried to contribute my share by tracking it.

A concise minimal reproducer could help to get closer to the issue. The tutorial referenced above starts a bit too high-level for me, apparently expecting a completed setup already.

amotl commented 6 months ago

Dear @chandrevdw31,

thank you for submitting GH-118. Does that mean MindsDB works well together with CrateDB now?

As you can read from this discussion, we could not derive any clear outcome from https://github.com/mindsdb/mindsdb/issues/5483 ff., and have not yet re-evaluated the situation on our end.

With kind regards, Andreas.

chandrevdw31 commented 6 months ago

Hi @amotl

Thank you for raising this. I will investigate this further with the team and have this matter resolved.

Kind regards

amotl commented 5 months ago

Thank you very much, Chandre. Do you have any news to report about this matter?

/cc @hlcianfagna