crate / cratedb-examples

A collection of clear and concise examples how to work with CrateDB.
Apache License 2.0

TSML: Error in `timeseries-anomaly-detection.ipynb` #426

Closed by amotl 2 months ago

amotl commented 2 months ago

Problem

The timeseries-anomaly-detection.ipynb notebook errors out on both Python 3.10 and 3.11.

ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by SimpleImputer.

Observations

Because it happens on both versions of Python, it is most probably unrelated to the particular change on which it started tripping.

Thoughts

Most probably another dependency flaw?

amotl commented 2 months ago

Observations

scikit-learn 1.4.2 was released on Apr 9, 2024. Is it related?

-- https://pypi.org/project/scikit-learn/1.4.2/#history

Thoughts

If it is, the corresponding CI job most probably did not fail more prominently before, on the nightly runs that validate functionality, because dependencies are configured to be cached as long as the local requirements files do not change.

In this case, the nightly CI jobs do not catch updates to transitive dependencies that are not enumerated locally, and thus do not live up to their promise of giving you constant peace of mind in "on stage" situations. In this spirit, what is reflected on the Build Status page might not convey the whole truth, and I am sad about it.

/cc @marijaselakovic, @ckurze, @hammerhead, @simonprickett

amotl commented 2 months ago

I am able to confirm this error on my workstation, using Python 3.11.

source .venv/bin/activate
pip install --upgrade scikit-learn
cd topic/timeseries
pytest -k timeseries-anomaly-detection.ipynb
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by SimpleImputer.

However, I am also seeing a second error, shown below, which might actually be a follow-up error.

ProgrammingError: (crate.client.exceptions.ProgrammingError) RelationAlreadyExists[Relation 'notebook.machine_data' already exists.]
[SQL: CREATE TABLE machine_data ("timestamp" TIMESTAMP, "value" DOUBLE PRECISION)]

amotl commented 2 months ago

As part of GH-425, the RelationAlreadyExists error has been fixed with fdb91dd703. However, despite downgrading scikit-learn using d244f4345b6, the array shape error is still there, though now only on Python 3.10, and only on CI. On my workstation, the software tests also succeed using Python 3.10.13.

-- https://github.com/crate/cratedb-examples/actions/runs/8744363838/job/23997048918?pr=425#step:6:951

amotl commented 2 months ago

Taking a closer look, ValueError: Found array with 0 sample(s) may also indicate that the problem is related to CrateDB's eventual consistency, so ab42144174b adds a relevant REFRESH TABLE "tablename"; SQL statement in order to synchronize writes.

amotl commented 2 months ago

Indeed, the cause was apparently the missing REFRESH TABLE statement: writes had not been synchronized, so the results were not visible to subsequent queries. Apparently, it is not related to scikit-learn 1.4.2 at all. GH-425 will improve the situation. d244f4345b62 has been removed again.
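
For illustration, the write-then-read pattern discussed here can be sketched as follows. This is not the code from the notebook or from GH-425, only a minimal sketch: the helper name write_then_read is hypothetical, and the cursor is assumed to be a DB API 2.0 cursor connected to CrateDB, for example one created by the crate Python client. The table and column names follow the CREATE TABLE statement quoted earlier in this thread.

```python
def write_then_read(cursor, table, rows):
    """Insert rows, refresh the table, then read the rows back.

    Hypothetical helper for illustration. `cursor` is assumed to be a
    DB API 2.0 cursor connected to CrateDB, using the "qmark" (?)
    parameter style.
    """
    # Identifiers cannot be bound as parameters, so quote the name here.
    quoted = '"' + table.replace('"', '""') + '"'

    cursor.executemany(
        f'INSERT INTO {quoted} ("timestamp", "value") VALUES (?, ?)',
        rows,
    )

    # Make the freshly written rows visible to subsequent queries.
    # Without this, a SELECT issued right away may return zero rows,
    # which downstream code can surface as "Found array with 0 sample(s)".
    cursor.execute(f"REFRESH TABLE {quoted}")

    cursor.execute(
        f'SELECT "timestamp", "value" FROM {quoted} ORDER BY "timestamp"'
    )
    return cursor.fetchall()
```

With a running CrateDB instance, the cursor could come from crate.client.connect(...).cursor(); the connection URL depends on the local setup. Issuing REFRESH TABLE once after a batch of writes, rather than after every single insert, is probably the less expensive choice.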