Catching a CrateDB fluke: `ShardCollectContext for {0,2} already added`

amotl commented 1 year ago

Report

While trying to bring in GH-52, we caught an unusual error from CrateDB we haven't seen before.

ShardCollectContext for 0 already added

It is happening on a DELETE FROM SQL statement.

DELETE FROM metrics;

Details

The DELETE FROM metrics is happening within a regular integration test scenario on the canonical setUp() method calling out to self.pruneTables(), in order to supply the test cases with a blank canvas. This is nothing special, we have been doing it like this for a while already, on behalf of different test suites we are maintaining. Also note it was really only a fluke: Re-running the test cases made them succeed on the first attempt already.

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6721484528/job/18267317481?pr=52#step:6:491

Thoughts

As we are using CrateDB nightly for all of our downstream integration tests, it may be a regression introduced just recently.

/cc @matriv, @seut

BaurzhanSakhariev commented 1 year ago

Hi, I don't have permissions to assign myself. I will look into this issue.

amotl commented 1 year ago

Thank you, I've just assigned you. Please note that it was only a single event, and it hasn't been observed ever after. Maybe a real SEU? So, you will have to evaluate if this is important / feels serious, or not.

If you decide that it may actually have been an SEU, or if you can't find anything suspicious, let's just close the issue again.

BaurzhanSakhariev commented 1 year ago

Maybe relates to https://github.com/crate/crate/issues/11677

amotl commented 1 year ago

Another spot on behalf of GH-64.

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6846915144/job/18614271079?pr=64#step:6:488

matriv commented 1 year ago

could it be that we need to have an ensureGreen() alternative for this one?

amotl commented 1 year ago

Thank you for watching this conversation, Marios.

Could it be that we need to have an ensureGreen() alternative for this one?

I am not sure what you are particularly referring to with ensureGreen(). Up until now, I only have been able to spot this flaw twice, starting with its first occurrance two weeks ago.

amotl commented 1 year ago

Just to add more context here, to improve the original report: The flaw is apparently not happening on CrateDB startup, but at regular runtime.

tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_pagination PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_returns_expected_results_with_large_experiment PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id FAILED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_name PASSED [ 87%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_start_time_alias PASSED [ 88%]

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6721484528/job/18267317481?pr=52#step:6:154

tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_pagination PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_returns_expected_results_with_large_experiment PASSED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id FAILED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_name PASSED [ 87%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_start_time_alias PASSED [ 88%]

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6846915144/job/18614271079?pr=64#step:6:488

matriv commented 1 year ago

I am not sure what you are particularly referring to with ensureGreen(). Up until now, I only have been able to spot this flaw twice, starting with its first occurrance two weeks ago.

I'm probably wrong, I was thinking to check that all shards of a table are allocated before proceeding with statements, but we use only one cratedb node, right?

amotl commented 1 year ago

The output I've shared above clearly indicates that it is happenening at the very same test case function test_search_runs_run_id. What might be particularly more interesting in this case, is the predecessor test case function, test_search_runs_returns_expected_results_with_large_experiment ^tsrrerwle.

This observation indicates that CrateDB might not always be ready to accept a DELETE FROM ... statement on a database table after inserting a large amount of data [^1] into it. Do you think we are missing an additional REFRESH TABLE ... statement before conducting the DELETE FROM ...?

We can try, and it may qualify as a workaround. However, you may also want to address this on behalf of CrateDB itself, when applicable.

[^1]: ... at least, of the shape/characteristics as performed by the test_search_runs_returns_expected_results_with_large_experiment test case function, which I haven't investigated yet.

amotl commented 1 year ago

[You are] using only one CrateDB node, right?

This is correct, the issue is about two flukes reported from pedantically observing CI runs where CrateDB Nightly is used purposely.

amotl commented 1 year ago

To add more information about chronology: It has first happened here at GH-52, on Wed, 01 Nov 2023, with this release [^1]:

version[5.6.0-SNAPSHOT], pid[1], build[827ccae/NA], OS[Linux/6.2.0-1015-azure/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.1+12-LTS]

The second spot was:

version[5.6.0-SNAPSHOT], pid[1], build[ca0a7b6/NA], OS[Linux/6.2.0-1015-azure/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.1+12-LTS]

[^1]: Unfortunately, we didn't schedule nightly checks on this repository, so if there indeed is a flaw, it has probably been introduced way earlier.

amotl commented 1 year ago

It happened again, this time on a CI run triggered by a PR submitted by Dependabot.

Origin

https://github.com/crate-workbench/mlflow-cratedb/pull/68

Spots

amotl commented 1 year ago

Now, it is about ShardCollectContext for 2 instead of ShardCollectContext for 0.

FAILED tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id - mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 2 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6985231736/job/19009218790?pr=69#step:6:663

amotl commented 10 months ago

Once more.

SQLParseException[ShardCollectContext for 0 already added] with DELETE FROM metrics.

-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/7375903029/job/20068020355?pr=81#step:6:907

amotl commented 10 months ago

Again...

amotl commented 9 months ago

That patch will increase CrateDB's heap size on CI, in order to explore whether the problem originates from low-memory situations.

GH-100

amotl commented 9 months ago

[...] whether the problem originates from low-memory situations.

Indeed. Decreasing heap size triggers the issue right away. In this spirit, creating a reproducer will be much easier.

docker run --rm -it --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 \
  --env=CRATE_HEAP_SIZE=256m \
  crate/crate:nightly -Cdiscovery.type=single-node

time pytest -vvv -k "test_search_runs_returns_expected_results_with_large_experiment or test_search_runs_run_id"

mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 0 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)

/cc @BaurzhanSakhariev, @jeeminso

amotl commented 9 months ago

The issue has been reported to the upstream crate/crate repository.

https://github.com/crate/crate/issues/15518

amotl commented 7 months ago

Coming from https://github.com/crate/crate/issues/15518#issuecomment-1943985644, and looking at recent nightly scheduled job executions of https://github.com/crate-workbench/mlflow-cratedb/actions, it looks like this problem has been mitigated. Therefore, I am closing this.

Thanks, @jeeminso.

crate / mlflow-cratedb