Hi, I don't have permissions to assign myself. I will look into this issue.
Thank you, I've just assigned you. Please note that it was only a single event, and it has not been observed since. Maybe a real SEU? So, you will have to evaluate whether this is important / feels serious, or not.
If you decide that it may actually have been an SEU, or if you can't find anything suspicious, let's just close the issue again.
Maybe relates to https://github.com/crate/crate/issues/11677
Another sighting, on behalf of GH-64.
Could it be that we need to have an ensureGreen() alternative for this one?
Thank you for watching this conversation, Marios.
Could it be that we need to have an ensureGreen() alternative for this one?
I am not sure what you are particularly referring to with ensureGreen(). Up until now, I have only been able to spot this flaw twice, starting with its first occurrence two weeks ago.
Just to add more context here, to improve the original report: The flaw is apparently not happening on CrateDB startup, but at regular runtime.
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_pagination PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_returns_expected_results_with_large_experiment PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id FAILED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_name PASSED [ 87%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_start_time_alias PASSED [ 88%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_pagination PASSED [ 85%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_returns_expected_results_with_large_experiment PASSED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id FAILED [ 86%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_name PASSED [ 87%]
tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_start_time_alias PASSED [ 88%]
I am not sure what you are particularly referring to with ensureGreen(). Up until now, I have only been able to spot this flaw twice, starting with its first occurrence two weeks ago.
I'm probably wrong; I was thinking of checking that all shards of a table are allocated before proceeding with statements. But we use only one CrateDB node, right?
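For illustration, an ensureGreen()-style guard could be approximated by polling CrateDB's sys.health table before running further statements. This is only a rough sketch, assuming the SQLAlchemy CrateDB dialect; the ensure_green helper, the doc.metrics table, and the connection URL are illustrative and not part of the test suite.

```python
import time
import sqlalchemy as sa

# Illustrative connection URL; adjust to the actual test environment.
engine = sa.create_engine("crate://localhost:4200")

def ensure_green(table_schema: str, table_name: str, timeout: float = 30.0) -> None:
    """Poll CrateDB's sys.health until the given table reports GREEN health."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with engine.connect() as connection:
            rows = connection.execute(
                sa.text(
                    "SELECT health FROM sys.health "
                    "WHERE table_schema = :schema AND table_name = :table"
                ),
                {"schema": table_schema, "table": table_name},
            ).fetchall()
        if rows and all(row[0] == "GREEN" for row in rows):
            return
        time.sleep(0.5)
    raise TimeoutError(f"{table_schema}.{table_name} did not report GREEN within {timeout}s")

# Wait for the `metrics` table before issuing further statements.
ensure_green("doc", "metrics")
```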
The output I've shared above clearly indicates that it is happening at the very same test case function, test_search_runs_run_id. What might be particularly more interesting in this case is the predecessor test case function, test_search_runs_returns_expected_results_with_large_experiment.
This observation indicates that CrateDB might not always be ready to accept a DELETE FROM ... statement on a database table after inserting a large amount of data [^1] into it. Do you think we are missing an additional REFRESH TABLE ... statement before conducting the DELETE FROM ...?
We can try, and it may qualify as a workaround (a sketch follows after the footnote). However, you may also want to address this within CrateDB itself, when applicable.
[^1]: ... at least, of the shape/characteristics as performed by the test_search_runs_returns_expected_results_with_large_experiment test case function, which I haven't investigated yet.
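To make the proposal concrete, here is a minimal sketch of the suggested workaround, assuming SQLAlchemy with the CrateDB dialect. The prune_table function is a hypothetical stand-in for the suite's pruneTables() helper, and the connection URL and table name are illustrative.

```python
import sqlalchemy as sa

# Illustrative connection URL; adjust to the actual test environment.
engine = sa.create_engine("crate://localhost:4200")

def prune_table(table_name: str) -> None:
    """Hypothetical stand-in for pruneTables(): refresh first, then delete."""
    with engine.begin() as connection:
        # REFRESH TABLE makes previously written rows visible to the
        # subsequent statement before the table gets emptied.
        connection.execute(sa.text(f"REFRESH TABLE {table_name}"))
        connection.execute(sa.text(f"DELETE FROM {table_name}"))

prune_table("metrics")
```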
[You are] using only one CrateDB node, right?
This is correct. The issue is about two flukes, reported from pedantically observing CI runs where CrateDB Nightly is used on purpose.
To add more information about the chronology: it first happened at GH-52, on Wed, 01 Nov 2023, with this release [^1]:
version[5.6.0-SNAPSHOT], pid[1], build[827ccae/NA], OS[Linux/6.2.0-1015-azure/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.1+12-LTS]
The second spot was:
version[5.6.0-SNAPSHOT], pid[1], build[ca0a7b6/NA], OS[Linux/6.2.0-1015-azure/amd64], JVM[Eclipse Adoptium/OpenJDK 64-Bit Server VM/21.0.1+12-LTS]
[^1]: Unfortunately, we didn't schedule nightly checks on this repository, so if there indeed is a flaw, it has probably been introduced way earlier.
It happened again, this time on a CI run triggered by a PR submitted by Dependabot.
Now, it is about ShardCollectContext for 2 instead of ShardCollectContext for 0.
FAILED tests/test_tracking.py::TestSqlAlchemyStore::test_search_runs_run_id - mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 2 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)
Once more: SQLParseException[ShardCollectContext for 0 already added] with DELETE FROM metrics.
That patch will increase CrateDB's heap size on CI, in order to explore whether the problem originates from low-memory situations.
[...] whether the problem originates from low-memory situations.
Indeed. Decreasing heap size triggers the issue right away. In this spirit, creating a reproducer will be much easier.
docker run --rm -it --name=cratedb \
--publish=4200:4200 --publish=5432:5432 \
--env=CRATE_HEAP_SIZE=256m \
crate/crate:nightly -Cdiscovery.type=single-node
time pytest -vvv -k "test_search_runs_returns_expected_results_with_large_experiment or test_search_runs_run_id"
mlflow.exceptions.MlflowException: (crate.client.exceptions.ProgrammingError) SQLParseException[ShardCollectContext for 0 already added]
[SQL: DELETE FROM metrics]
(Background on this error at: https://sqlalche.me/e/20/f405)
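For reference, here is a condensed standalone sketch that tries to emulate the failing sequence outside of the MLflow test suite: a sizeable bulk insert followed immediately by DELETE FROM, against the low-heap container started above. The metrics_repro table, its column layout, and the row count are arbitrary assumptions, not taken from the actual test cases.

```python
import sqlalchemy as sa

# Illustrative connection URL; points at the low-heap container started above.
engine = sa.create_engine("crate://localhost:4200")

with engine.begin() as connection:
    connection.execute(sa.text(
        "CREATE TABLE IF NOT EXISTS metrics_repro (name TEXT, val DOUBLE, step BIGINT)"
    ))

# Bulk-insert a large batch of rows, loosely emulating
# test_search_runs_returns_expected_results_with_large_experiment.
records = [{"name": f"metric-{i}", "val": float(i), "step": i} for i in range(100_000)]
with engine.begin() as connection:
    connection.execute(
        sa.text("INSERT INTO metrics_repro (name, val, step) VALUES (:name, :val, :step)"),
        records,
    )

# The reported SQLParseException surfaced on the subsequent DELETE.
with engine.begin() as connection:
    connection.execute(sa.text("DELETE FROM metrics_repro"))
```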
/cc @BaurzhanSakhariev, @jeeminso
The issue has been reported to the upstream crate/crate repository.
Coming from https://github.com/crate/crate/issues/15518#issuecomment-1943985644, and looking at recent nightly scheduled job executions of https://github.com/crate-workbench/mlflow-cratedb/actions, it looks like this problem has been mitigated. Therefore, I am closing this.
Thanks, @jeeminso.
Report
While trying to bring in GH-52, we caught an unusual error from CrateDB we haven't seen before.
It is happening on a DELETE FROM SQL statement.
Details
The DELETE FROM metrics is happening within a regular integration test scenario, on the canonical setUp() method calling out to self.pruneTables(), in order to supply the test cases with a blank canvas. This is nothing special, we have been doing it like this for a while already, on behalf of different test suites we are maintaining. Also note it was really only a fluke: re-running the test cases made them succeed on the first attempt already.
-- https://github.com/crate-workbench/mlflow-cratedb/actions/runs/6721484528/job/18267317481?pr=52#step:6:491
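For readers outside the project, here is a condensed illustration of the pattern described above, assuming a unittest-style test class and SQLAlchemy. The table list and connection URL are illustrative, not copied from the actual test suite.

```python
import unittest
import sqlalchemy as sa


class TestSqlAlchemyStore(unittest.TestCase):
    """Condensed illustration of the setUp()/pruneTables() pattern."""

    # Illustrative connection URL; adjust to the actual test environment.
    engine = sa.create_engine("crate://localhost:4200")

    def setUp(self):
        # Supply each test case with a blank canvas.
        self.pruneTables()

    def pruneTables(self):
        with self.engine.begin() as connection:
            # Illustrative table list; the fluke above surfaced on `metrics`.
            for table in ("metrics", "params", "tags"):
                connection.execute(sa.text(f"DELETE FROM {table}"))
```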
Thoughts
As we are using CrateDB nightly for all of our downstream integration tests, it may be a regression introduced just recently.
/cc @matriv, @seut