cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.86k stars 3.77k forks source link

teamcity: failed test: TestLogic #35304

Closed cockroach-teamcity closed 5 years ago

cockroach-teamcity commented 5 years ago

The following tests appear to have failed on master (test): TestLogic/fakedist: TestLogic/fakedist-metadata, TestLogic, TestLogic/fakedist: TestLogic/fakedist-metadata: TestLogic/fakedist-metadata/fk

You may want to check for open issues.

#1156962:

TestLogic/fakedist: TestLogic/fakedist-metadata
--- FAIL: test/TestLogic: TestLogic/fakedist: TestLogic/fakedist-metadata (0.360s)

TestLogic/fakedist: TestLogic/fakedist-metadata: TestLogic/fakedist-metadata/fk
--- FAIL: test/TestLogic: TestLogic/fakedist: TestLogic/fakedist-metadata: TestLogic/fakedist-metadata/fk (2.540s)

------- Stdout: -------
=== PAUSE TestLogic/fakedist-metadata/fk
            logic.go:2313: 

                testdata/logic_test/fk:318: 
                expected "foreign key violation: \"delivery\" row item='885155001450' has no match in \"products\"", but found "pq: expected 13 metadata senders but found 11; missing ae10e1a7-4615-45ed-beb8-fb4632e27be3, 5436c8c4-3b88-4ed4-ac49-a2e3d2c5b392"
            logic.go:2345: 
                testdata/logic_test/fk:321: error while processing
            logic.go:2346: testdata/logic_test/fk:321: too many errors encountered, skipping the rest of the input

TestLogic
--- FAIL: test/TestLogic (422.610s)

------- Stdout: -------
    test_log_scope.go:81: test logs captured to: /tmp/logTestLogic876114019
    test_log_scope.go:62: use -show-logs to present logs inline

Please assign, take a look and update the issue accordingly.

jordanlewis commented 5 years ago

Huh, this is pretty weird... cc @asubiotto

asubiotto commented 5 years ago

Seems like something pretty recent. The impact of dropped metadata is performance issues caused by continuously misplanning tablereaders and dropped stats in EXPLAIN ANALYZE as far as I know. We need to bisect to try to find the commit that introduced this which will inform further action. I can take care of this on Monday.

knz commented 5 years ago

Another repro https://teamcity.cockroachdb.com/viewLog.html?buildId=1170709&

asubiotto commented 5 years ago

3ef1788 seems to be the first bad commit:

3ef1788b882f2b87079a97025c1be0e02f31b865 is the first bad commit
commit 3ef1788b882f2b87079a97025c1be0e02f31b865
Author: Jordan Lewis <jordanthelewis@gmail.com>
Date:   Tue Jan 29 17:40:54 2019 -0500

    sql: delete startPlan and users of it

    In preparation of deleting the execution pathway of the local engine,
    delete startPlan, the old top-level way of executing a planNode tree.
    The usages from scrub, check and foreign key constraint validation were
    migrated to the distsql execution engine.

    Release note: None

Have to verify but it seems likely. Will keep on digging, it's interesting that this is not deterministically reproducible.

jordanlewis commented 5 years ago

Uhoh... let me know if you would prefer that I take this one over. Probably was that commit.

asubiotto commented 5 years ago

Simpler reproduction:

# LogicTest: fakedist-metadata

# Disable automatic stats to avoid flakiness.
statement ok
SET CLUSTER SETTING sql.stats.experimental_automatic_collection.enabled = false

statement ok
CREATE TABLE products (sku STRING PRIMARY KEY, upc STRING UNIQUE, vendor STRING)

statement ok
INSERT INTO products VALUES ('VP-W9QH-W44L', '867072000006', 'Dave'), ('780', '885155001450', 'iRobot')

# "delivery" is interesting since it references a secondary index with different col names.
statement ok
CREATE TABLE delivery (
  item STRING REFERENCES products (upc),
  INDEX (item)
)

query TT
SHOW CREATE TABLE delivery
----
delivery  CREATE TABLE delivery (
          item STRING NULL,
          CONSTRAINT fk_item_ref_products FOREIGN KEY (item) REFERENCES products (upc),
          INDEX delivery_item_idx (item ASC),
          FAMILY "primary" (item, rowid)
)

statement ok
INSERT INTO delivery (item) VALUES ('885155001450')

# No-op change to existing data is fine.
statement ok
UPDATE products SET upc = '885155001450' WHERE sku = '780'

statement ok
ALTER TABLE delivery DROP CONSTRAINT fk_item_ref_products

statement ok
UPDATE products SET upc = 'blah' WHERE sku = '780'

statement ok
ALTER TABLE delivery ADD FOREIGN KEY (item) REFERENCES products (upc)

statement error pgcode 23503 foreign key violation: "delivery" row item='885155001450' has no match in "products"
ALTER TABLE delivery VALIDATE CONSTRAINT fk_item_ref_products

The cause must be that metadata isn't propagated properly when an error is encountered when performing constraint validation. I think the above commit just made this easier to hit.

asubiotto commented 5 years ago

It seems interesting that the metadata error message specifies two senders:

"pq: expected 13 metadata senders but found 11; missing 72a79b8e-e9c0-4bdd-bdda-b2c44180287b, 6af59fba-c1d5-421c-ab7a-59912705f5c2"

But the plan only shows one: image I asserted that the plan looks as expected in the failure case.

asubiotto commented 5 years ago

What's more likely happening is that the check itself is spinning off senders/receivers and that is failing. Look at this bad boy: image Plan explained from EXPLAIN (DISTSQL) SELECT s.item FROM (SELECT * FROM test.public.delivery@delivery_item_idx WHERE item IS NOT NULL) AS s LEFT OUTER JOIN test.public.products@products_upc_key AS t ON s.item = t.upc WHERE t.upc IS NULL LIMIT 1

asubiotto commented 5 years ago

What a tricky little bug. The bug only happens on plans where the merge joiner right side is an ordered synchronizer: image

What ended up happening was that the ordered synchronizer would have ConsumerDone called on it when it hadn't yet been initialized (nobody had called Next yet). This meant that the source heap was uninitialized and since we only drain sources from the heap (if we are in any other state than uninitialized, we assume sources not on the heap are done, not uninitialized) we would miss draining these sources. This line of code fixes the test failure:

diff --git a/pkg/sql/distsqlrun/input_sync.go b/pkg/sql/distsqlrun/input_sync.go
index cb7bf2f96e..0a5c8438c7 100644
--- a/pkg/sql/distsqlrun/input_sync.go
+++ b/pkg/sql/distsqlrun/input_sync.go
@@ -337,6 +337,7 @@ func (s *orderedSynchronizer) consumerStatusChanged(
        if s.state == notInitialized {
                for i := range s.sources {
                        f(s.sources[i].src)
+                       s.heap = append(s.heap, srcIdx(i))
                }
        } else {
                // The sources that are not in the heap have been consumed already. It

Will think about whether this is what we actually want to do, add a test, and PR it.

This is a great reminder that test failures are real bugs sometimes, and worth the investigation.