cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

schema changes: retries of view creation because of stats collection results in non-deterministic table ids #37751

Closed cockroach-teamcity closed 2 years ago

cockroach-teamcity commented 5 years ago

The following tests appear to have failed on master (test): TestLogic/local: TestLogic/local/dependencies, TestLogic/local, TestLogic

You may want to check for open issues.

#1303610:

TestLogic/local: TestLogic/local/dependencies
--- FAIL: test/TestLogic: TestLogic/local: TestLogic/local/dependencies (3.810s)

------- Stdout: -------
=== PAUSE TestLogic/local/dependencies
            logic.go:2316: 

                testdata/logic_test/dependencies:111: SELECT * FROM crdb_internal.backward_dependencies WHERE descriptor_name LIKE 'moretest_%' ORDER BY descriptor_id, index_id, dependson_type, dependson_id, dependson_index_id
                expected:
                    descriptor_id  descriptor_name  index_id  column_id  dependson_id  dependson_type  dependson_index_id  dependson_name  dependson_details
                    62             moretest_v       NULL      NULL       61            view            NULL                NULL            NULL

                but found (query options: "colnames") :
                    descriptor_id  descriptor_name  index_id  column_id  dependson_id  dependson_type  dependson_index_id  dependson_name  dependson_details
                    63             moretest_v       NULL      NULL       61            view            NULL                NULL            NULL

            logic.go:2348: 
                testdata/logic_test/dependencies:117: error while processing
            logic.go:2349: testdata/logic_test/dependencies:117: too many errors encountered, skipping the rest of the input

TestLogic/local
--- FAIL: test/TestLogic: TestLogic/local (0.450s)

TestLogic
--- FAIL: test/TestLogic (571.650s)

------- Stdout: -------
    test_log_scope.go:81: test logs captured to: /tmp/logTestLogic691072255
    test_log_scope.go:62: use -show-logs to present logs inline

Please assign, take a look and update the issue accordingly.

Jira issue: CRDB-5663

nvanbenschoten commented 5 years ago

I can repro this in about a minute of running

make stress PKG=./pkg/sql/logictest TESTS=TestLogic/local/dependencies
nvanbenschoten commented 5 years ago
I190523 17:08:42.632682 1048 sql/event_log.go:135  [n1,client=127.0.0.1:52966,user=root] Event: "create_view", target: 62, info: {ViewName:test.public.moretest_v Statement:CREATE VIEW moretest_v AS SELECT v FROM test.public.moretest_t WHERE false User:root}
I190523 17:08:42.632709 1048 sql/event_log.go:135  [n1,client=127.0.0.1:52966,user=root] Event: "create_view", target: 63, info: {ViewName:test.public.moretest_v Statement:CREATE VIEW moretest_v AS SELECT v FROM test.public.moretest_t WHERE false User:root}

testdata/logic_test/dependencies:111: SELECT * FROM crdb_internal.backward_dependencies WHERE descriptor_name LIKE 'moretest_%' ORDER BY descriptor_id, index_id, dependson_type, dependson_id, dependson_index_id
expected:
    descriptor_id  descriptor_name  index_id  column_id  dependson_id  dependson_type  dependson_index_id  dependson_name  dependson_details
    62             moretest_v       NULL      NULL       61            view            NULL                NULL            NULL

but found (query options: "colnames") :
    descriptor_id  descriptor_name  index_id  column_id  dependson_id  dependson_type  dependson_index_id  dependson_name  dependson_details
    63             moretest_v       NULL      NULL       61            view            NULL                NULL            NULL
nvanbenschoten commented 5 years ago
I190523 18:11:19.430255 1031 sql/create_view.go:180  [n1,client=127.0.0.1:53588,user=root] NEW LOG create view done 62 "sql txn" id=a34185b8 key=/Table/SystemConfigSpan/Start rw=true pri=0.01056497 stat=PENDING epo=0 ts=1558635079.427853029,1 orig=1558635079.422331802,0 max=1558635079.922331802,0 wto=false seq=3
I190523 18:11:19.436152 1031 sql/create_view.go:180  [n1,client=127.0.0.1:53588,user=root] NEW LOG create view done 63 "sql txn" id=a34185b8 key=/Table/SystemConfigSpan/Start rw=true pri=0.01056497 stat=PENDING epo=1 ts=1558635079.427853029,1 orig=1558635079.427853029,1 max=1558635079.922331802,0 wto=false seq=8
I190523 18:11:19.440500 1031 sql/event_log.go:135  [n1,client=127.0.0.1:53588,user=root] Event: "create_view", target: 62, info: {ViewName:test.public.moretest_v Statement:CREATE VIEW moretest_v AS SELECT v FROM test.public.moretest_t WHERE false User:root}
I190523 18:11:19.440567 1031 sql/event_log.go:135  [n1,client=127.0.0.1:53588,user=root] Event: "create_view", target: 63, info: {ViewName:test.public.moretest_v Statement:CREATE VIEW moretest_v AS SELECT v FROM test.public.moretest_t WHERE false User:root}

It looks like this is due to a transaction restart.

nvanbenschoten commented 5 years ago

We're hitting a RETRY_SERIALIZABLE retry error. That means that this is not the same issue as #35977.

I190523 18:22:33.125995 1132 sql/create_view.go:180  [n1,client=127.0.0.1:40642,user=root] NEW LOG create view done 62 "sql txn" id=0a89301a key=/Table/SystemConfigSpan/Start rw=true pri=0.01447941 stat=PENDING epo=0 ts=1558635753.116826665,1 orig=1558635753.104986430,0 max=1558635753.604986430,0 wto=false seq=3
I190523 18:22:33.130027 1132 kv/txn_coord_sender.go:1092  [n1,client=127.0.0.1:40642,user=root,txn=0a89301a] NEW LOG bumping txn epoch "sql txn" id=0a89301a key=/Table/SystemConfigSpan/Start rw=true pri=0.01447941 stat=PENDING epo=1 ts=1558635753.116826665,1 orig=1558635753.116826665,1 max=1558635753.604986430,0 wto=false seq=8: TransactionRetryError: retry txn (RETRY_SERIALIZABLE): "sql txn" id=0a89301a key=/Table/SystemConfigSpan/Start rw=true pri=0.01447941 stat=PENDING epo=0 ts=1558635753.116826665,1 orig=1558635753.104986430,0 max=1558635753.104986430,0 wto=false seq=9
I190523 18:22:33.132274 1132 sql/create_view.go:180  [n1,client=127.0.0.1:40642,user=root] NEW LOG create view done 63 "sql txn" id=0a89301a key=/Table/SystemConfigSpan/Start rw=true pri=0.01447941 stat=PENDING epo=1 ts=1558635753.116826665,1 orig=1558635753.116826665,1 max=1558635753.604986430,0 wto=false seq=8
nvanbenschoten commented 5 years ago

I'm seeing the transaction get its timestamp bumped by the timestamp cache when it goes to Put key /Table/3/1/61/2/1. This is a Put on the descriptor 61 in the descriptor table. So this means that some other process is reading the descriptor of the new view's (moretest_v) table (moretest_t) concurrently with the schema change that is creating the view.

nvanbenschoten commented 5 years ago

The timestamp cache entry has a transaction ID, so this does not have to do with lease changes or closed timestamps.

nvanbenschoten commented 5 years ago

The transaction that is performing this read is running under an internalExecutor with the create-stats operation name.

@RaduBerinde @rytaft has anything changed in this area lately? Are we creating stats more frequently or is there any reason why this might now be running in logic tests?

nvanbenschoten commented 5 years ago

Disabling automatic statistics collection appears to fix this flake.

RaduBerinde commented 5 years ago

The recent change in this area is #37611 - table creation now triggers a stats run.

andreimatei commented 5 years ago

I have some thoughts on the topic. Here's my half-understanding-half-speculation about what's going on:

First question is - why is the id allocation non-transactional? It turns out that I have made it so in https://github.com/cockroachdb/cockroach/commit/74f5728dfc354f0324f51fe271bd36a6da2c3918. The reason the commit lists is because of ORMs which don't like txn retries for schema changes (I think it was motivated at the time in part by the Jepsen test suite which couldn't handle these retries). And so the point of making the id alloc non-txn was to isolate the contentious part of different schema change txns, so that the broaders txns wouldn't have to conflict.

Now, the auto-stats collection seems to have reintroduced some retries for schema changes, and so I guess that defeats the purpose of making that id allocation non-transactional and presumably is again a problem for Jepsen and tools/ORMs. So I think the problem here might be bigger than just tests that hardcode txn ids. (In the case of logic tests I'm guessing we're able to perform auto-retries but in the wild we often can't and that trips up tools).

In my opinion, the root of all evil here is the existence of the txn.CommitTimestamp() method https://github.com/cockroachdb/cockroach/blob/e92f68af8dd507fb80f1950f969e62e7819be027/pkg/internal/client/txn.go#L262 which prevents the refreshes. Were it not for this, the view txn would presumably refresh just fine and succeed the first time. I've always hated the use of that function for schema changes (which I think might also be the only use) because the information that it provides - the "modification time" of a table descriptor - can be "trivially" deduced from KV: it's the MVCC timestamp of the key in question. No need to record it explicitly. I believe a ScanRequest even carries the mvcc timestamps in it; it's just that they get lost towards SQL. I think this is a good opportunity to kill this wart. Giving SQL access to these timestamps would also open the door to all sort of features - such as ones requested here: https://github.com/cockroachdb/cockroach/issues/19749

If I can snipe someone to do this (and assuming the theory checks out), I think it would be better than papering over these id allocations somehow - like for example through remembering allocated but unused ids as Nathan and other have suggested. Curious to see if the cat licks it up.

cc @dt

andreimatei commented 5 years ago

also cc @danhhz

knz commented 5 years ago

I loved reading the analyses.

I think the requirement that DDL does not encounter retries is still there, that still really causes problems with 3rd party tools. Retries should still be minimized for DDL (or at least occur as little as possible, thus not "statically probably" when stat collection is enabled).

The fact that ID increments are non-transactional is a red herring in this discussion -- we can work around that easily in tests, and it is largely invisible to clients.

Andrei I like your idea to simplify the DDL handling code in SQL to minimize the way it blocks a txn from being auto-pushed. If it can use MVCC timestamps and achieve that (and thus avoid a few more retries), that would be a good thing.

nvanbenschoten commented 5 years ago

@andreimatei do you want to keep this open and reorient it around https://github.com/cockroachdb/cockroach/issues/37751#issuecomment-495391682?

andreimatei commented 5 years ago

Yeah. I believe it's oriented already :P

nvanbenschoten commented 5 years ago

Let's rename it and give it appropriate labels then.

andreimatei commented 5 years ago

Renamed. But Radu, I'm moving it back to something under SQL. Although there might be minor "kv" changes needed, the problems here are really SQL problems.

To recap, my understanding is that:

For the moment I see we've turned off stats collection for logic tests, but that seems not too great.

thoszhang commented 4 years ago

@ajwerner Do you think this is fixed by #40793? Maybe we should turn stats collection back on for logic tests.

ajwerner commented 4 years ago

Certainly possible. We should at least experiment with it.