ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

Serialization error after INSERT when using implicit_transaction #43767

Open realyota opened 1 year ago

realyota commented 1 year ago

Describe what's wrong

I'm testing experimental transactional support and I stumbled across bizarre behavior regarding INSERT atomicity.

I properly configured allow_experimental_transactions and all statements are executed using:

 SET implicit_transaction = true, throw_on_unsupported_query_inside_transaction = false;

After inserting a larger batch of data, I run an ALTER statement that reads the data just inserted. It fails with a `Serialization error: part all_1221_1221_0 is locked by transaction 10837109814998401029` exception.

If I wait some time (around 30 seconds) before running the ALTER statement, no exception is raised. The same applies if I set min_insert_block_size_rows to a number higher than the count of inserted rows (set min_insert_block_size_rows = 10000000001), or if I execute BEGIN TRANSACTION at the beginning of the SQL block (committing it afterwards).
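For reference, a sketch of the explicit-transaction workaround mentioned above (table and statements taken from the repro below; untested beyond what is described in this report):

```sql
-- Workaround sketch: wrap the INSERT and the ALTER in an explicit
-- transaction instead of relying on implicit_transaction, so the INSERT's
-- parts are committed before the mutation runs.
SET throw_on_unsupported_query_inside_transaction = false;

BEGIN TRANSACTION;

INSERT INTO t (id, c1) SELECT number, number % 30 FROM numbers(1000000000);

ALTER TABLE t
    DELETE WHERE c1 IN (SELECT DISTINCT c1 FROM t WHERE (c1 % 17) = 0)
SETTINGS mutations_sync = 2;

COMMIT;
```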

Does it reproduce on recent release?

Yes. Tested on 22.11.1.1360.

How to reproduce

 SET implicit_transaction = true, throw_on_unsupported_query_inside_transaction = false;

DROP TABLE IF EXISTS t

Query id: 61a12db2-aa8a-49a9-b498-d56124d0c3ae

Ok.

0 rows in set. Elapsed: 0.270 sec.

CREATE TABLE t
(
    `id` int,
    `c1` int
)
ENGINE = MergeTree
ORDER BY id

Query id: c29f60ec-e0eb-4357-9164-98611afc016c

Ok.

0 rows in set. Elapsed: 0.009 sec.

INSERT INTO t (id, c1) SELECT
    number,
    number % 30
FROM numbers(1000000000)

Query id: 072a822f-caa7-4621-9b61-3941229a30ea

Ok.

0 rows in set. Elapsed: 15.563 sec. Processed 1.00 billion rows, 8.00 GB (64.25 million rows/s., 514.04 MB/s.)

ALTER TABLE t
    DELETE WHERE c1 IN (
        SELECT DISTINCT c1
        FROM t
        WHERE (c1 % 17) = 0
    )
SETTINGS mutations_sync = 2

Query id: 9495b60d-18b3-48c4-8384-76b6e1d24f9a

Ok.

0 rows in set. Elapsed: 6.717 sec.

Received exception from server (version 22.11.1):
Code: 341. DB::Exception: Received from localhost:9000. DB::Exception: Exception happened during execution of mutation 'mutation_1239.txt' with part 'all_1221_1221_0' reason: 'Serialization error: part all_1221_1221_0 is locked by transaction 10837109814998401029'. This error maybe retryable or not. In case of unretryable error, mutation can be killed with KILL MUTATION query. (UNFINISHED)

When I query for that transaction I see:

SELECT *
FROM system.transactions_info_log
WHERE tid_hash = 10837109814998401029
LIMIT 1
FORMAT Vertical

Query id: 0e8808de-11dc-4400-9d41-ff9ce251d497

Row 1:
──────
type:       Commit
event_date: 2022-11-23
event_time: 2022-11-23 17:20:58.556038
thread_id:  17219
query_id:
tid:        (138,35,'6b4435e1-7fe0-4c34-bab4-742c1ba9791c')
tid_hash:   10837109814998401029
csn:        144
database:
table:
uuid:       00000000-0000-0000-0000-000000000000
part:

Expected behavior

INSERT should commit all changes (when no explicit BEGIN TRANSACTION was issued), and there shouldn't be any active transaction left blocking newly created parts.

tavplubix commented 1 year ago

Most likely it's because of concurrent merges. Grep 10837109814998401029 in server logs.
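A minimal way to follow that suggestion, assuming the default log location of a package install (the path is an assumption; adjust it for your setup):

```shell
#!/bin/sh
# Sketch of the suggested diagnostic: grep the tid_hash from the error
# message in the server log to see which operations ran under that
# transaction. The log path below is the package-install default and is
# an assumption.
TID=10837109814998401029
LOG=/var/log/clickhouse-server/clickhouse-server.log
grep -F "$TID" "$LOG"
```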

filimonov commented 1 year ago

Trying to reproduce it on my laptop (master from a couple of days ago, with config 'as in CI').

It crashes on the DROP statement (the table does not exist yet):

[laptop-5591] 2022.11.29 11:16:27.668941 [ 116260 ] <Fatal> BaseDaemon: ########################################
[laptop-5591] 2022.11.29 11:16:27.669014 [ 116260 ] <Fatal> BaseDaemon: (version 22.12.1.593 (official build), build id: F31AA168B1B94907A2A1718A3A8650144099200D) (from thread 115301) (query_id: d51b1eb9-3cfc-4687-8fc5-7724cac1bbba) (query: DROP TABLE IF EXISTS t) Received signal Segmentation fault (11)
[laptop-5591] 2022.11.29 11:16:27.669037 [ 116260 ] <Fatal> BaseDaemon: Address: NULL pointer. Access: read. Address not mapped to object.
[laptop-5591] 2022.11.29 11:16:27.669059 [ 116260 ] <Fatal> BaseDaemon: Stack trace: 0x13cec939 0x13cf2bb7 0x13cf2212 0x13cf1728 0x141bd873 0x141baaad 0x14e6988e 0x14e7cd99 0x17cd23b4 0x17cd3d5b 0x17e62b67 0x17e6059d 0x7f2e0b0d0b43 0x7f2e0b162a00
[laptop-5591] 2022.11.29 11:16:27.731091 [ 116260 ] <Fatal> BaseDaemon: 2. ./build_docker/../src/Interpreters/IInterpreter.cpp:37: DB::IInterpreter::checkStorageSupportsTransactionsIfNeeded(std::__1::shared_ptr<DB::IStorage> const&, std::__1::shared_ptr<DB::Context const>) @ 0x13cec939 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:27.858892 [ 116260 ] <Fatal> BaseDaemon: 3.1. inlined from ./build_docker/../contrib/libcxx/include/__memory/shared_ptr.h:833: std::__1::shared_ptr<DB::IDatabase>::operator bool[abi:v15003]() const
[laptop-5591] 2022.11.29 11:16:27.858938 [ 116260 ] <Fatal> BaseDaemon: 3. ./build_docker/../src/Interpreters/InterpreterDropQuery.cpp:126: DB::InterpreterDropQuery::executeToTableImpl(std::__1::shared_ptr<DB::Context const>, DB::ASTDropQuery&, std::__1::shared_ptr<DB::IDatabase>&, StrongTypedef<wide::integer<128ul, unsigned int>, DB::UUIDTag>&) @ 0x13cf2bb7 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:27.871746 [ 116260 ] <Fatal> BaseDaemon: 4. ./build_docker/../src/Interpreters/InterpreterDropQuery.cpp:93: DB::InterpreterDropQuery::executeToTable(DB::ASTDropQuery&) @ 0x13cf2212 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:27.882972 [ 116260 ] <Fatal> BaseDaemon: 5. ./build_docker/../src/Interpreters/InterpreterDropQuery.cpp:0: DB::InterpreterDropQuery::execute() @ 0x13cf1728 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.017572 [ 116260 ] <Fatal> BaseDaemon: 6. ./build_docker/../src/Interpreters/executeQuery.cpp:0: DB::executeQueryImpl(char const*, char const*, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x141bd873 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.054797 [ 116260 ] <Fatal> BaseDaemon: 7. ./build_docker/../src/Interpreters/executeQuery.cpp:1082: DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x141baaad in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.174044 [ 116260 ] <Fatal> BaseDaemon: 8. ./build_docker/../src/Server/TCPHandler.cpp:0: DB::TCPHandler::runImpl() @ 0x14e6988e in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.233965 [ 116260 ] <Fatal> BaseDaemon: 9. ./build_docker/../src/Server/TCPHandler.cpp:1904: DB::TCPHandler::run() @ 0x14e7cd99 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.237444 [ 116260 ] <Fatal> BaseDaemon: 10. ./build_docker/../contrib/poco/Net/src/TCPServerConnection.cpp:57: Poco::Net::TCPServerConnection::start() @ 0x17cd23b4 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.256333 [ 116260 ] <Fatal> BaseDaemon: 11.1. inlined from ./build_docker/../contrib/libcxx/include/__memory/unique_ptr.h:48: std::__1::default_delete<Poco::Net::TCPServerConnection>::operator()[abi:v15003](Poco::Net::TCPServerConnection*) const
[laptop-5591] 2022.11.29 11:16:28.256384 [ 116260 ] <Fatal> BaseDaemon: 11.2. inlined from ./build_docker/../contrib/libcxx/include/__memory/unique_ptr.h:305: std::__1::unique_ptr<Poco::Net::TCPServerConnection, std::__1::default_delete<Poco::Net::TCPServerConnection>>::reset[abi:v15003](Poco::Net::TCPServerConnection*)
[laptop-5591] 2022.11.29 11:16:28.256404 [ 116260 ] <Fatal> BaseDaemon: 11.3. inlined from ./build_docker/../contrib/libcxx/include/__memory/unique_ptr.h:259: ~unique_ptr
[laptop-5591] 2022.11.29 11:16:28.256418 [ 116260 ] <Fatal> BaseDaemon: 11. ./build_docker/../contrib/poco/Net/src/TCPServerDispatcher.cpp:116: Poco::Net::TCPServerDispatcher::run() @ 0x17cd3d5b in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.283465 [ 116260 ] <Fatal> BaseDaemon: 12. ./build_docker/../contrib/poco/Foundation/src/ThreadPool.cpp:213: Poco::PooledThread::run() @ 0x17e62b67 in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.290444 [ 116260 ] <Fatal> BaseDaemon: 13.1. inlined from ./build_docker/../contrib/poco/Foundation/include/Poco/SharedPtr.h:277: Poco::SharedPtr<Poco::Runnable, Poco::ReferenceCounter, Poco::ReleasePolicy<Poco::Runnable>>::get()
[laptop-5591] 2022.11.29 11:16:28.290502 [ 116260 ] <Fatal> BaseDaemon: 13.2. inlined from ./build_docker/../contrib/poco/Foundation/include/Poco/SharedPtr.h:156: Poco::SharedPtr<Poco::Runnable, Poco::ReferenceCounter, Poco::ReleasePolicy<Poco::Runnable>>::assign(Poco::Runnable*)
[laptop-5591] 2022.11.29 11:16:28.290523 [ 116260 ] <Fatal> BaseDaemon: 13.3. inlined from ./build_docker/../contrib/poco/Foundation/include/Poco/SharedPtr.h:208: Poco::SharedPtr<Poco::Runnable, Poco::ReferenceCounter, Poco::ReleasePolicy<Poco::Runnable>>::operator=(Poco::Runnable*)
[laptop-5591] 2022.11.29 11:16:28.290535 [ 116260 ] <Fatal> BaseDaemon: 13. ./build_docker/../contrib/poco/Foundation/src/Thread_POSIX.cpp:360: Poco::ThreadImpl::runnableEntry(void*) @ 0x17e6059d in /usr/bin/clickhouse
[laptop-5591] 2022.11.29 11:16:28.290562 [ 116260 ] <Fatal> BaseDaemon: 14. ? @ 0x7f2e0b0d0b43 in ?
[laptop-5591] 2022.11.29 11:16:28.290578 [ 116260 ] <Fatal> BaseDaemon: 15. ? @ 0x7f2e0b162a00 in ?
[laptop-5591] 2022.11.29 11:16:28.501938 [ 116260 ] <Fatal> BaseDaemon: Integrity check of the executable successfully passed (checksum: 637D90EFECAF8B869E3E5ECEC1E74504)


filimonov commented 1 year ago

Easy to reproduce locally (I've increased block sizes to reduce the noise caused by a high number of parts):

SET implicit_transaction = true, throw_on_unsupported_query_inside_transaction = false, min_insert_block_size_bytes=1000000000, min_insert_block_size_rows=100000000;

DROP TABLE IF EXISTS t;

CREATE TABLE t
(
    `id` int,
    `c1` int
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO t (id, c1) SELECT
    number,
    number % 30
FROM numbers(1000000000);

/*
SELECT sleep(2);
SYSTEM FLUSH LOGS;
SELECT * FROM system.merges WHERE  database = currentDatabase() AND table='t';
SELECT * FROM system.transactions_info_log WHERE event_time > now() - 60 and database <> 'system' AND query_id <> '' ORDER BY event_time  FORMAT PrettyCompactMonoBlock;
SELECT now();
SELECT name, active, modification_time from system.parts where (database = currentDatabase()) AND (table = 't') order by modification_time;
*/

ALTER TABLE t
    DELETE WHERE c1 IN (
        SELECT DISTINCT c1
        FROM t
        WHERE (c1 % 17) = 0
    )
SETTINGS mutations_sync = 2;

Yes, obviously there is a merge starting/running at the same time as the mutation.
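One rough way to confirm that interaction (a diagnostic, not a fix): stop background merges for the table before running the repro, and re-enable them afterwards. If the mutation then succeeds, the serialization error is indeed caused by a concurrent merge picking up the uncommitted part.

```sql
-- Diagnostic sketch: suppress background merges on table t so no merge
-- can lock the freshly inserted part while the mutation runs.
SYSTEM STOP MERGES t;

-- ... run the INSERT and the ALTER ... DELETE from the repro above ...

-- Re-enable background merges when done.
SYSTEM START MERGES t;
```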