cmu-db / noisepage

Self-Driving Database Management System from Carnegie Mellon University
https://noise.page
MIT License

Unable to Return Large Query Results to Client #785

Open apavlo opened 4 years ago

apavlo commented 4 years ago

We are not able to return the results for queries that exceed a certain size. The server kills the connection. Sometimes there are client-side errors that mention reading invalid packets.

To reproduce, load the TPC-C database with scalefactor=1. This will add 30k tuples to the CUSTOMER table:

terrier=# SELECT count(*) FROM customer;
 COUNT STAR
------------
      30000
(1 row)

Then try to read the entire table back:

terrier=# SELECT * FROM customer;
lost synchronization with server: got message type "b", length 1935892850
The connection to the server was lost. Attempting reset: Succeeded.

The server reports:

[2020-02-19 14:00:39.198] [network_logger] [error] Error writing: Connection reset by peer
[2020-02-19 14:00:39.198] [network_logger] [error] Error when filling read buffer

Subsequent invocations produce different client-side errors:

Attempt 2

terrier=# SELECT * FROM customer;
insufficient data in "D" message
lost synchronization with server: got message type "x", length 2036622960
The connection to the server was lost. Attempting reset: Succeeded.

Attempt 3

terrier=# SELECT * FROM customer;
lost synchronization with server: got message type "k", length 1952801604
The connection to the server was lost. Attempting reset: Succeeded.

Attempt 4

On the fourth attempt, the client just hangs forever.
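The bogus lengths in attempts 1-3 are a clue in themselves. In the PostgreSQL wire protocol every message starts with a one-byte type followed by a four-byte big-endian length, so a client that starts reading a header in the middle of row data will interpret text as a length. A minimal Python sketch (the helper name `length_as_ascii` is made up for illustration) that reinterprets the three reported lengths as raw bytes:

```python
import struct


def length_as_ascii(length: int) -> str:
    """Reinterpret a 32-bit big-endian length field as its four raw bytes."""
    return struct.pack(">I", length).decode("ascii")


# Lengths copied from the psql errors in attempts 1-3 above.
REPORTED_LENGTHS = (1935892850, 2036622960, 1952801604)

if __name__ == "__main__":
    for reported in REPORTED_LENGTHS:
        print(reported, repr(length_as_ascii(reported)))
```

The three values decode to "scar", "ydfp", and "tecD". All of them are printable ASCII, which is consistent with the client treating column text as a message header. It does not by itself show where the stream got out of step.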

mbutrovich commented 4 years ago

Just an update: I ran into this while testing YCSB with oltpbench for #739. The query is a SELECT * on the table, and it eventually ends up writing garbage while returning the data. I've attached a Wireshark trace that corresponds to the screenshot.

Screen Shot 2020-04-06 at 9 22 56 PM

ycsb_simplequery.pcapng.zip

apavlo commented 4 years ago

As of 2020-08-27, this is still an issue:

terrier: /home/pavlo/Documents/Peloton/Github/terrier/src/include/network/packet_writer.h:82: terrier::network::PacketWriter& terrier::network::PacketWriter::AppendRaw(const void*, size_t): Assertion `(!IsPacketEmpty()) && ("packet length is null")' failed.
terrier: /home/pavlo/Documents/Peloton/Github/terrier/src/include/network/packet_writer.h:63: terrier::network::PacketWriter& terrier::network::PacketWriter::BeginPacket(terrier::network::NetworkMessageType): Assertion `(IsPacketEmpty()) && ("packet length is null")' failed.
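The two assertions guard opposite states: `BeginPacket` requires that no packet is in progress, and `AppendRaw` requires that one is. Seeing both fail is what one would expect if calls from more than one producer interleave on a single writer. The stack trace below does show the output callback running inside a TBB `ScanTask`, but nothing in this thread confirms that as the cause. The following toy Python model is only a sketch of that hypothesis: every name in it is hypothetical, it is not NoisePage code, and it uses a scripted call order instead of real threads.

```python
import struct


class ToyPacketWriter:
    """Hypothetical stand-in for a packet writer; not NoisePage's implementation."""

    def __init__(self):
        self.out = bytearray()  # fully framed bytes
        self.current = None     # packet under construction, or None

    def is_packet_empty(self) -> bool:
        return self.current is None

    def begin_packet(self, msg_type: bytes) -> None:
        if not self.is_packet_empty():
            raise RuntimeError("begin_packet while a packet is open")
        self.current = bytearray(msg_type)

    def append_raw(self, data: bytes) -> None:
        if self.is_packet_empty():
            raise RuntimeError("append_raw with no open packet")
        self.current += data

    def end_packet(self) -> None:
        if self.is_packet_empty():
            raise RuntimeError("end_packet with no open packet")
        body = bytes(self.current[1:])
        # The length field counts itself (4 bytes) plus the body, not the type byte.
        self.out += self.current[:1] + struct.pack(">I", len(body) + 4) + body
        self.current = None


def interleaved_failures() -> list:
    """Replay one possible interleaving of two producers, A and B, on one writer."""
    writer = ToyPacketWriter()
    failures = []
    writer.begin_packet(b"D")          # A starts a row
    try:
        writer.begin_packet(b"D")      # B starts a row on the same writer
    except RuntimeError as err:
        failures.append(str(err))
    writer.append_raw(b"\x00\x01")     # A writes its column count
    writer.end_packet()                # A finishes its row
    try:
        writer.append_raw(b"abc")      # B appends after A closed the packet
    except RuntimeError as err:
        failures.append(str(err))
    return failures
```

In this scripted order, both checks fire once: the second `begin_packet` finds a packet already open, and the late `append_raw` finds none. If checks like these were compiled out, the same interleaving would instead mix bytes from two rows into one stream.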

Stack Trace

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff52308b1 in __GI_abort () at abort.c:79
#2  0x00007ffff522042a in __assert_fail_base (fmt=0x7ffff53a7a38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x555558f60900 "(!IsPacketEmpty()) && (\"packet length is null\")", 
    file=file@entry=0x555558f60820 "/home/pavlo/Documents/Peloton/Github/terrier/src/include/network/packet_writer.h", line=line@entry=82, 
    function=function@entry=0x555558f630c0 <terrier::network::PacketWriter::AppendRaw(void const*, unsigned long)::__PRETTY_FUNCTION__> "terrier::network::PacketWriter& terrier::network::PacketWriter::AppendRaw(const void*, size_t)") at assert.c:92
#3  0x00007ffff52204a2 in __GI___assert_fail (assertion=0x555558f60900 "(!IsPacketEmpty()) && (\"packet length is null\")", 
    file=0x555558f60820 "/home/pavlo/Documents/Peloton/Github/terrier/src/include/network/packet_writer.h", line=82, 
    function=0x555558f630c0 <terrier::network::PacketWriter::AppendRaw(void const*, unsigned long)::__PRETTY_FUNCTION__> "terrier::network::PacketWriter& terrier::network::PacketWriter::AppendRaw(const void*, size_t)")
    at assert.c:101
#4  0x00005555561d8a95 in terrier::network::PacketWriter::AppendRaw (this=0x7fff6c4a5580, src=0x7ffce2f32964, len=4) at /home/pavlo/Documents/Peloton/Github/terrier/src/include/network/packet_writer.h:82
#5  0x00005555561dbc59 in terrier::network::PacketWriter::AppendRawValue<int> (this=0x7fff6c4a5580, val=-1425997824) at /home/pavlo/Documents/Peloton/Github/terrier/src/include/network/packet_writer.h:100
#6  0x00005555561d9ba0 in terrier::network::PacketWriter::AppendValue<int> (this=0x7fff6c4a5580, val=427) at /home/pavlo/Documents/Peloton/Github/terrier/src/include/network/packet_writer.h:128
#7  0x00005555561d0cc9 in terrier::network::PostgresPacketWriter::WriteTextAttribute (this=0x7fff6c4a5580, val=0x627000395290, type=terrier::type::TypeId::VARCHAR)
    at /home/pavlo/Documents/Peloton/Github/terrier/src/network/postgres/postgres_packet_writer.cpp:375
#8  0x00005555561d0416 in terrier::network::PostgresPacketWriter::WriteDataRow (this=0x7fff6c4a5580, tuple=0x627000395100, columns=std::vector of length 21, capacity 32 = {...}, 
    field_formats=std::vector of length 1, capacity 1 = {...}) at /home/pavlo/Documents/Peloton/Github/terrier/src/network/postgres/postgres_packet_writer.cpp:251
#9  0x0000555556a4fe05 in terrier::execution::exec::OutputWriter::operator() (this=0x603006eaf2b0, tuples=0x627000395100, num_tuples=32, tuple_size=424)
    at /home/pavlo/Documents/Peloton/Github/terrier/src/execution/exec/output.cpp:98
#10 0x00005555565b520c in std::_Function_handler<void (std::byte*, unsigned int, unsigned int), terrier::execution::exec::OutputWriter>::_M_invoke(std::_Any_data const&, std::byte*&&, unsigned int&&, unsigned int&&) (
    __functor=..., __args#0=@0x7ffce2f32d90: 0x627000395100, __args#1=@0x7ffce2f32d8c: 32, __args#2=@0x7ffce2f32d88: 424) at /usr/include/c++/7/bits/std_function.h:316
#11 0x0000555556a5051e in std::function<void (std::byte*, unsigned int, unsigned int)>::operator()(std::byte*, unsigned int, unsigned int) const (this=0x606000f39e18, __args#0=0x627000395100, __args#1=32, __args#2=424)
    at /usr/include/c++/7/bits/std_function.h:706
#12 0x0000555556c12c3e in terrier::execution::exec::OutputBuffer::AllocOutputSlot (this=0x606000f39e00) at /home/pavlo/Documents/Peloton/Github/terrier/src/include/execution/exec/output.h:60
#13 0x0000555556c33914 in OpResultBufferAllocOutputRow (result=0x7ffce2f36458, ctx=0x610000337940) at /home/pavlo/Documents/Peloton/Github/terrier/src/include/execution/vm/bytecode_handlers.h:1275
#14 0x0000555556c0828b in terrier::execution::vm::VM::Interpret (this=0x7ffce2f36590, ip=0x61d000a64aec <incomplete sequence \333>, frame=0x7ffce2f365d0)
    at /home/pavlo/Documents/Peloton/Github/terrier/src/execution/vm/vm.cpp:1611
#15 0x0000555556beb7ab in terrier::execution::vm::VM::InvokeFunction (module=0x60400523d0d0, func_id=3, args=0x7ffce2f36650 "\220\061\354\001 `") at /home/pavlo/Documents/Peloton/Github/terrier/src/execution/vm/vm.cpp:112
#16 0x00007ffff2c06032 in ?? ()
#17 0x0000602001ec3190 in ?? ()
#18 0x000060a00053af40 in ?? ()
#19 0x00007ffce2f366e0 in ?? ()
#20 0x0000555556ab326a in terrier::execution::sql::ThreadStateContainer::AccessCurrentThreadState (this=0xfff9c5e6f0c) at /home/pavlo/Documents/Peloton/Github/terrier/src/execution/sql/thread_state_container.cpp:87
#21 0x000055555792b5b5 in tbb::interface9::internal::start_for<tbb::blocked_range<unsigned int>, terrier::execution::sql::(anonymous namespace)::ScanTask, tbb::auto_partitioner const>::run_body (this=0x7ffff1037d40, r=...)
    at /usr/include/tbb/parallel_for.h:102
#22 0x000055555792b164 in tbb::interface9::internal::balancing_partition_type<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type> >::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range<unsigned int>, terrier::execution::sql::(anonymous namespace)::ScanTask, tbb::auto_partitioner const>, tbb::blocked_range<unsigned int> > (this=0x7ffff1037d90, start=..., range=...)
    at /usr/include/tbb/partitioner.h:429
#23 0x000055555792af75 in tbb::interface9::internal::partition_type_base<tbb::interface9::internal::auto_partition_type>::execute<tbb::interface9::internal::start_for<tbb::blocked_range<unsigned int>, terrier::execution::sql::(anonymous namespace)::ScanTask, tbb::auto_partitioner const>, tbb::blocked_range<unsigned int> > (this=0x7ffff1037d90, start=..., range=...) at /usr/include/tbb/partitioner.h:255
#24 0x000055555792ad2e in tbb::interface9::internal::start_for<tbb::blocked_range<unsigned int>, terrier::execution::sql::(anonymous namespace)::ScanTask, tbb::auto_partitioner const>::execute (this=0x7ffff1037d40)
    at /usr/include/tbb/parallel_for.h:127
#25 0x00007ffff6591b46 in ?? () from /usr/lib/x86_64-linux-gnu/libtbb.so.2
#26 0x00007ffff658aaf8 in ?? () from /usr/lib/x86_64-linux-gnu/libtbb.so.2
#27 0x00007ffff65893db in ?? () from /usr/lib/x86_64-linux-gnu/libtbb.so.2
#28 0x00007ffff6585512 in ?? () from /usr/lib/x86_64-linux-gnu/libtbb.so.2
#29 0x00007ffff6585769 in ?? () from /usr/lib/x86_64-linux-gnu/libtbb.so.2
#30 0x00007ffff6c026db in start_thread (arg=0x7ffce2f38700) at pthread_create.c:463
#31 0x00007ffff5311a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
lmwnshn commented 3 years ago

This is not timestamp-related (I had thought it might be caused by how timestamps are handled).

mbutrovich commented 3 years ago

I think I'm hitting this while trying to run chbenchmark. Going to spend a couple hours tonight trying to understand it.