Create table error: java.net.SocketTimeoutException: Read timed out

kangkaisen commented 5 years ago

Describe the bug After I cherry-picked https://github.com/apache/incubator-doris/commit/f1b673503e4f9843616950feedb7d0f1ed2cf084 commit to my internal Doris branch and restarted Doris BE.

The Doris BE query is normal, but when I created the table, there was java.net.SocketTimeoutException: Read timed out, even if I have tried many times.

Today, I removed this commit and compile Doris BE, the creating table operation is normal.

I wanted to reproduce this issue in my dev Doris cluster, but I haven't. I guess we need a large Doris cluster to produce this issue.

kangpinghuang commented 5 years ago

hi @kangkaisen , Is there any failure log in be?

kangkaisen commented 5 years ago

hi @kangkaisen , Is there any failure log in be?

I remember there was not any failure log in be.

kangkaisen commented 5 years ago

When I restarted a cluster at 1.9 night , this issue happened again. So I deep into this issue yesterday.

kangkaisen commented 5 years ago

The FE error Log: 2019-01-10 14:00:22,198 WARN 2142 [AgentBatchTask.run():133] task exec error. backend[10001] org.apache.thrift.transport.TTransportException at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3] at org.apache.doris.thrift.BackendService$Client.recv_submit_tasks(BackendService.java:256) ~[palo-fe.jar:?] at org.apache.doris.thrift.BackendService$Client.submit_tasks(BackendService.java:243) ~[palo-fe.jar:?] at org.apache.doris.task.AgentBatchTask.run(AgentBatchTask.java:124) [palo-fe.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_112] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_112] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112] at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112] And no any create table log in BE.

the exception in TIOStreamTransport.java:132 is

if (bytesRead < 0) { throw new TTransportException(TTransportException.END_OF_FILE); }

I tried reproducing this issue in our dev cluster:

kill -9 all BE precesses
when all BE thrift server start, create the table by mysql client. the TTransportException will occur and create table will timeout.

I use netstat -antp | grep 9060 check the TCP connection status, I found the BE TCP connection status is FIN_WAIT_2 and FE TCP connection status is CLOSE_WAIT.

So when we use the BackendService.Client that TCP connection is CLOSE_WAIT, the TTransportException will occur.

kangkaisen commented 5 years ago

So I think when the TCP connection is CLOSE_WAIT status, we should close the old TCP connection and create a new TCP connection.

morningman commented 5 years ago

You can check these 2 commits: https://github.com/apache/incubator-doris/commit/b5737ee59a4fbced6ca61b748216fb15140cbe99 https://github.com/apache/incubator-doris/commit/ac01da49847a5ad7e584fe583f467d2e91ef7bf9

kangkaisen commented 5 years ago

You can check these 2 commits: b5737ee ac01da4

Thanks you. I think ac01da49847a5a is enough. https://github.com/apache/incubator-doris/pull/408 is more better than my PR. I will test it.

kangkaisen commented 5 years ago

After cherry-pick https://github.com/apache/incubator-doris/pull/570, Creating table or partition still need a long time (several minutes)， which is abnormal.

kangkaisen commented 5 years ago

After I debug in our prod env with pstack and gdb, I found the start_trash_sweep operation hold the _store_lock mutex.


  #0  0x00007f957083f855 in __getdents64 () from /lib64/libc.so.6
#1  0x00007f957083f5a6 in readdir_r () from /lib64/libc.so.6
#2  0x000000000159e5ae in boost::filesystem::detail::directory_iterator_increment(boost::filesystem::directory_iterator&, boost::system::error_code*) ()
#3  0x0000000000c67edc in increment (this=<optimized out>)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/filesystem/operations.hpp:939
#4  increment<boost::filesystem::directory_iterator> (f=...)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:555
#5  operator++ (this=<optimized out>)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:665
#6  increment (ec=0x0, this=0xd7171b960)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/filesystem/operations.hpp:1101
#7  increment (this=<synthetic pointer>)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/filesystem/operations.hpp:1285
#8  increment<boost::filesystem::recursive_directory_iterator> (f=...)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:555
#9  operator++ (this=<synthetic pointer>)
    at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:665
#10 doris::OLAPEngine::_get_root_path_capacity (this=this@entry=0x64e7c00, root_path=..., data_used=data_used@entry=0xc7edc1170,
    disk_available=disk_available@entry=0xc7edc1168) at /home/kangkaisen/palo/be/src/olap/olap_engine.cpp:588
#11 0x0000000000c68512 in doris::OLAPEngine::get_all_root_path_info (this=this@entry=0x64e7c00,
    root_paths_info=root_paths_info@entry=0x7f9561b09660) at /home/kangkaisen/palo/be/src/olap/olap_engine.cpp:429
#12 0x0000000000c686b3 in doris::OLAPEngine::start_trash_sweep (this=this@entry=0x64e7c00, usage=usage@entry=0x7f9561b0bf58)
    at /home/kangkaisen/palo/be/src/olap/olap_engine.cpp:1740
#13 0x0000000000c83135 in doris::OLAPEngine::_garbage_sweeper_thread_callback (this=0x64e7c00, arg=arg@entry=0x0)
    at /home/kangkaisen/palo/be/src/olap/olap_server.cpp:167
#14 0x0000000000c832df in operator() (__closure=<optimized out>) at /home/kangkaisen/palo/be/src/olap/olap_server.cpp:46
#15 __invoke_impl<void, doris::OLAPEngine::_start_bg_worker()::<lambda()> > (__f=...)
    at /usr/local/include/c++/7.2.0/bits/invoke.h:60
#16 __invoke<doris::OLAPEngine::_start_bg_worker()::<lambda()> > (__fn=...) at /usr/local/include/c++/7.2.0/bits/invoke.h:95
#17 _M_invoke<0> (this=<optimized out>) at /usr/local/include/c++/7.2.0/thread:234
#18 operator() (this=<optimized out>) at /usr/local/include/c++/7.2.0/thread:243

kangkaisen commented 5 years ago

I changed the conf min_garbage_sweep_interval to 86400. the elapsed time for creating table or partition became normal.

chaoyli commented 5 years ago

Do you have this commit 7ac011571fde135b622660ecb723814f4f4a78ea, which is used to improve trash_sweep performance. https://github.com/apache/incubator-doris/pull/349 Cherry-pick this commit will address the question.

kangkaisen commented 5 years ago

@chaoyli I don't have https://github.com/apache/incubator-doris/commit/7ac011571fde135b622660ecb723814f4f4a78ea commit. I will try it, thanks you.

kangkaisen commented 5 years ago

After https://github.com/apache/incubator-doris/pull/466/commits/cc943b3abfcb010d213bbd495803eb20f0182ee5, https://github.com/apache/incubator-doris/commit/ac01da49847a5ad7e584fe583f467d2e91ef7bf9, https://github.com/apache/incubator-doris/pull/570/commits/bf70e303a22298162100525238270c4305dd702c, https://github.com/apache/incubator-doris/commit/7ac011571fde135b622660ecb723814f4f4a78ea, This issue has been fixed.

apache / doris

Create table error: java.net.SocketTimeoutException: Read timed out #458