kangkaisen closed this issue 5 years ago.
Hi @kangkaisen, is there any failure log in the BE?
I remember there wasn't any failure log in the BE.
When I restarted a cluster on the night of January 9th, this issue happened again, so I dug into it yesterday.
The FE error log:
2019-01-10 14:00:22,198 WARN 2142 [AgentBatchTask.run():133] task exec error. backend[10001]
org.apache.thrift.transport.TTransportException
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77) ~[libthrift-0.9.3.jar:0.9.3]
        at org.apache.doris.thrift.BackendService$Client.recv_submit_tasks(BackendService.java:256) ~[palo-fe.jar:?]
        at org.apache.doris.thrift.BackendService$Client.submit_tasks(BackendService.java:243) ~[palo-fe.jar:?]
        at org.apache.doris.task.AgentBatchTask.run(AgentBatchTask.java:124) [palo-fe.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_112]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
And there was no create-table log at all in the BE.
The exception thrown at TIOStreamTransport.java:132 is:
if (bytesRead < 0) {
    throw new TTransportException(TTransportException.END_OF_FILE);
}
I tried to reproduce this issue in our dev cluster. Using netstat -antp | grep 9060 to check the TCP connection status, I found that the BE side of the connection was in FIN_WAIT_2 while the FE side was in CLOSE_WAIT.
So when the FE uses a BackendService.Client whose underlying TCP connection is in CLOSE_WAIT, the TTransportException occurs.
So I think that when the TCP connection is in CLOSE_WAIT, we should close the old connection and create a new one.
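A minimal sketch of that idea, assuming a hypothetical wrapper around the FE's Thrift client (the class and method names below are illustrative, not the actual Doris FE code):

import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

import org.apache.doris.thrift.BackendService;

/**
 * Hypothetical illustration: keep one cached BackendService.Client per BE, and
 * when an RPC fails with TTransportException -- which is what a half-closed
 * (CLOSE_WAIT) connection produces -- drop the stale socket, reconnect, and
 * retry the call once.
 */
public class ReconnectingBackendClient {
    /** The RPC to run against a live client, e.g. c -> c.submit_tasks(tasks). */
    public interface Rpc<T> {
        T call(BackendService.Client client) throws TException;
    }

    private final String host;
    private final int port;
    private final int timeoutMs;

    private TTransport transport;
    private BackendService.Client client;

    public ReconnectingBackendClient(String host, int port, int timeoutMs) {
        this.host = host;
        this.port = port;
        this.timeoutMs = timeoutMs;
    }

    private void open() throws TTransportException {
        transport = new TSocket(host, port, timeoutMs);
        transport.open();
        client = new BackendService.Client(new TBinaryProtocol(transport));
    }

    private void close() {
        if (transport != null) {
            transport.close();
        }
        transport = null;
        client = null;
    }

    public synchronized <T> T execute(Rpc<T> rpc) throws TException {
        if (client == null || !transport.isOpen()) {
            open();
        }
        try {
            return rpc.call(client);
        } catch (TTransportException e) {
            // The cached connection was most likely closed by the peer;
            // rebuild the socket and retry the RPC once.
            close();
            open();
            return rpc.call(client);
        }
    }
}

With something like this, AgentBatchTask would get a fresh connection instead of failing on the stale one; handling it inside the FE's client cache, as the PRs below do, amounts to the same idea.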
Thank you. I think ac01da49847a5a is enough. https://github.com/apache/incubator-doris/pull/408 is better than my PR. I will test it.
After cherry-picking https://github.com/apache/incubator-doris/pull/570, creating a table or partition still takes a long time (several minutes), which is abnormal.
After debugging in our prod env with pstack and gdb, I found that the start_trash_sweep operation holds the _store_lock mutex (backtrace below; a minimal illustration of the locking pattern follows it):
#0 0x00007f957083f855 in __getdents64 () from /lib64/libc.so.6
#1 0x00007f957083f5a6 in readdir_r () from /lib64/libc.so.6
#2 0x000000000159e5ae in boost::filesystem::detail::directory_iterator_increment(boost::filesystem::directory_iterator&, boost::system::error_code*) ()
#3 0x0000000000c67edc in increment (this=<optimized out>)
at /home/kangkaisen/palo/thirdparty/installed/include/boost/filesystem/operations.hpp:939
#4 increment<boost::filesystem::directory_iterator> (f=...)
at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:555
#5 operator++ (this=<optimized out>)
at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:665
#6 increment (ec=0x0, this=0xd7171b960)
at /home/kangkaisen/palo/thirdparty/installed/include/boost/filesystem/operations.hpp:1101
#7 increment (this=<synthetic pointer>)
at /home/kangkaisen/palo/thirdparty/installed/include/boost/filesystem/operations.hpp:1285
#8 increment<boost::filesystem::recursive_directory_iterator> (f=...)
at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:555
#9 operator++ (this=<synthetic pointer>)
at /home/kangkaisen/palo/thirdparty/installed/include/boost/iterator/iterator_facade.hpp:665
#10 doris::OLAPEngine::_get_root_path_capacity (this=this@entry=0x64e7c00, root_path=..., data_used=data_used@entry=0xc7edc1170,
disk_available=disk_available@entry=0xc7edc1168) at /home/kangkaisen/palo/be/src/olap/olap_engine.cpp:588
#11 0x0000000000c68512 in doris::OLAPEngine::get_all_root_path_info (this=this@entry=0x64e7c00,
root_paths_info=root_paths_info@entry=0x7f9561b09660) at /home/kangkaisen/palo/be/src/olap/olap_engine.cpp:429
#12 0x0000000000c686b3 in doris::OLAPEngine::start_trash_sweep (this=this@entry=0x64e7c00, usage=usage@entry=0x7f9561b0bf58)
at /home/kangkaisen/palo/be/src/olap/olap_engine.cpp:1740
#13 0x0000000000c83135 in doris::OLAPEngine::_garbage_sweeper_thread_callback (this=0x64e7c00, arg=arg@entry=0x0)
at /home/kangkaisen/palo/be/src/olap/olap_server.cpp:167
#14 0x0000000000c832df in operator() (__closure=<optimized out>) at /home/kangkaisen/palo/be/src/olap/olap_server.cpp:46
#15 __invoke_impl<void, doris::OLAPEngine::_start_bg_worker()::<lambda()> > (__f=...)
at /usr/local/include/c++/7.2.0/bits/invoke.h:60
#16 __invoke<doris::OLAPEngine::_start_bg_worker()::<lambda()> > (__fn=...) at /usr/local/include/c++/7.2.0/bits/invoke.h:95
#17 _M_invoke<0> (this=<optimized out>) at /usr/local/include/c++/7.2.0/thread:234
#18 operator() (this=<optimized out>) at /usr/local/include/c++/7.2.0/thread:243
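Presumably any request that needs the same _store_lock, such as registering the tablets for a new table or partition, has to wait out the whole recursive directory scan. A minimal Java illustration of that generic pattern, with purely hypothetical names and not the actual BE code (which is C++):

import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of the locking pattern described above, not the
// Doris BE code: a background sweep thread holds the shared store lock while
// walking a large directory tree, so createTablet() blocks until it finishes.
public class StoreLockContention {
    private static final ReentrantLock storeLock = new ReentrantLock();

    // Background thread: sweep trash while holding the store lock.
    static void trashSweep() {
        storeLock.lock();
        try {
            walkHugeDirectoryTree(); // slow recursive directory scan
        } finally {
            storeLock.unlock();
        }
    }

    // Foreground request: needs the same lock to register a new tablet.
    static void createTablet() {
        storeLock.lock(); // blocks for the whole duration of the scan above
        try {
            // ... register tablet metadata ...
        } finally {
            storeLock.unlock();
        }
    }

    private static void walkHugeDirectoryTree() {
        try {
            Thread.sleep(60_000); // stand-in for minutes of readdir() calls
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}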
I changed the conf min_garbage_sweep_interval to 86400, and the elapsed time for creating a table or partition became normal.
Do you have commit 7ac011571fde135b622660ecb723814f4f4a78ea, which improves trash_sweep performance? https://github.com/apache/incubator-doris/pull/349 Cherry-picking this commit will address the problem.
@chaoyli I don't have the https://github.com/apache/incubator-doris/commit/7ac011571fde135b622660ecb723814f4f4a78ea commit. I will try it, thank you.
After https://github.com/apache/incubator-doris/pull/466/commits/cc943b3abfcb010d213bbd495803eb20f0182ee5, https://github.com/apache/incubator-doris/commit/ac01da49847a5ad7e584fe583f467d2e91ef7bf9, https://github.com/apache/incubator-doris/pull/570/commits/bf70e303a22298162100525238270c4305dd702c, and https://github.com/apache/incubator-doris/commit/7ac011571fde135b622660ecb723814f4f4a78ea, this issue has been fixed.
Describe the bug
After I cherry-picked https://github.com/apache/incubator-doris/commit/f1b673503e4f9843616950feedb7d0f1ed2cf084 to my internal Doris branch and restarted the Doris BE, queries were normal, but creating a table failed with java.net.SocketTimeoutException: Read timed out, even after many retries.
Today I removed this commit and recompiled the Doris BE, and creating tables works normally again.
I wanted to reproduce this issue in my dev Doris cluster, but I haven't been able to. I guess we need a large Doris cluster to reproduce it.