apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0

DRILL-8489: Sender memory leak when rpc encode exception #2901

Closed shfshihuafeng closed 2 months ago

shfshihuafeng commented 2 months ago

Description

When encode throws an exception, Netty releases msg if it is an instance of ReferenceCounted. Drill, however, converts the message to an OutboundRpcMessage, so Netty cannot release it and the sender's memory leaks.

We can reproduce this scenario with breakpoints and an added debug log; see Test 1 under Testing.
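
To make the mechanism concrete, here is a minimal, self-contained sketch of the leak and of one possible guard. Every class in it (CountedBuf, WrapperMessage, EncodeLeakSketch) is a hypothetical stand-in written for illustration; none of them is a Netty or Drill class, and the guard shown is not necessarily the change made in this PR. CountedBuf plays the role of a reference-counted buffer and WrapperMessage plays the role of a wrapper that holds buffers but is not itself reference counted.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: hypothetical stand-ins, NOT Netty or Drill classes. */
class EncodeLeakSketch {

  /** Minimal reference-counted buffer (stand-in for a ByteBuf). */
  static final class CountedBuf {
    private int refCnt = 1;

    void release() {
      if (refCnt <= 0) {
        throw new IllegalStateException("already released");
      }
      refCnt--;
    }

    int refCnt() {
      return refCnt;
    }
  }

  /** Holds buffers but is not reference counted itself (stand-in for the wrapper message). */
  static final class WrapperMessage {
    final List<CountedBuf> bodies = new ArrayList<>();

    void releaseBodies() {
      for (CountedBuf b : bodies) {
        if (b.refCnt() > 0) {
          b.release();
        }
      }
    }
  }

  /** Always fails, like the injected exception in Test 1. */
  static void failingEncode(WrapperMessage msg) {
    throw new IllegalStateException("simulated encode failure");
  }

  /** Leaky path: the failure is reported, but nobody releases the wrapped buffers. */
  static boolean encodeLeaky(WrapperMessage msg) {
    try {
      failingEncode(msg);
      return true;
    } catch (RuntimeException e) {
      return false;
    }
  }

  /** Guarded path: release the wrapped buffers before reporting the failure. */
  static boolean encodeGuarded(WrapperMessage msg) {
    try {
      failingEncode(msg);
      return true;
    } catch (RuntimeException e) {
      msg.releaseBodies();
      return false;
    }
  }

  static WrapperMessage newMessage() {
    WrapperMessage msg = new WrapperMessage();
    msg.bodies.add(new CountedBuf());
    return msg;
  }

  public static void main(String[] args) {
    WrapperMessage leaked = newMessage();
    encodeLeaky(leaked);
    System.out.println("leaky refCnt = " + leaked.bodies.get(0).refCnt());    // prints 1

    WrapperMessage guarded = newMessage();
    encodeGuarded(guarded);
    System.out.println("guarded refCnt = " + guarded.bodies.get(0).refCnt()); // prints 0
  }
}
```

In the leaky path the buffer still has a reference count of 1 after the failed encode, which is the kind of outstanding reference the debug allocator reports in Test 1.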

Documentation

(Please describe user-visible changes similar to what should appear in the Drill documentation.)

Testing

Test 1

1. Set -Ddrill.memory.debug.allocator=TRUE

2. Add the following debug code:

DrillByteBufAllocator#buffer

    public ByteBuf buffer() {
      // Debug only: fail the allocation while the marker path exists.
      File file = new File("/data/shf/b.log");
      if (file.exists()) {
        throw new OutOfMemoryException("shf encode exception");
      }
      return buffer(DEFAULT_BUFFER_SIZE);
    }

3. Restart the drillbit

image

4. Run TPC-H query 8 (tpch sql8)

    select
      o_year,
      sum(case when nation = 'CHINA' then volume else 0 end) / sum(volume) as mkt_share
    from (
      select
        extract(year from o_orderdate) as o_year,
        l_extendedprice * 1.0 as volume,
        n2.n_name as nation
      from
        hive.tpch1s.part, hive.tpch1s.supplier, hive.tpch1s.lineitem, hive.tpch1s.orders,
        hive.tpch1s.customer, hive.tpch1s.nation n1, hive.tpch1s.nation n2, hive.tpch1s.region
      where
        p_partkey = l_partkey
        and s_suppkey = l_suppkey
        and l_orderkey = o_orderkey
        and o_custkey = c_custkey
        and c_nationkey = n1.n_nationkey
        and n1.n_regionkey = r_regionkey
        and r_name = 'ASIA'
        and s_nationkey = n2.n_nationkey
        and o_orderdate between date '1995-01-01' and date '1996-12-31'
        and p_type = 'LARGE BRUSHED BRASS'
    ) as all_nations
    group by o_year
    order by o_year;

5. Set a breakpoint in BroadcastSenderRootExec#innerNext, on the line tunnels[i].sendRecordBatch(batch);. Resume the program (F9 in IntelliJ IDEA) until memory has been allocated in the writableBatch object, as shown below.

image

6. Set a breakpoint in MessageToMessageEncoder#encode. Resume the program (F9 in IntelliJ IDEA) until the writableBatch from step 5 is being encoded.

image
7. Create the path "/data/shf/b.log" (mkdir) so that the debug code from step 2 throws.

8. Remove the breakpoints and resume.

9. Observe the memory leak.

    image

10. Check whether the leaked memory IDs match those allocated by writableBatch:

Allocator(frag:4:0) 3000000/1000000/4000512/30000000000 (res/actual/peak/limit)
  child allocators: 1
    Allocator(op:4:0:0:BroadcastSender) 1000000/53408/106816/10000000000 (res/actual/peak/limit)
      child allocators: 0
      ledgers: 5
        ledger[155] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 128, references: 1, life: 2050915810044022..0, allocatorManager: [130, life: 2050915807998314..0] holds 1 buffers.
            DrillBuf[156], udle: [132 0..128]
        ledger[159] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 4096, references: 1, life: 2050915810510561..0, allocatorManager: [138, life: 2050915808701687..0] holds 1 buffers.
            DrillBuf[160],
        ledger[161] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 32768, references: 1, life: 2050915810690813..0, allocatorManager: [134, life: 2050915808423055..0] holds 1 buffers.
            DrillBuf[162], udle: [135 0..32768]
            event log for: DrillBuf[162]
        ledger[160] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 16384, references: 1, life: 2050915810616308..0, allocatorManager: [136, life: 2050915808530627..0] holds 1 buffers.
            DrillBuf[161], udle: [137 0..16384]
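
To make the comparison in the last step less error prone, the ledger IDs and sizes can be pulled out of the dump mechanically. The helper below is a minimal sketch, not part of Drill: the class name LedgerDumpSketch and its methods are hypothetical, and the regular expression assumes the line layout of the dump above (ledger[ID] followed by size: N on the same line). The sample input is abbreviated from the four ledger lines shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: hypothetical helper for reading an allocator dump. */
class LedgerDumpSketch {

  // Assumes "ledger[ID] ... size: N" appears on a single line, as in the dump above.
  private static final Pattern LEDGER =
      Pattern.compile("ledger\\[(\\d+)\\].*?size: (\\d+)");

  /** Returns ledger id -> size in bytes, in the order the ledgers appear. */
  static Map<Integer, Long> ledgerSizes(String dump) {
    Map<Integer, Long> sizes = new LinkedHashMap<>();
    Matcher m = LEDGER.matcher(dump);
    while (m.find()) {
      sizes.put(Integer.parseInt(m.group(1)), Long.parseLong(m.group(2)));
    }
    return sizes;
  }

  /** Sums the sizes of all ledgers found. */
  static long total(Map<Integer, Long> sizes) {
    long sum = 0;
    for (long size : sizes.values()) {
      sum += size;
    }
    return sum;
  }

  /** Ledger lines abbreviated from the dump above. */
  static String sampleDump() {
    return "ledger[155] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 128, references: 1\n"
        + "ledger[159] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 4096, references: 1\n"
        + "ledger[161] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 32768, references: 1\n"
        + "ledger[160] allocator: op:4:0:0:BroadcastSender), isOwning: true, size: 16384, references: 1\n";
  }

  public static void main(String[] args) {
    Map<Integer, Long> sizes = ledgerSizes(sampleDump());
    System.out.println(sizes);        // {155=128, 159=4096, 161=32768, 160=16384}
    System.out.println(total(sizes)); // 53376
  }
}
```

The extracted IDs can then be compared by hand with the buffers observed on writableBatch in the debugger at step 5. Note that the dump reports five ledgers but only four are listed, so the total here covers only the listed ones.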

Test 2

  1. export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"2G"}
  2. Use the tpch 1s data set
  3. Use tpch sql 8
  4. This scenario is relatively easy to reproduce by running the following script:

    drill_home=/data/shf/apache-drill-1.22.0-SNAPSHOT/bin
    fileName=/data/shf/1s/shf.txt

    # Run query 8 in a loop until the stop file appears.
    random_sql(){
      # for i in `seq 1 3`
      while true
      do
        num=$((RANDOM%22+1))
        if [ -f $fileName ]; then
          echo "$fileName exists"
          exit 0
        else
          $drill_home/sqlline -u \"jdbc:drill:zk=jupiter-2:2181/drill_shf/jupiterbits_shf1\" -f tpch_sql8.sql >> sql8.log 2>&1
        fi
      done
    }

    main(){
      unset HADOOP_CLASSPATH
      # TPCH power test: start 25 concurrent query loops
      for i in `seq 1 25`
      do
        random_sql &
      done
    }