apache / drill

Apache Drill is a distributed MPP query layer for self describing data
https://drill.apache.org/
Apache License 2.0
1.93k stars 980 forks source link

How do you configure "Actively Used Direct (Estimate)" memory in "Metrics" tab of Drill Web UI? #2789

Open wschung1113 opened 1 year ago

wschung1113 commented 1 year ago

Hello,

I am deploying drillbits as kubernetes pods on my kubernetes cluster. However, my drillbit pod doesn't seem like it's using direct memory for querying quite a large data set (180 million rows, 27 gb of size) as below: image

Reading the official documents, I have configured drill-env.sh as such: image

Also the pod configurations: image

{{.Values.drill.memory}} is set to 4gbs at the moment.

Any clues where I should configure to use up some direct memory and increase performance of query execution?

Thanks for reading!

jnturton commented 1 year ago

A good place to find data relevant to query performance is at the query profile which can be found on the Profiles page of the web UI. What format is your data in? What kind of query are you running - grouping, joining, filtering?

wschung1113 commented 1 year ago

Thanks for a reply!

Most of the time is taken at query execution: image

Data format is just a ordinary row-based table from a rdbms (Oracle alike).

Query should look like this, rather simple query with an outer join and few group by clauses. However the data is relatively large so it takes about 4 minutes and I want to try to push it down.

WITH DEF_TABLE AS ( SELECT * FROM rdbms.hyperdata_ex.DO171 ) SELECT M_0 FROM ( SELECT L_SHIPINSTRUCT AS C0_0, L_SHIPMODE AS C1_0, SUM(L_EXTENDEDPRICE) AS M_0 FROM DEF_TABLE GROUP BY L_SHIPINSTRUCT, L_SHIPMODE ) TARGET

RIGHT OUTER JOIN ( SELECT * FROM ( SELECT L_SHIPINSTRUCT AS C0_0, L_SHIPINSTRUCT AS K0_0 FROM DEF_TABLE GROUP BY L_SHIPINSTRUCT ), ( SELECT L_SHIPMODE AS C1_0, L_SHIPMODE AS K1_0 FROM DEF_TABLE GROUP BY L_SHIPMODE ) ) BASE ON ( (TARGET.C0_0 = BASE.C0_0) AND (TARGET.C1_0 = BASE.C1_0) ) ORDER BY K0_0 ASC, K1_0 ASC

wschung1113 commented 1 year ago

"Direct Memory Usage" cap is set to 0GB eventhough I configured drill-env.sh... image

Not sure why... JVM heap memory looks like is configured via drill-env.sh image

jnturton commented 1 year ago

Okay.

  1. After doing some tests I think there may be a bug affecting the direct memory counter in the most recent version of Drill.
  2. If you hover over the direct memory usage info icon you'll see that, unlike for the heap, Drill reports current usage as a percentage of peak usage, not of the configured capacity.
  3. Your query is probably getting executed entirely, or almost, on the RDBMS. Drill tries to "push down" query execution to the source system when it can. You can prevent this by running two queries, first a CTAS that fetches data from DEF_TABLE without performing any operations and writes it somewhere, perhaps the temp workspace. Secondly, do all of the rest of the logic in a SELECT based on the table that was created by the CTAS. Note however that this will require 180m rows to be transferred from the RDBMS to Drill over JDBC which will itself need a lot of time (but may not always need to be repeated).

I'll report back here about the direct memory counter.

jnturton commented 1 year ago

An update on the direct memory counter - it's not broken. Can we close this issue now that information about direct memory usage being a percentage of peak usage has been provided? Please feel free to share a JSON query profile on Slack if you'd like to talk more about which operators are responsible for most of the query execution time.