TsinghuaDatabaseGroup / DB-GPT

An LLM Based Diagnosis System (https://arxiv.org/pdf/2312.01454.pdf)
http://dbgpt.dbmind.cn/
Apache License 2.0

How is workload_sqls obtained in main.py #71

Open andyyosshi opened 8 months ago

andyyosshi commented 8 months ago

Hello,

I am trying to understand the workload_sqls that are fed into main.py. Could you explain how these SQL statements are derived?

Is the information being extracted from the pg_stat_statements view?

zhouxh19 commented 8 months ago

Yes, you can obtain workload_sqls from pg_stat_statements by filtering out queries over the 'postgres' database (https://github.com/TsinghuaDatabaseGroup/DB-GPT/blob/main/utils/database.py#L330).
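As a minimal sketch of the idea above (not the code in utils/database.py), the statements can be read from pg_stat_statements joined with pg_database and then narrowed to one database. The names `WORKLOAD_QUERY` and `select_workload_sqls`, the `limit` parameter, and the choice of which database to keep are assumptions made for illustration. The column `total_exec_time` exists in PostgreSQL 13 and later; older versions call it `total_time`.

```python
from typing import Iterable, List, Mapping

# Assumed query: one row per tracked statement, with the database name attached.
# total_exec_time is the PostgreSQL 13+ column name (total_time before that).
WORKLOAD_QUERY = """
SELECT d.datname, s.query, s.calls, s.total_exec_time
FROM pg_stat_statements AS s
JOIN pg_database AS d ON d.oid = s.dbid
"""


def select_workload_sqls(
    rows: Iterable[Mapping[str, object]],
    dbname: str,
    limit: int = 50,
) -> List[str]:
    """Keep statements recorded for `dbname`, most expensive first.

    `rows` are mappings with the keys returned by WORKLOAD_QUERY
    (hypothetical helper; which database to keep is the caller's choice).
    """
    kept = [row for row in rows if row["datname"] == dbname]
    # Rank by cumulative execution time so the heaviest statements come first.
    kept.sort(key=lambda row: float(row["total_exec_time"]), reverse=True)
    return [str(row["query"]) for row in kept[:limit]]
```

With psycopg2, for example, running `WORKLOAD_QUERY` through a `psycopg2.extras.RealDictCursor` yields rows in the mapping shape this helper expects.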

However, as mentioned in the README, pg_stat_statements does not support time-period-based filtering, so the obtained queries may not fall within the anomaly period. A better choice is to (1) directly extract queries from the anomaly-triggering scripts (for testing only) or (2) use advanced tools in commercial databases (e.g., the AWR report in Oracle).
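The limitation comes from pg_stat_statements keeping cumulative counters rather than per-execution timestamps. One workaround that is not proposed in this thread, offered here only as a hedged sketch, is to take a snapshot of (queryid, query, calls) before and after the period of interest and keep the statements whose call count grew. The `Snapshot` type and the `queries_in_window` function are hypothetical names, and the approach only helps if snapshots are already being taken around the anomaly.

```python
from typing import List, Mapping, Tuple

# Assumed snapshot shape: queryid -> (query text, cumulative calls).
Snapshot = Mapping[int, Tuple[str, int]]


def queries_in_window(before: Snapshot, after: Snapshot) -> List[Tuple[str, int]]:
    """Return (query, calls made between the two snapshots), busiest first."""
    active = []
    for queryid, (query, calls_after) in after.items():
        _, calls_before = before.get(queryid, (query, 0))
        if calls_after >= calls_before:
            delta = calls_after - calls_before
        else:
            # Counter went backwards: statistics were reset or the entry was
            # evicted and re-created, so count only what is visible now.
            delta = calls_after
        if delta > 0:
            active.append((query, delta))
    active.sort(key=lambda item: item[1], reverse=True)
    return active
```

Statements that ran only once between snapshots still appear, but their exact execution times within the window remain unknown.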

Queries are often the most critical input for finding useful root causes. Thus, if you have any better strategies for extracting representative historical queries during a time period (as we know, logging all the queries can be unaffordable), please let us know :-)

andyyosshi commented 8 months ago

Thank you for your detailed response, it was very informative.

Firstly, I appreciate your clear explanation on how to obtain workload_sqls from pg_stat_statements. Your advice on filtering queries on the 'postgres' database was particularly useful.

I also noted your point about the lack of time-based filtering. I do not yet fully understand the strategy for extracting representative historical queries over a specific period.

Therefore, I would like to further explore this matter. Thank you very much.