4paradigm / OpenMLDB

OpenMLDB is an open-source machine learning database that provides a feature platform computing consistent features for training and inference.
https://openmldb.ai
Apache License 2.0

Abnormal average time when querying different data volumes for the same key #3871

Open gaoboal opened 7 months ago

gaoboal commented 7 months ago

Description

During query performance testing, we found that the average query time for a single key grows as the result set shrinks: querying all rows for the key is fastest, querying the rows within a specified time range is slower, and querying the row at one specific timestamp is slowest.

Detail: the TalkingData train dataset (180+ million rows) was imported into an OpenMLDB table; the query key is ip=88.

| key | other condition | rows | average time (µs) |
| --- | --- | --- | --- |
| ip=88 | all | 4278 | 9774.375 |
| ip=88 | '2017-11-06 00:00:00' <= ts < '2017-11-07 00:00:00' | 183 | 11570.365 |
| ip=88 | ts = '2017-11-06 16:19:38' | 1 | 16504.145 |

More detailed results: https://qiok3h8ob4.feishu.cn/docx/YkYfdBZm9oVk0MxLFx9co8lLn1g?from=from_copylink

Expected Behavior

Querying a smaller amount of data should take less time, or at least no longer than querying a larger amount of data.

Steps to Reproduce

  1. Deploy OpenMLDB.
  2. Load the data (TalkingData train.csv) into a table.
  3. Find a key with a sufficiently large number of rows.
  4. Execute the queries and calculate the average time cost.
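
Step 4 could be scripted roughly as below. This is only a sketch: `run_query` is a hypothetical zero-argument callable standing in for whatever client call executes one query, and `repeats` is an assumed sample count; neither comes from the original report.

```python
import time
from statistics import mean
from typing import Callable, List


def average_latency_us(run_query: Callable[[], object], repeats: int = 100) -> float:
    """Run `run_query` `repeats` times and return the mean wall-clock time in microseconds.

    `run_query` is a hypothetical stand-in for the real client call.
    """
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        run_query()
        # Convert elapsed nanoseconds to microseconds.
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return mean(samples)
```

Calling it once per condition (all rows, time range, single timestamp) with the same `repeats` keeps the three averages comparable.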

aceforeverd commented 7 months ago

This is a storage design limitation. All records for the same key are sought linearly. Records are ordered by ts value, with the largest ts first, so reaching a given ts means walking past every newer record for that key.
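
The linear seek described above can be illustrated with a toy model. This is a simplified sketch, not OpenMLDB's actual storage code: the plain Python list, the `seek_linear` helper, and the integer timestamps are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def seek_linear(ts_desc: List[int], target: int) -> Tuple[Optional[int], int]:
    """Walk a newest-first list of timestamps until `target` is found or passed.

    Returns (index or None, number of records visited).
    """
    steps = 0
    for i, ts in enumerate(ts_desc):
        steps += 1
        if ts == target:
            return i, steps
        if ts < target:
            break  # already past the position where target would be
    return None, steps


# Hypothetical per-key record list: 4278 timestamps, largest first.
records = list(range(4278, 0, -1))

_, steps_newest = seek_linear(records, 4278)  # newest record: found on the first visit
_, steps_oldest = seek_linear(records, 1)     # oldest record: every record is visited
```

In this model the seek cost depends on how far the target ts lies from the newest record, not on how many rows the query returns. The model does not by itself account for the full-scan query being the fastest of the three.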