delete by Kudu pk column seems using table scan

lightsailpro commented 5 years ago

I am trying to figure out why a single-record DELETE by the Kudu primary key columns needs a table scan in Presto; it takes a couple of minutes on a table with about 80 million records. As I understand it, a delete by primary key in Kudu should be very quick. Is there any Kudu connector setting to control this behavior? By the way, I am using Kudu 1.4 with Presto 0.214. See below for the test results. Thanks in advance.

=================================================
presto:default> describe test1;
     Column      |   Type    |                      Extra                      |
-----------------+-----------+-------------------------------------------------+
 usage_token     | varchar   | primary_key, encoding=auto, compression=default |
 access_datetime | timestamp | primary_key, encoding=auto, compression=default |
 kafka_uuid      | varchar   | primary_key, encoding=auto, compression=default |
 ingestion_dt    | varchar   | nullable, encoding=auto, compression=default    |

--retrieve is quick by pk

presto:default> select usage_token, access_datetime, kafka_uuid
             -> from test1
             -> where usage_token = '1000a052-c2d7-42da-a71d-fe5d0299dc38' and access_datetime = cast('2018-04-30 04:04:19.019' as timestamp) and kafka_uuid = '08d65698-54f0-b6a4-7c23-01631594c87e';
             usage_token              |     access_datetime     |              kafka_uuid
--------------------------------------+-------------------------+--------------------------------------
 1000a052-c2d7-42da-a71d-fe5d0299dc38 | 2018-04-30 04:04:19.019 | 08d65698-54f0-b6a4-7c23-01631594c87e
(1 row)

Query 20181219_155906_00007_aput5, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [1 rows, 3B] [2 rows/s, 6B/s]

--single record delete on pk is using table scan and takes a couple minutes

presto:default> delete
             -> from test1
             -> where usage_token = '1000a052-c2d7-42da-a71d-fe5d0299dc38' and access_datetime = cast('2018-04-30 04:04:19.019' as timestamp) and kafka_uuid = '08d65698-54f0-b6a4-7c23-01631594c87e';

presto:default> delete from test1 where usage_token = '0000a052-c2d7-42da-a71d-fe5d0299dc38' and access_datetime = cast('2018-04-30 04:04:19.019' as timestamp) and kafka_uuid = '08d65698-54f0-b6a4-7c23-01631594c87e';

Query 20181219_160900_00009_aput5, RUNNING, 9 nodes, 25 splits
0:09 [5.55M rows, 21.2MB] [ 603K rows/s, 2.3MB/s] [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 0%

     STAGES   ROWS  ROWS/s  BYTES  BYTES/s  QUEUED    RUN   DONE
0.........R      0       0     0B       0B       0     17      0
  1.......R  5.55M    603K  21.2M     2.3M       0      8      0

Query 20181219_160119_00008_aput5, FINISHED, 9 nodes
Splits: 25 total, 25 done (100.00%)
2:32 [76.8M rows, 293MB] [505K rows/s, 1.93MB/s]

presto:default>
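
To make the expectation concrete, here is a toy Python model of the difference between a keyed delete and a scan-based delete. It is only a sketch of the general idea, not how Kudu or the Presto connector work internally; the function names and the synthetic rows are made up for illustration.

```python
# Toy model: rows are (usage_token, access_datetime, kafka_uuid) tuples,
# kept sorted by the full composite key, as a primary-key index would be.
# This does NOT reflect Kudu or Presto internals; it only contrasts the
# amount of work done by the two strategies.

def delete_by_scan(rows, key):
    """Examine every row and keep all but the match; returns (remaining, rows_examined)."""
    examined = 0
    kept = []
    for row in rows:
        examined += 1
        if row != key:
            kept.append(row)
    return kept, examined


def delete_by_key(sorted_rows, key):
    """Binary-search the sorted rows for the key; returns (remaining, probes)."""
    probes = 0
    lo, hi = 0, len(sorted_rows)
    while lo < hi:
        probes += 1
        mid = (lo + hi) // 2
        if sorted_rows[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sorted_rows) and sorted_rows[lo] == key:
        return sorted_rows[:lo] + sorted_rows[lo + 1:], probes
    return sorted_rows, probes


# Synthetic data; zero-padded tokens keep the list in key order.
rows = [
    (f"token-{i:06d}", "2018-04-30 04:04:19.019", f"uuid-{i:06d}")
    for i in range(100_000)
]
target = rows[54_321]

_, examined = delete_by_scan(rows, target)
_, probes = delete_by_key(rows, target)
print(f"scan-based delete examined {examined} rows")
print(f"keyed delete needed {probes} probes")
```

The scan version touches every row regardless of the predicate, which is the pattern the 76.8M-row DELETE above appears to show, while the keyed version needs only a logarithmic number of comparisons.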

kokosing commented 5 years ago

CC: @MartinWeindel

kokosing commented 5 years ago

@lightsailpro I think you should raise this issue in https://github.com/prestodb/presto, as the Kudu connector is a native Presto connector.

MartinWeindel commented 5 years ago

@lightsailpro I have tried to reproduce the problem, but in my example it uses a ScanFilter and not a table scan.

Please run the following command and provide the output:

explain delete from test1 where usage_token = '0000a052-c2d7-42da-a71d-fe5d0299dc38' and access_datetime = cast('2018-04-30 04:04:19.019' as timestamp) and kafka_uuid = '08d65698-54f0-b6a4-7c23-01631594c87e';

It would also be helpful to have the create table statement, which you get with SHOW CREATE TABLE test1;

As kokosing already said, please open an issue in https://github.com/prestodb/presto
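
If it helps when filing the issue, the two diagnostic statements requested above can be assembled for any key with a small helper. This is a minimal sketch: `diagnostic_statements` is a hypothetical name, the values are passed as ready-made SQL expressions, and the helper only builds strings to paste into the Presto CLI (it does no escaping and runs nothing).

```python
def diagnostic_statements(table, key_expressions):
    """Build the EXPLAIN DELETE and SHOW CREATE TABLE statements for a table.

    key_expressions maps each primary key column to a SQL expression
    (already quoted or cast as needed). Hypothetical helper for illustration.
    """
    predicate = " and ".join(
        f"{column} = {expression}" for column, expression in key_expressions.items()
    )
    return [
        f"explain delete from {table} where {predicate};",
        f"show create table {table};",
    ]


statements = diagnostic_statements(
    "test1",
    {
        "usage_token": "'0000a052-c2d7-42da-a71d-fe5d0299dc38'",
        "access_datetime": "cast('2018-04-30 04:04:19.019' as timestamp)",
        "kafka_uuid": "'08d65698-54f0-b6a4-7c23-01631594c87e'",
    },
)
for statement in statements:
    print(statement)
```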

MartinWeindel commented 5 years ago

Closed, as the requested details were not provided.

MartinWeindel / presto-kudu

delete by Kudu pk column seems using table scan #13