ISG-ICS / cloudberry

Big Data Visualization
http://cloudberry.ics.uci.edu
91 stars 82 forks source link

[AsterixDB] Make SELECT COUNT(*) query faster in AsterixDB #790

Closed baiqiushi closed 3 years ago

baiqiushi commented 4 years ago

Overview

Since late 2017, AsterixDB has introduced a feature which allows user to create a primary index on a table (called dataset in AsterixDB), so that a select count(*) from table; query will not need to scan the whole table but only read the primary index.

This article will describe an experiment of how to use this feature and what's the performance gain of it.

Preparation

Create a shell script pingAsterixDB.sh with the following code:

#!/usr/bin/env bash
#===============================================================================
#
#          FILE: pingAsterixDB.sh
#
#         USAGE: ./pingAsterixDB.sh -a [asterixURL] -s [sql++ statement] -i [interval(s)]
#                           
#   DESCRIPTION: Ping AsterixDB with given sql statement and interval between pings.
#                 * EXAMPLE:
#                   ./pingAsterixDB.sh -a http://localhost:19002/query/service \
#                                      -s "select count(*) from Metadata.\'Dataset\'" \
#                                      -i 1
#
#       OPTIONS: 
#  REQUIREMENTS: (1) Python 2.7
#          BUGS: ---
#         NOTES: ---
#        AUTHOR: Qiushi Bai (), baiqiushi@gmail.com
#  ORGANIZATION: ics.uci.edu
#       CREATED: 4/3/2020 10:00 AM PST
#      REVISION:  ---
#===============================================================================

set -o nounset                              # Treat unset variables as an error
asterixURL="http://localhost:19002/query/service"
sql="select count(*) from Metadata.\`Dataset\`"
interval=1

while getopts a:s:i: option
do
    case "${option}"
        in
        a) asterixURL=${OPTARG};;
        s) sql=${OPTARG};;
        i) interval=${OPTARG};;
    esac
done

pingAsterixDB()
{
    echo "Ping AsterixDB (${asterixURL}) with query \"${sql}\"  ..."
    local result=$(curl -v -s --data-urlencode "statement=${sql};" --data "mode=immediate" ${asterixURL} 2>/dev/null)
    #echo "Result is :" ${result}
    local status=$(echo $result | python -c "import sys, json; print json.load(sys.stdin)['status']" 2>/dev/null)
    echo "Query status = " ${status}
    local qtime=$(echo $result | python -c "import sys, json; print json.load(sys.stdin)['metrics']['elapsedTime']" 2>/dev/null)
    echo "Query time = " ${qtime}
}

while true
do
#   date
    pingAsterixDB
    sleep ${interval}
done

Test the performance BEFORE creating a primary index

Run pingAsterixDB.sh with given the AsterixDB query URL, sql query select count(*) from twitter.ds_tweet; and interval 1 second between each ping. (We use the dataset ds_tweet under dataverse twitter)

./pingAsterixDB.sh -a http://localhost:19002/query/service \
                                  -s "select count(*) from twitter.ds_tweet" \
                                  -i 1

Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB

1 2 3 4 5 avg
query time (seconds) 2.90 2.90 2.84 2.89 2.91 2.89

Result against 50M dataset on TwitterMap backup server AsterixDB

1 2 3 4 5 avg
query time (seconds) 57.17 55.62 56.14 56.23 56.15 56.26

Create a primary index on AsterixDB

The following SQL will create a primary index called foo on the dataset.

CREATE PRIMARY INDEX foo ON twitter.ds_tweet;

Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB

curl -v --data-urlencode "statement=CREATE PRIMARY INDEX foo ON twitter.ds_tweet;" \
>           --data mode=async                      \
>           http://localhost:19002/query/service
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 19002 (#0)
> POST /query/service HTTP/1.1
> Host: localhost:19002
> User-Agent: curl/7.61.1
> Accept: */*
> Content-Length: 78
> Content-Type: application/x-www-form-urlencoded
> 
* upload completely sent off: 78 out of 78 bytes
< HTTP/1.1 200 OK
< transfer-encoding: chunked
< connection: keep-alive
< content-type: application/json; charset=utf-8
< Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept
< 
{
        "requestID": "67cb5b08-eb46-455e-8bef-22bc80ee467e",
        "plans":{},
        "metrics": {
                "elapsedTime": "4.359672063s",
                "executionTime": "4.359406555s",
                "resultCount": 0,
                "resultSize": 0,
                "processedObjects": 0
        }
}
* Connection #0 to host localhost left intact

Result against 50M dataset on TwitterMap backup server AsterixDB

curl -v --data-urlencode "statement=CREATE PRIMARY INDEX foo ON twitter.ds_tweet;" --data mode=async http://localhost:19002/query/service
*   Trying ::1...
* Connected to localhost (::1) port 19002 (#0)
> POST /query/service HTTP/1.1
> Host: localhost:19002
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 78
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 78 out of 78 bytes
< HTTP/1.1 200 OK
< transfer-encoding: chunked
< connection: keep-alive
< content-type: application/json; charset=utf-8
< Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept
<
{
        "requestID": "dbc564be-f0e8-4a4e-9f63-c66082f8c947",
        "plans":{},
        "metrics": {
                "elapsedTime": "87.664796114s",
                "executionTime": "87.664040995s",
                "resultCount": 0,
                "resultSize": 0,
                "processedObjects": 0
        }
}
* Connection #0 to host localhost left intact

Test the performance AFTER creating a primary index

Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB

1 2 3 4 5 avg
query time (seconds) 1.78 1.79 1.70 1.78 1.79 1.77

Result against 50M dataset on TwitterMap backup server AsterixDB

1 2 3 4 5 avg
query time (seconds) 13.49 13.27 13.36 13.38 13.09 13.32
baiqiushi commented 3 years ago

Done.