Since late 2017, AsterixDB has introduced a feature which allows user to create a primary index on a table (called dataset in AsterixDB), so that a select count(*) from table; query will not need to scan the whole table but only read the primary index.
This article will describe an experiment of how to use this feature and what's the performance gain of it.
Preparation
Create a shell script pingAsterixDB.sh with the following code:
#!/usr/bin/env bash
#===============================================================================
#
# FILE: pingAsterixDB.sh
#
# USAGE: ./pingAsterixDB.sh -a [asterixURL] -s [sql++ statement] -i [interval(s)]
#
# DESCRIPTION: Ping AsterixDB with given sql statement and interval between pings.
# * EXAMPLE:
# ./pingAsterixDB.sh -a http://localhost:19002/query/service \
# -s "select count(*) from Metadata.\'Dataset\'" \
# -i 1
#
# OPTIONS:
# REQUIREMENTS: (1) Python 2.7
# BUGS: ---
# NOTES: ---
# AUTHOR: Qiushi Bai (), baiqiushi@gmail.com
# ORGANIZATION: ics.uci.edu
# CREATED: 4/3/2020 10:00 AM PST
# REVISION: ---
#===============================================================================
set -o nounset # Treat unset variables as an error
asterixURL="http://localhost:19002/query/service"
sql="select count(*) from Metadata.\`Dataset\`"
interval=1
while getopts a:s:i: option
do
case "${option}"
in
a) asterixURL=${OPTARG};;
s) sql=${OPTARG};;
i) interval=${OPTARG};;
esac
done
pingAsterixDB()
{
echo "Ping AsterixDB (${asterixURL}) with query \"${sql}\" ..."
local result=$(curl -v -s --data-urlencode "statement=${sql};" --data "mode=immediate" ${asterixURL} 2>/dev/null)
#echo "Result is :" ${result}
local status=$(echo $result | python -c "import sys, json; print json.load(sys.stdin)['status']" 2>/dev/null)
echo "Query status = " ${status}
local qtime=$(echo $result | python -c "import sys, json; print json.load(sys.stdin)['metrics']['elapsedTime']" 2>/dev/null)
echo "Query time = " ${qtime}
}
while true
do
# date
pingAsterixDB
sleep ${interval}
done
Test the performance BEFORE creating a primary index
Run pingAsterixDB.sh with given the AsterixDB query URL, sql query select count(*) from twitter.ds_tweet; and interval 1 second between each ping. (We use the dataset ds_tweet under dataverse twitter)
./pingAsterixDB.sh -a http://localhost:19002/query/service \
-s "select count(*) from twitter.ds_tweet" \
-i 1
Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB
1
2
3
4
5
avg
query time (seconds)
2.90
2.90
2.84
2.89
2.91
2.89
Result against 50M dataset on TwitterMap backup server AsterixDB
1
2
3
4
5
avg
query time (seconds)
57.17
55.62
56.14
56.23
56.15
56.26
Create a primary index on AsterixDB
The following SQL will create a primary index called foo on the dataset.
CREATE PRIMARY INDEX foo ON twitter.ds_tweet;
Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB
curl -v --data-urlencode "statement=CREATE PRIMARY INDEX foo ON twitter.ds_tweet;" \
> --data mode=async \
> http://localhost:19002/query/service
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 19002 (#0)
> POST /query/service HTTP/1.1
> Host: localhost:19002
> User-Agent: curl/7.61.1
> Accept: */*
> Content-Length: 78
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 78 out of 78 bytes
< HTTP/1.1 200 OK
< transfer-encoding: chunked
< connection: keep-alive
< content-type: application/json; charset=utf-8
< Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept
<
{
"requestID": "67cb5b08-eb46-455e-8bef-22bc80ee467e",
"plans":{},
"metrics": {
"elapsedTime": "4.359672063s",
"executionTime": "4.359406555s",
"resultCount": 0,
"resultSize": 0,
"processedObjects": 0
}
}
* Connection #0 to host localhost left intact
Result against 50M dataset on TwitterMap backup server AsterixDB
curl -v --data-urlencode "statement=CREATE PRIMARY INDEX foo ON twitter.ds_tweet;" --data mode=async http://localhost:19002/query/service
* Trying ::1...
* Connected to localhost (::1) port 19002 (#0)
> POST /query/service HTTP/1.1
> Host: localhost:19002
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 78
> Content-Type: application/x-www-form-urlencoded
>
* upload completely sent off: 78 out of 78 bytes
< HTTP/1.1 200 OK
< transfer-encoding: chunked
< connection: keep-alive
< content-type: application/json; charset=utf-8
< Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept
<
{
"requestID": "dbc564be-f0e8-4a4e-9f63-c66082f8c947",
"plans":{},
"metrics": {
"elapsedTime": "87.664796114s",
"executionTime": "87.664040995s",
"resultCount": 0,
"resultSize": 0,
"processedObjects": 0
}
}
* Connection #0 to host localhost left intact
Test the performance AFTER creating a primary index
Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB
1
2
3
4
5
avg
query time (seconds)
1.78
1.79
1.70
1.78
1.79
1.77
Result against 50M dataset on TwitterMap backup server AsterixDB
Overview
Since late 2017, AsterixDB has introduced a feature which allows user to create a primary index on a table (called dataset in AsterixDB), so that a
select count(*) from table;
query will not need to scan the whole table but only read the primary index.This article will describe an experiment of how to use this feature and what's the performance gain of it.
Preparation
Create a shell script
pingAsterixDB.sh
with the following code:Test the performance BEFORE creating a primary index
Run
pingAsterixDB.sh
with given the AsterixDB query URL, sql queryselect count(*) from twitter.ds_tweet;
and interval 1 second between each ping. (We use the datasetds_tweet
under dataversetwitter
)Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB
Result against 50M dataset on TwitterMap backup server AsterixDB
Create a primary index on AsterixDB
The following SQL will create a primary index called
foo
on the dataset.Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB
Result against 50M dataset on TwitterMap backup server AsterixDB
Test the performance AFTER creating a primary index
Result against 4M dataset on Coronavirus TwitterMap production server AsterixDB
Result against 50M dataset on TwitterMap backup server AsterixDB