We have plenty of open pull requests updating query runners that we are hesitant to merge, because there is no way to verify that they don't introduce regressions. We have also had our share of regressions introduced by changes that seemed safe and were merged...
I think it's high time to start setting up integration tests for data sources. We will probably start with the most used ones and move forward from there. The community's help here is instrumental.
As a first step, I want to document which data sources have a Docker image we can use:
| Data Source | Docker Image | Notes |
| --- | --- | --- |
| Amazon Elasticsearch | Y | We can create a test AWS account |
| Athena | X | We can create a test AWS account |
| Axibase | ? | |
| BigQuery | X | We can create a test account |
| Cassandra | Y | Not sure how straightforward the Docker image is to use |
| Clickhouse | Y | |
| Couchbase | ? | |
| Databricks | X | We can create a test account, but testing the Hive query runner should be enough. |
| IBM DB2 | ? | |
| Apache Drill | ? | It's HTTP based, so at the very least we can have mocks. |
| Druid | Y | |
| DynamoDB | X | We can create a test AWS account |
| Elasticsearch | Y | |
| Google Analytics | X | We can create a test account |
| Google Spreadsheets | X | We can create a test account |
| Graphite | ? | |
| Hive | ? | |
| Impala | ? | |
| InfluxDB | Y | |
| JIRA | X | We can create a test account |
| Kylin | ? | |
| MapD | ? | |
| MemSQL | ? | |
| MongoDB | Y | |
| Microsoft SQL Server | ? | |
| MySQL | Y | |
| Oracle | ? | |
| PostgreSQL | Y | |
| Phoenix | ? | |
| Presto | ? | |
| Prometheus | ? | |
| Qubole | ? | |
| Rockset | X | We can create a test account |
| Salesforce | X | |
| Snowflake | X | We can create a test account |
| SQLite | Y | |
| TreasureData | X | |
| Uptycs | ? | |
| Vertica | ? | |
| Yandex Metrica | ? | |
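For HTTP-based sources such as Apache Drill, a mock-based test could look roughly like the sketch below. It is only an illustration: `run_http_query`, the `/query.json` path, the request payload, and the response shape are assumptions made for the example, not the actual query runner code, and the host name is a placeholder that is never contacted.

```python
import io
import json
from unittest import mock
from urllib import request


def run_http_query(base_url, sql):
    """Hypothetical helper that POSTs a query to an HTTP-based data source."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    req = request.Request(
        base_url + "/query.json",  # assumed endpoint, for illustration only
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_urlopen(req, *args, **kwargs):
    """Stand-in for urlopen: returns a canned body without any network I/O."""
    body = json.dumps(
        {"columns": ["id", "name"], "rows": [{"id": "1", "name": "test"}]}
    )
    return io.BytesIO(body.encode("utf-8"))


def query_with_mock():
    """Run the helper with urlopen patched out, as a unit test would."""
    with mock.patch.object(request, "urlopen", fake_urlopen):
        return run_http_query("http://drill.invalid:8047", "SELECT 1")
```

A mock like this only checks that we build the request and parse the response correctly; it does not catch behavior changes in the data source itself, which is where a real Docker image or test account would still be needed.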
Some notes:
Because some (most?) tests will require some data, it might make sense to have all the possible data sources set up on a remote server and test against that. This introduces its own set of issues though, so we need to check.
For each data source, we need to test against multiple versions. Which versions we test against/support should be documented.
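To make the two notes above a bit more concrete, here is a minimal sketch of a documented version matrix plus a test that seeds its own data. SQLite is used only because it needs no server. `VERSION_MATRIX`, the image tags in it, `expand_matrix`, `seed`, and `run_query` are all hypothetical names and placeholder values for illustration, not a statement of what we support or how the query runners are implemented.

```python
import sqlite3

# Hypothetical version matrix; the tags are placeholders, not a support statement.
VERSION_MATRIX = {
    "postgresql": ["postgres:9.6", "postgres:11"],
    "mysql": ["mysql:5.7", "mysql:8.0"],
}


def expand_matrix(matrix):
    """Expand the matrix into (data source, image) pairs, one test run each."""
    return [
        (source, image)
        for source, images in sorted(matrix.items())
        for image in images
    ]


def seed(connection):
    """Create the small fixture data set the test queries rely on."""
    connection.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT)")
    connection.executemany(
        "INSERT INTO events (name) VALUES (?)", [("signup",), ("login",)]
    )
    connection.commit()


def run_query(connection, sql):
    """Run a query and return columns and rows as plain Python structures."""
    cursor = connection.execute(sql)
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return {"columns": columns, "rows": rows}


def check_sqlite_roundtrip():
    """Seed an in-memory database, query it, and return the result."""
    connection = sqlite3.connect(":memory:")
    try:
        seed(connection)
        return run_query(connection, "SELECT id, name FROM events ORDER BY id")
    finally:
        connection.close()
```

Keeping the matrix in one place would double as the documentation of which versions we test against, and seeding inside the test avoids depending on whatever state a shared remote server happens to be in.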