ClickHouse / metabase-clickhouse-driver

ClickHouse database driver for the Metabase business intelligence front-end
Apache License 2.0
477 stars 92 forks source link

Incorrect count result when querying a View over a Distributed table #142

Closed dvgorbunkov closed 4 months ago

dvgorbunkov commented 1 year ago

HI! I'm using metabase version 0.44.1 with version clickhouse driver 0.8.3. Found a problem that a simple query

select count(distinct(orderId)) from AttributedPartnerOrders 
where partnerId = '...'
and createdDate >= '2023-01-01'
and createdDate <= '2023-01-31'
and clickedMinutesAgo < 1440

produces a result 6046, but if you repeat it, it gives 5971.

The numbers are constantly changing.

Through the standard clickhouse-client, the data is given correctly and the same. In the metabase logs we see only regular lines for this request

DEBUG middleware.log :: POST /api/dataset 202 [ASYNC: completed] 244.3 ms (6 DB calls) App DB connections: 1/15 Jetty threads: 2/50 (5 idle, 0 queued) (100 total active threads) Queries in flight: 1 (0 queued); clickhouse DB 2 connections: 0/1 (0 threads blocked)
slvrtrn commented 1 year ago

Hi, thanks for the report.

Having the sample cell towers dataset, I tried to do something similar using Metabase 0.45.2 and plugin version 1.0.3:

SELECT count(distinct(area))
FROM `default`.`cell_towers`
WHERE radio = 'CDMA' 
AND updated >= '2011-07-01'
AND updated <= '2011-07-31';

It produces the same result in the Metabase interface and the standard clickhouse-client.

Can you please try upgrading the driver (to at least 0.9.0 as the underlying JDBC driver has been changed there) and then check if the issue persists?

Cheers.

dvgorbunkov commented 1 year ago

@slvrtrn In the previous changelog in the driver, I didn't find a mention of my case, that's why I wrote here about the problem. Thanks for the advice. I'll definitely update and come back here with the result.

dvgorbunkov commented 1 year ago

Unfortunately the update didn't help. I installed the latest metabase and the latest clickhouse driver (1.0.3). The problem is reproduced.

{
  "browser-info": {
    "language": "ru-RU",
    "platform": "Linux x86_64",
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "vendor": "Google Inc."
  },
  "system-info": {
    "file.encoding": "UTF-8",
    "java.runtime.name": "OpenJDK Runtime Environment",
    "java.runtime.version": "11.0.18+10",
    "java.vendor": "Eclipse Adoptium",
    "java.vendor.url": "https://adoptium.net/",
    "java.version": "11.0.18",
    "java.vm.name": "OpenJDK 64-Bit Server VM",
    "java.vm.version": "11.0.18+10",
    "os.name": "Linux",
    "os.version": "5.10.0-13-amd64",
    "user.language": "en",
    "user.timezone": "GMT"
  },
  "metabase-info": {
    "databases": [
      "h2",
      "clickhouse"
    ],
    "hosting-env": "unknown",
    "application-database": "h2",
    "application-database-details": {
      "database": {
        "name": "H2",
        "version": "1.4.197 (2018-03-18)"
      },
      "jdbc-driver": {
        "name": "H2 JDBC Driver",
        "version": "1.4.197 (2018-03-18)"
      }
    },
    "run-mode": "prod",
    "version": {
      "date": "2023-01-27",
      "tag": "v0.45.2.1",
      "branch": "release-x.45.2.x",
      "hash": "1a59de7"
    },
    "settings": {
      "report-timezone": null
    }
  }
}

P.S. This is very strange, but if you make a request like

select 
  count(*) 
from 
  (
    select 
      distinct(orderId) 
    from 
      AttributedPartnerOrders 
    where 
      partnerId = '...' 
      and createdDate >= '2023-01-01' 
      and createdDate <= '2023-01-31' 
      and clickedMinutesAgo < 1440
  )

then the result is correct and does not change if I repeat it again.

slvrtrn commented 1 year ago

Thanks for trying it out.

That is odd indeed!

Could you provide the table structure, maybe with a limited number of fields (the ones used in the query at least), but with exact ClickHouse types?

I will generate some dummy data and try to reproduce it on my end.

dvgorbunkov commented 1 year ago

Thank you for taking such good care of my question.

CREATE VIEW AttributedPartnerOrders
(
    `partnerId` FixedString(24),
    `visitorId` Nullable(FixedString(24)),
    `createdAt` DateTime,
    `createdDate` Date,
    `itemId` Nullable(Int64),
    `orderId` Nullable(String),
    `qnt` Nullable(Float64),
    `amount` Nullable(Float64),
    `clickedAt` DateTime,
    `clickedMinutesAgo` Float64,
    `blockId` FixedString(32),
    `method` Nullable(String),
    `email` Nullable(String),
    `updatedAt` DateTime
) AS
SELECT *
FROM AttributedPartnerOrdersMaterialized
slvrtrn commented 1 year ago

Thanks. Will have a look.

slvrtrn commented 1 year ago

I still cannot reproduce this.

Here's what I did:

CREATE TABLE AttributedPartnerOrdersMaterialized
(
    `partnerId`         FixedString(24),
    `createdDate`       Date,
    `orderId`           Nullable(String),
    `clickedMinutesAgo` Float64
) ENGINE MergeTree ORDER BY partnerId;

CREATE VIEW AttributedPartnerOrders
(
    `partnerId`         FixedString(24),
    `createdDate`       Date,
    `orderId`           Nullable(String),
    `clickedMinutesAgo` Float64
)
AS
SELECT *
FROM AttributedPartnerOrdersMaterialized;

so I'd have some minimal schema setup for this test.

I generated 1 million random rows, such as

image

10 unique partner ids, 100k unique orderIds there

running several queries such as

select count(distinct(orderId)) from AttributedPartnerOrders 
where partnerId = 'c359c87522234e459890fd88'
and createdDate >= '2023-01-01'
and createdDate <= '2023-01-31'
and clickedMinutesAgo < 1440

yielded the same result in both Metabase and ClickHouse.

I think we are missing something else here...

To help with the investigation, could you provide more details:

to speed up the process, maybe you could ping me in the ClickHouse Community Slack

dvgorbunkov commented 1 year ago

I have 4 servers with clickhouse version 22.8.4.7. Two sharded servers, each with one additional server with a replica. I have the following table schema

CREATE TABLE AttributedPartnerOrdersMaterialized_local ON CLUSTER my_cluster (
    `partnerId` FixedString(24),
    `createdDate` Date,
    `itemId` Nullable(Int64),
    `clickedMinutesAgo` Float64,
...
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/AttributedPartnerOrdersMaterialized_local', '{replica}')
PARTITION BY toYYYYMMDD(createdDate) ORDER BY (partnerId, blockId) SETTINGS index_granularity = 8192;

CREATE TABLE AttributedPartnerOrdersMaterialized ON CLUSTER my_cluster (
    `partnerId` FixedString(24),
    `createdDate` Date,
    `itemId` Nullable(Int64),
    `clickedMinutesAgo` Float64,
...
) ENGINE = Distributed('my_cluster', 'mydb', 'AttributedPartnerOrdersMaterialized_local', rand());

CREATE VIEW AttributedPartnerOrders ON CLUSTER my_cluster (
    `partnerId` FixedString(24),
    `createdDate` Date,
    `itemId` Nullable(Int64),
    `clickedMinutesAgo` Float64,
...
) AS SELECT * FROM AttributedPartnerOrdersMaterialized;

As a result, if we make a query on the Distributed table (AttributedPartnerOrdersMaterialized), we get the correct data that does not change if we repeat the query. However, if we make query view table (AttributedPartnerOrders), then we get a random number. This is still strange, but we will be satisfied with the option of transferring queries to another table.

Thanks for pointing out the data model. We have an obvious error.

slvrtrn commented 1 year ago

However, if we make query view table (AttributedPartnerOrders), then we get a random number.

is it the case for clickhouse-client as well or only Metabase?

dvgorbunkov commented 1 year ago

is it the case for clickhouse-client as well or only Metabase?

only Metabase

slvrtrn commented 1 year ago

Glad to hear that it works for you now. I will leave this issue open as we still need to investigate why it behaves awkwardly when using a View over a Distributed table.

slvrtrn commented 4 months ago

I'm closing this for now, as it is still unclear what the root cause was and if it's still an issue in the recent driver releases. If there are more details and/or minimal repro available now, feel free to re-open or create a new one.