citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.48k stars 664 forks source link

ERROR: no binary output function available for type theta_sketch #7633

Open biber-baek opened 3 months ago

biber-baek commented 3 months ago
PostgreSQL 15.7
Citus 12.1.-1
datasketches 1.6.0
select version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 15.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
explain analyze select  report_date, campaign_id, theta_sketch_get_estimate(theta_sketch_union(reach)) 
 from 
(select report_date,campaign_id, theta_sketch_union(reach_ds) as reach
 from report_day where service_account_id='599267'
 group by report_date, campaign_id,representative_id
 ) a
 group by report_date, campaign_id
order by  campaign_id, report_date
 limit 10;

ERROR:  42883: no binary output function available for type theta_sketch
CONTEXT:  while executing command on postgres-citus09:5432
LOCATION:  ReportResultError, remote_commands.c:324
Time: 1248.635 ms (00:01.249)
explain select  report_date, campaign_id, theta_sketch_get_estimate(theta_sketch_union(reach)) 
 from 
(select report_date,campaign_id, theta_sketch_union(reach_ds) as reach
 from report_day where service_account_id='599267'
 group by report_date, campaign_id,representative_id
 ) a
 group by report_date, campaign_id
order by  campaign_id, report_date
 limit 10;

                                                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=757.32..757.35 rows=10 width=20)
   ->  Sort  (cost=757.32..757.82 rows=200 width=20)
         Sort Key: remote_scan.campaign_id, remote_scan.report_date
         ->  HashAggregate  (cost=750.00..753.00 rows=200 width=20)
               Group Key: remote_scan.campaign_id, remote_scan.report_date
               ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=44)
                     Task Count: 128
                     Tasks Shown: One of 128
                     ->  Task
                           Node: host=postgres-citus02 port=5432 dbname=citus_poc
                           ->  Subquery Scan on a  (cost=97482.75..108129.48 rows=31046 width=44)
                                 ->  Finalize GroupAggregate  (cost=97482.75..107819.02 rows=31046 width=52)
                                       Group Key: report_day.report_date, report_day.campaign_id, report_day.representative_id
                                       ->  Gather Merge  (cost=97482.75..106654.80 rows=62092 width=52)
                                             Workers Planned: 2
                                             ->  Partial GroupAggregate  (cost=96482.73..98487.82 rows=31046 width=52)
                                                   Group Key: report_day.report_date, report_day.campaign_id, report_day.representative_id
                                                   ->  Sort  (cost=96482.73..96806.13 rows=129361 width=147)
                                                         Sort Key: report_day.report_date, report_day.campaign_id, report_day.representative_id
                                                         ->  Parallel Append  (cost=309.83..78549.31 rows=129361 width=147)
                                                               ->  Parallel Bitmap Heap Scan on report_day_p20240606_104824 report_day_1  (cost=372.08..5512.13 rows=15365 width=147)
                                                                     Recheck Cond: (service_account_id = '599267'::bigint)
                                                                     ->  Bitmap Index Scan on report_day_p20240606_service_account_id_report_date_idx_104824  (cost=0.00..362.86 rows=36875 width=0)
                                                                           Index Cond: (service_account_id = '599267'::bigint)
                                                               ->  Parallel Bitmap Heap Scan on report_day_p20240608_105080 report_day_3  (cost=309.83..5381.77 rows=12555 width=147)
                                                                     Recheck Cond: (service_account_id = '599267'::bigint)
                                                                     ->  Bitmap Index Scan on report_day_p20240608_service_account_id_report_date_idx_105080  (cost=0.00..302.29 rows=30133 width=0)
                                                                           Index Cond: (service_account_id = '599267'::bigint)
....

Hi.

When using the datasketches parallel aggregate function on distributed tables in Citus, an error occurs as seen in 'explain analyze'. However, if I run a query without an analyze or use only 'explain', it works fine.

onurctirtir commented 3 months ago

Hey @biber-baek,

Could you share the commands you used to create this distributed table so that I can try reproducing this on my end?

Any commands you used to create & distribute the table and to create the underlying column types would help a lot.

Thanks!

biber-baek commented 3 months ago

@onurctirtir thanks for your reply. It is reproduced by the command below.

CREATE TABLE report_day (
    report_date timestamp without time zone NOT NULL,
    campaign_id bigint NOT NULL,
    ad_id bigint NOT NULL,
    account_id bigint NOT NULL,
    sketch_col public.theta_sketch
);

select create_distributed_table('report_day','ad_id');

insert into report_day select now() - (g%10 || 'day')::interval, g, trunc(random() * 100 + 1), g%10, (select  public.theta_sketch_build(1)) from generate_series(1, 100) g;

select report_date, public.theta_sketch_union(sketch_col) as reach
 from report_day where account_id='7'
 group by report_date;
onurctirtir commented 3 months ago

Could you please also share the definition of theta_sketch type?

biber-baek commented 3 months ago

I am using the datasketches extension to calculate approximate values in analytical work. The theta_sketch type is a type provided by datasketches.

m3hm3t commented 1 month ago
[local] citus@citus:9700-18493=# explain analyze select  report_date, campaign_id, theta_sketch_get_estimate(theta_sketch_union(reach)) 
 from 
(select report_date,campaign_id, theta_sketch_union(reach_ds) as reach
 from report_day where service_account_id='599267'
 group by report_date, campaign_id,representative_id
 ) a
 group by report_date, campaign_id
order by  campaign_id, report_date
 limit 10;
ERROR:  column "reach_ds" does not exist
LINE 3: ...elect report_date,campaign_id, theta_sketch_union(reach_ds) ...
                                                             ^
Time: 0.233 ms

[local] citus@citus:9700-18493=# explain select  report_date, campaign_id, theta_sketch_get_estimate(theta_sketch_union(reach)) 
 from 
(select report_date,campaign_id, theta_sketch_union(reach_ds) as reach
 from report_day where service_account_id='599267'
 group by report_date, campaign_id,representative_id
 ) a
 group by report_date, campaign_id
order by  campaign_id, report_date
 limit 10;
ERROR:  column "reach_ds" does not exist
LINE 3: ...elect report_date,campaign_id, theta_sketch_union(reach_ds) ...
                                                             ^
Time: 0.217 ms

@biber-baek When I attempted to reproduce the issue, I encountered the same error mentioned above. Could there be a missing step in the reproduction process?