citusdata / postgresql-hll

PostgreSQL extension adding HyperLogLog data structures as a native data type
http://blog.aggregateknowledge.com/2013/02/04/open-source-release-postgresql-hll/
Apache License 2.0
1.12k stars 116 forks source link

Question : table consume disk space #153

Open PBAtTheRingIo opened 11 months ago

PBAtTheRingIo commented 11 months ago

Hello, I do not know if it's the way to ask a question on this project, if not sorry about that. I'm using Debian GNU/Linux 12 (bookworm), Kernel Linux debian 6.1.0-9-amd64, PostgreSQL 15 from package 15.3-0+deb12u1, and the package postgresql-15-hll extension from package 2.17-1+b1. We would like to see the result of the cardinality after adding a lot of UUID in using the function "hll_add(hll, hll_hash_bytea(?)) with a binary uuid", and I see that my disk space is consumed by the table. Do you have any idea why, even if the HLL record is small (few kb), the result of "\d+" command and the size on disk can reach several Mb ? Do I have to configure PostgreSQL with dedicated parameters or do something else to not have this issue ? Thanks a lot in advance

emelsimsek commented 10 months ago

Can you explain your scenario a bit more? What is the data model and the commands that I can use to reproduce what you are observing?

PBAtTheRingIo commented 10 months ago

I created a simple table with 1 integer and 1 hll like for instance:

CREATE TABLE views_hll (
    item_id integer NOT NULL,
    hll hll(14,5,8,1) DEFAULT hll_empty(14, 5, (8)::bigint, 1) NOT NULL
);

then I use a loop to insert a bunch of UUIDs one by one in a loop (all uuids are different) thanks to a SQl request like this one: UPDATE "views_hll" AS v0 SET "hll" = hll_add(hll, hll_hash_bytea($1)) WHERE (v0."item_id" = $2); where $1 is a binary UUID and $2 the id (1 for instance and always 1 for each update).

I check the size after 1, 10, 100, 1_000_1000 update in doing for instance:

SELECT pg_total_relation_size('views_hll');
or
SELECT pg_size_pretty(pg_total_relation_size('views_hll'));

And I check also the value directly on my disk, and I see on the disk and thanks to the previous commands the value growing.