Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

backup confuse #352

Closed ddddddcf closed 2 years ago

ddddddcf commented 2 years ago

Hello, I'm confused about the backup. Here is what I did:

  1. Create the table:

         CREATE TABLE default.backup(
           `a` UInt64,
           `b` String,
           `dt` DateTime
         ) ENGINE = MergeTree()
         PARTITION BY dt
         ORDER BY a;

  2. First insert of data:

         clickhouse-client --query="insert into backup select number,toString(number),now() from numbers(1000000)"

  3. First backup:

         clickhouse-backup create_remote --table=default.backup 1

  4. Second insert of data:

         clickhouse-client --query="insert into backup select number,toString(number),now() from numbers(1000000)"

  5. Incremental backup:

         clickhouse-backup create_remote --diff-from=1 --table=default.backup 2

Then I checked and found that the sizes of the two backups are similar, and I found `"required_backup":"1"` in metadata.json. So I think backup "2" is an incremental backup.

  6. Truncate the table:

         clickhouse-client --query="truncate table backup"

  7. Delete the local and remote backup "1":

         clickhouse-backup delete local 1
         clickhouse-backup delete remote 1

  8. Restore the data:

         clickhouse-backup restore 2 -d --table=default.backup

So now I think I should get a table with 1000000 rows. But I found that there are 2000000 rows in the table. I would like to know how this works. Thank you very much, best wishes to you.
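As a side note on how the `"required_backup"` field mentioned above can be checked, here is a hypothetical sketch in plain Python (the field name comes from this thread; the full metadata.json layout is assumed, not taken from clickhouse-backup's source):

```python
import json

# Toy metadata.json fragment as described in the thread: an incremental
# backup carries a non-empty "required_backup" pointing at its base backup.
meta = json.loads('{"required_backup": "1"}')

if meta.get("required_backup"):
    print(f"incremental, base = {meta['required_backup']}")
else:
    print("full backup")
```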

Slach commented 2 years ago

The root cause is not related to clickhouse-backup.

  `dt` DateTime
)ENGINE = MergeTree()
PARTITION BY dt 

You should replace that with PARTITION BY toYYYYMM(dt).

Every partition is a separate directory prefix in /var/lib/clickhouse/data/db/table/<partition_prefix>_<min_block>_<max_block>_<background_merges_count>/.

Your schema will produce a huge number of partitions, with one row in each partition.
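As a rough illustration (plain Python, not ClickHouse) of the difference Slach describes: with second-resolution timestamps, PARTITION BY dt keys every distinct DateTime into its own partition, while PARTITION BY toYYYYMM(dt) collapses a whole month into one:

```python
from datetime import datetime, timedelta

# One hour of rows, one row per second.
rows = [datetime(2022, 1, 1) + timedelta(seconds=i) for i in range(3600)]

partitions_by_dt = {dt for dt in rows}                       # PARTITION BY dt
partitions_by_month = {dt.strftime("%Y%m") for dt in rows}   # PARTITION BY toYYYYMM(dt)

print(len(partitions_by_dt), len(partitions_by_month))  # → 3600 1
```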

ddddddcf commented 2 years ago

I checked the folder; there are only two partitions, because each insert completed within a second and the table is partitioned by second.

ddddddcf commented 2 years ago

So the number of partitions, or the partitioning itself, affects incremental backup? @Slach

Slach commented 2 years ago

Sorry, I misread your question. Let me try to describe it more deeply:

  1. PARTITION BY dt will produce a separate partition for each distinct dt value; please don't use it in production.
  2. now() is a constant value which is calculated once per query, so you should get only ONE partition and ONE part; look at SELECT * FROM system.parts WHERE table='backup' FORMAT Vertical.
  3. This will create a full local backup with the name 1, make one hard link to the one data part, and upload it to remote storage.
  4. This will create one more partition with a different now() value.
  5. This will create a full local backup with the name 2, make two hard links for the two data parts, and upload only the second, new part (according to --diff-from) to remote storage; it requires the local backup named 1 to exist. Did you check the backup sizes via clickhouse-backup list?
  6. Now you have zero rows in your table.
  7. You only deleted the hard links for backup "1", but the hard links under /var/lib/clickhouse/backup/2/ still exist.
  8. You restored the full backup "2" from local disk; it just makes hard links in the "detached" folder and runs ALTER TABLE ... ATTACH PART, so 2000000 rows are expected.

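The hard-link behaviour in steps 7–8 can be sketched with plain filesystem operations (a toy model, not clickhouse-backup code; paths are invented): each local backup directory holds hard links to the same data part files, so deleting backup "1" does not remove the bytes still linked from backup "2".

```python
import os
import tempfile

root = tempfile.mkdtemp()
data = os.path.join(root, "part_0_0_0")  # stand-in for a data part file
with open(data, "w") as f:
    f.write("rows")

os.makedirs(os.path.join(root, "backup/1"))
os.makedirs(os.path.join(root, "backup/2"))
link1 = os.path.join(root, "backup/1/part_0_0_0")
link2 = os.path.join(root, "backup/2/part_0_0_0")
os.link(data, link1)  # backup 1 hard-links the part
os.link(data, link2)  # backup 2 hard-links the same part

os.remove(link1)  # like "clickhouse-backup delete local 1"
os.remove(data)   # even after truncating the table itself...
print(open(link2).read())  # → rows  (backup 2 still owns the bytes)
```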
ddddddcf commented 2 years ago

I probably understand what you mean. So can I assume that if I delete local backup 2 and use remote backup 2 to restore the data, there will only be 1/2 of the data in the table?

Slach commented 2 years ago

After clickhouse-backup delete remote 1 (which backup 2 uses as its --diff-from base), you will receive an error when you try clickhouse-backup restore_remote 2. You can't restore 1/2 of the data; that makes no sense.
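Why the restore fails rather than returning half the data can be sketched with a toy model (not clickhouse-backup code; the dict layout is invented for illustration): restoring an incremental backup must resolve the whole required_backup chain, and one missing link aborts the restore.

```python
# Toy model of remote storage: backup "2" was created with --diff-from=1.
remote = {
    "1": {"required_backup": "", "parts": ["part_a"]},
    "2": {"required_backup": "1", "parts": ["part_b"]},
}

def restore(name):
    """Resolve the chain of required backups, newest first."""
    chain = []
    while name:
        if name not in remote:
            raise RuntimeError(f"required backup {name!r} not found on remote storage")
        chain.append(name)
        name = remote[name]["required_backup"]
    return chain

print(restore("2"))   # → ['2', '1']
del remote["1"]       # like "clickhouse-backup delete remote 1"
try:
    restore("2")
except RuntimeError as e:
    print(e)          # the incremental backup is now unrestorable
```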

A backup is not "remote storage" for your data. It's a snapshot which you use in a disaster recovery process.

ddddddcf commented 2 years ago

OK, I fully understand now. Thank you very much, and Happy New Year!!!