Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

backup confuse #352

Closed ddddddcf closed 2 years ago

ddddddcf commented 2 years ago

Hello, I'm confused about the backup. Here is what I did:

  1. Create the table:

         CREATE TABLE default.backup(
           `a` UInt64,
           `b` String,
           `dt` DateTime
         ) ENGINE = MergeTree()
         PARTITION BY dt
         ORDER BY a;

  2. First insert of data:

         clickhouse-client --query="insert into backup select number,toString(number),now() from numbers(1000000)"

  3. First backup:

         clickhouse-backup create_remote --table=default.backup 1

  4. Second insert of data:

         clickhouse-client --query="insert into backup select number,toString(number),now() from numbers(1000000)"

  5. Incremental backup:

         clickhouse-backup create_remote --diff-from=1 --table=default.backup 2

Then I checked and found that the sizes of the two backups are similar, and I found `"required_backup":"1"` in metadata.json. So I think backup "2" is an incremental backup.

  6. Truncate the table:

         clickhouse-client --query="truncate table backup"

  7. Delete the local and remote backup "1":

         clickhouse-backup delete local 1
         clickhouse-backup delete remote 1

  8. Restore the data:

         clickhouse-backup restore 2 -d --table=default.backup

So now I think I should get a table with 1000000 rows. But I found that there are 2000000 rows in the table. I would like to know how this works. Thank you very much, best wishes to you.
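As a side note on how the `"required_backup"` field mentioned above can be checked, here is a hypothetical sketch in plain Python (the field name comes from this thread; the full metadata.json layout is assumed, not taken from clickhouse-backup's source):

```python
import json

# Toy metadata.json fragment as described in the thread: an incremental
# backup carries a non-empty "required_backup" pointing at its base backup.
meta = json.loads('{"required_backup": "1"}')

if meta.get("required_backup"):
    print(f"incremental, base = {meta['required_backup']}")
else:
    print("full backup")
```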

Slach commented 2 years ago

The root cause is not related to clickhouse-backup.

  `dt` DateTime
)ENGINE = MergeTree()
PARTITION BY dt 

You should replace that with PARTITION BY toYYYYMM(dt).

Every partition is a separate directory prefix in /var/lib/clickhouse/data/db/table/<partition_prefix>_<min_block>_<max_block>_<background_merges_count>/.

Your schema will produce a huge number of partitions, with one row in each partition.
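As a rough illustration (plain Python, not ClickHouse) of the difference Slach describes: with second-resolution timestamps, PARTITION BY dt keys every distinct DateTime into its own partition, while PARTITION BY toYYYYMM(dt) collapses a whole month into one:

```python
from datetime import datetime, timedelta

# One hour of rows, one row per second.
rows = [datetime(2022, 1, 1) + timedelta(seconds=i) for i in range(3600)]

partitions_by_dt = {dt for dt in rows}                       # PARTITION BY dt
partitions_by_month = {dt.strftime("%Y%m") for dt in rows}   # PARTITION BY toYYYYMM(dt)

print(len(partitions_by_dt), len(partitions_by_month))  # → 3600 1
```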

ddddddcf commented 2 years ago

I checked the folder; there are only two partitions, because each insert completed within a second and the table is partitioned by second.

ddddddcf commented 2 years ago

So the number of partitions, or the partitioning itself, affects incremental backup? @Slach

Slach commented 2 years ago

Sorry, I misread your question. Let me try to describe it more deeply:

  1. PARTITION BY dt will produce a separate partition for each distinct dt value; please don't use it in production.
  2. now() is a constant value which is calculated once per query, so you should get only ONE partition and ONE part; look at SELECT * FROM system.parts WHERE table='backup' FORMAT Vertical.
  3. This will create a full local backup with the name 1, make one hard link to the one data part, and upload it to remote storage.
  4. This will create one more partition with a different now() value.
  5. This will create a full local backup with the name 2, make two hard links for the two data parts, and upload only the second, new part (according to --diff-from) to remote storage; it requires the local backup named 1 to exist. Did you check the backup sizes via clickhouse-backup list?
  6. Now you have zero rows in your table.
  7. You only deleted the hard links for backup "1", but the hard links under /var/lib/clickhouse/backup/2/ still exist.
  8. You restored the full backup "2" from local disk; it just makes hard links in the "detached" folder and runs ALTER TABLE ... ATTACH PART, so 2000000 rows are expected.

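The hard-link behaviour in steps 7–8 can be sketched with plain filesystem operations (a toy model, not clickhouse-backup code; paths are invented): each local backup directory holds hard links to the same data part files, so deleting backup "1" does not remove the bytes still linked from backup "2".

```python
import os
import tempfile

root = tempfile.mkdtemp()
data = os.path.join(root, "part_0_0_0")  # stand-in for a data part file
with open(data, "w") as f:
    f.write("rows")

os.makedirs(os.path.join(root, "backup/1"))
os.makedirs(os.path.join(root, "backup/2"))
link1 = os.path.join(root, "backup/1/part_0_0_0")
link2 = os.path.join(root, "backup/2/part_0_0_0")
os.link(data, link1)  # backup 1 hard-links the part
os.link(data, link2)  # backup 2 hard-links the same part

os.remove(link1)  # like "clickhouse-backup delete local 1"
os.remove(data)   # even after truncating the table itself...
print(open(link2).read())  # → rows  (backup 2 still owns the bytes)
```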
ddddddcf commented 2 years ago

I probably understand what you mean. So can I assume that if I delete local backup 2 and use remote backup 2 to restore the data, there will only be 1/2 of the data in the table?

Slach commented 2 years ago

After clickhouse-backup delete remote 1 (which backup 2 uses as its --diff-from base), you will receive an error when you try clickhouse-backup restore_remote 2. You can't restore 1/2 of the data; that makes no sense.
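Why the restore fails rather than returning half the data can be sketched with a toy model (not clickhouse-backup code; the dict layout is invented for illustration): restoring an incremental backup must resolve the whole required_backup chain, and one missing link aborts the restore.

```python
# Toy model of remote storage: backup "2" was created with --diff-from=1.
remote = {
    "1": {"required_backup": "", "parts": ["part_a"]},
    "2": {"required_backup": "1", "parts": ["part_b"]},
}

def restore(name):
    """Resolve the chain of required backups, newest first."""
    chain = []
    while name:
        if name not in remote:
            raise RuntimeError(f"required backup {name!r} not found on remote storage")
        chain.append(name)
        name = remote[name]["required_backup"]
    return chain

print(restore("2"))   # → ['2', '1']
del remote["1"]       # like "clickhouse-backup delete remote 1"
try:
    restore("2")
except RuntimeError as e:
    print(e)          # the incremental backup is now unrestorable
```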

A backup is not "remote storage" for your data. It's a snapshot which you use in a disaster recovery process.

ddddddcf commented 2 years ago

OK, I fully understand now. Thank you very much, and Happy New Year!!!