Altinity / clickhouse-backup

Tool for easy backup and restore for ClickHouse® using object storage for backup files.
https://altinity.com

Add to documentation the difference between download/upload_concurrency and s3/ftp concurrency #387

Closed: vanyasvl closed this issue 2 years ago

vanyasvl commented 2 years ago

Hello. clickhouse-backup has the options download_concurrency/upload_concurrency in the general section and concurrency in the s3 section. What is the difference? Should they match? If I set upload_concurrency to 10 and s3 concurrency to 1, how many uploads will run in parallel, 1 or 10?

Thanks

Slach commented 2 years ago

upload_concurrency / download_concurrency define how many parallel upload / download go-routines will start, independent of the remote storage type. As of 1.3.0 this means how many data parts will be uploaded in parallel.

concurrency in the s3 section means how many concurrent upload streams will run during the multipart upload inside each upload go-routine.

A high S3_CONCURRENCY combined with a high S3_PART_SIZE will allocate a lot of memory for buffers inside the AWS golang SDK.
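
As a rough illustration (a sketch added here, not from the original answer): the two settings multiply, since each of the upload_concurrency go-routines runs its own multipart upload with s3 concurrency streams, and each stream holds a buffer of about part_size bytes. Peak SDK buffer memory is therefore roughly upload_concurrency * s3.concurrency * part_size; actual usage can be noticeably higher, as the numbers later in this thread show.

general:
  upload_concurrency: 4    # 4 data parts uploaded in parallel
  download_concurrency: 4
s3:
  concurrency: 2           # 2 multipart streams inside each upload go-routine
  part_size: 104857600     # ~100 MiB buffer per stream
# Assumed estimate: 4 * 2 * 100 MiB ≈ 800 MiB of AWS SDK buffers at peak.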

Slach commented 2 years ago

If I set upload_concurrency to 10 and s3 concurrency to 1, how many uploads will be in parallel, 1 or 10?

10 parallel uploads, and each upload will be restricted on the S3 side to a single multipart upload stream.

Moreover, I recommend using compression_format: tar to avoid high CPU usage.
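
For reference, a minimal sketch of the setup discussed here (values taken from the thread; the comments are my own reading):

general:
  upload_concurrency: 10   # 10 data parts uploaded in parallel
s3:
  concurrency: 1           # a single multipart stream per upload go-routine
  compression_format: tar  # plain tar archive, keeps CPU usage low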

vanyasvl commented 2 years ago

Thanks. So what is the correct way to increase upload speed without using too much memory? In 1.3.0, with upload_concurrency 10 and s3 concurrency 1, even 64 GB of RAM on the server is not enough. With upload_concurrency 5 and s3 concurrency 1, clickhouse-backup uses about 40 GB of RAM. We use part_size 1 GB and max_file_size 1 GB too. So maybe it is better to decrease part_size to 100 MB and set s3 concurrency to 10?
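
For clarity, a sketch of the alternative proposed above (values taken from this comment; keeping upload_concurrency at 5 is an assumption):

general:
  upload_concurrency: 5
  max_file_size: 1073741824   # 1 GiB
s3:
  concurrency: 10
  part_size: 104857600        # 100 MiB instead of 1 GiB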

Slach commented 2 years ago

Did you define s3 -> part_size, or general -> max_file_size?

Could you run clickhouse-backup print-config and share your current config, without sensitive credentials?

vanyasvl commented 2 years ago

Yes, I defined part_size, because our storage (Swift with an S3 middleware) doesn't support more than 1000 parts per object:

general:
  remote_storage: s3
  max_file_size: 1073741824
  disable_progress_bar: false
  backups_to_keep_local: 1
  backups_to_keep_remote: 30
  log_level: debug
  allow_empty_backups: false
  download_concurrency: 10
  upload_concurrency: 5
  restore_schema_on_cluster: ""
  upload_by_part: true
  download_by_part: true
clickhouse:
  username: default
  password: ""
  host: 127.0.0.1
  port: 9000
  disk_mapping: {}
  skip_tables:
  - system.*
  - default.*
  - INFORMATION_SCHEMA.*
  - information_schema.*
  timeout: 5m
  freeze_by_part: false
  secure: false
  skip_verify: false
  sync_replicated_tables: false
  log_sql_queries: false
  config_dir: /etc/clickhouse-server/
  restart_command: systemctl restart clickhouse-server
  ignore_not_exists_error_during_freeze: true
  debug: false
s3:
  ....
  region: us-east-1
  acl: private
  assume_role_arn: ""
  force_path_style: true
  path: ""
  disable_ssl: false
  compression_level: 1
  compression_format: tar
  sse: ""
  disable_cert_verification: false
  storage_class: STANDARD
  concurrency: 1
  part_size: 1073741824
  debug: false
gcs:
  credentials_file: ""
  credentials_json: ""
  bucket: ""
  path: ""
  compression_level: 1
  compression_format: tar
  debug: false
  endpoint: ""
cos:
  url: ""
  timeout: 2m
  secret_id: ""
  secret_key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  debug: false
api:
  listen: 0.0.0.0:7171
  enable_metrics: true
  enable_pprof: false
  username: ""
  password: ""
  secure: false
  certificate_file: ""
  private_key_file: ""
  create_integration_tables: false
  allow_parallel: false
ftp:
  address: ""
  timeout: 2m
  username: ""
  password: ""
  tls: false
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 28
  debug: false
sftp:
  address: ""
  port: 22
  username: ""
  password: ""
  key: ""
  path: ""
  compression_format: tar
  compression_level: 1
  concurrency: 1
  debug: false
azblob:
  endpoint_suffix: core.windows.net
  account_name: ""
  account_key: ""
  sas: ""
  use_managed_identity: false
  container: ""
  path: ""
  compression_level: 1
  compression_format: tar
  sse_key: ""
  buffer_size: 0
  buffer_count: 3
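
As an aside on the 1000-parts-per-object limit mentioned above (the arithmetic below is added for illustration and is not part of the original thread): with max_file_size of 1073741824 bytes, any part_size of at least about 1073741824 / 1000 ≈ 1 MiB keeps a single file within 1000 parts, so part_size can be reduced far below 1 GiB.

s3:
  # Assumed constraint: part_size >= max_file_size / 1000
  #   1073741824 / 1000 ≈ 1 MiB minimum,
  # so values such as 104857600 (100 MiB) stay comfortably within the limit.
  part_size: 104857600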

Slach commented 2 years ago

Try decreasing s3 -> part_size to 52428800 (50 MB) instead of 1 GB?
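
To illustrate that suggestion (the breakdown in the comments is an added estimate, not from the thread):

s3:
  part_size: 52428800   # 50 MiB
# With max_file_size 1073741824 (1 GiB), each file splits into
# 1073741824 / 52428800 ≈ 21 parts, far below the 1000-part limit,
# and each per-stream buffer is about 20x smaller than with a 1 GiB part_size.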

vanyasvl commented 2 years ago

Thanks. Decreasing part_size decreases memory consumption. I will try a 100 MB part_size with higher concurrency.

vanyasvl commented 2 years ago

upload_concurrency / download_concurrency define how many parallel upload / download go-routines will start, independent of the remote storage type. As of 1.3.0 this means how many data parts will be uploaded in parallel.

concurrency in the s3 section means how many concurrent upload streams will run during the multipart upload inside each upload go-routine.

A high S3_CONCURRENCY combined with a high S3_PART_SIZE will allocate a lot of memory for buffers inside the AWS golang SDK.

Could you add it to the documentation?

Slach commented 2 years ago

Could you add it to the documentation?

You are welcome to make a pull request.