ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

Support for S3 VPC Interface Endpoints #53761

Closed joshbartley closed 3 months ago

joshbartley commented 1 year ago

Use case

You have an on-premises ClickHouse server and either an AWS Direct Connect connection or an IPsec VPN to an AWS VPC.

Describe the solution you'd like

When specifying the AWS S3 bucket location, the ability to include an endpoint URL so that VPC Interface Endpoints can be used. This would apply to S3 backups, the S3 table engine, S3 file loads, and S3 restores.

Describe alternatives you've considered

AWS VPC Gateway Endpoints

AWS VPC Gateway Endpoints support accessing S3 directly, but they do not work over Direct Connect or IPsec tunnels unless you use a public IPv4 /24 to set up the route. Because of the IPv4 /24 requirement, this approach is strongly discouraged. From the AWS documentation:

"[Gateway] Endpoint connections cannot be extended out of a VPC. Resources on the other side of a VPN connection, VPC peering connection, transit gateway, or AWS Direct Connect connection in your VPC cannot use a gateway endpoint to communicate with Amazon S3."

https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html

AWS VPC Interface Endpoints

An AWS VPC Interface Endpoint is a private IP address in the VPC that is reachable over Direct Connect and IPsec tunnels, with no need for public IPv4 routing. https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html

tsolodov commented 1 year ago

Please check this doc: https://repost.aws/knowledge-center/s3-bucket-access-direct-connect

joshbartley commented 1 year ago

@tsolodov The second option in that doc, a VPC interface endpoint, doesn't work. The first option is not feasible, which is the reason for this issue. In the CLI you have to specify the bucket name and the endpoint URL as separate items.
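To make the "separate items" point concrete, here is a minimal Python sketch of how a single URL could be split into an endpoint URL, a bucket, and a key if it is read as path-style. This is only an illustration, not ClickHouse's behavior: the function name `split_path_style`, the host, and the bucket and key names are all hypothetical.

```python
from urllib.parse import urlsplit


def split_path_style(url: str) -> tuple[str, str, str]:
    """Split a path-style S3 URL into (endpoint_url, bucket, key)."""
    parts = urlsplit(url)
    # Scheme plus host form the endpoint; the host is left untouched.
    endpoint_url = f"{parts.scheme}://{parts.netloc}"
    # The first path segment is the bucket, the remainder is the object key.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return endpoint_url, bucket, key


# Hypothetical interface-endpoint host and bucket/key names, for illustration only.
url = (
    "https://bucket.vpce-0a1b2c3d-example.s3.us-east-2.vpce.amazonaws.com"
    "/my-bucket/backups/full"
)
endpoint_url, bucket, key = split_path_style(url)
print(endpoint_url)
print(bucket)
print(key)
```

Read this way, the full VPC endpoint host survives as the endpoint, and the bucket comes from the path rather than from the hostname.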

If you use https://XXXXXXXXXXXXXXXX.vpce-000000000000000000-XXXXXXXXXX.s3.us-east-2.vpce.amazonaws.com/Clickhouse/XXXXXXX/Full as the S3 endpoint, which follows the endpoint URL format from your link, you get the error below:

2023.08.25 18:10:46.357763 [ 1600 ] {} DNSResolver: Cannot resolve host (s3.us-east-2.vpce.amazonaws.com), error 0: Host not found.

This happens because https://github.com/ClickHouse/ClickHouse/blob/32efbe77d1ba48291d90885b11e6f1840c4158db/src/IO/S3/URI.cpp has a regex that strips the VPC endpoint prefix out of the host and then tries to connect to s3.us-east-2.vpce.amazonaws.com, which doesn't exist.
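The failure can be reproduced outside ClickHouse with a short Python sketch. The pattern below is a simplified stand-in for a virtual-hosted-style matcher; it is an assumption made for illustration and not the exact regex in URI.cpp. The host and the function name `naive_endpoint` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Simplified, illustrative pattern: "<bucket>.s3<rest of authority>".
# This is an assumption for demonstration, not the exact regex in URI.cpp.
VIRTUAL_HOSTED_STYLE = re.compile(r"(.+)\.(s3)([.\-][a-z0-9\-.:]+)")


def naive_endpoint(url: str) -> tuple[str, str]:
    """Return (bucket, endpoint) as a virtual-hosted-style parser would see them."""
    parts = urlsplit(url)
    match = VIRTUAL_HOSTED_STYLE.fullmatch(parts.netloc)
    if match is None:
        raise ValueError(f"not a virtual-hosted-style host: {parts.netloc}")
    bucket, name, rest = match.groups()
    # Everything before ".s3" is treated as the bucket and dropped from the host.
    return bucket, f"{parts.scheme}://{name}{rest}"


# Hypothetical interface-endpoint URL, for illustration only.
url = (
    "https://bucket.vpce-0a1b2c3d-example.s3.us-east-2.vpce.amazonaws.com"
    "/my-bucket/backups/full"
)
bucket, endpoint = naive_endpoint(url)
print(bucket)
print(endpoint)
```

Under this pattern the VPC endpoint ID is mistaken for a bucket name, and the remaining host is the same unresolvable s3.us-east-2.vpce.amazonaws.com seen in the DNSResolver error.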

arthurpassos commented 5 months ago

Unless I am missing something, this will be fixed by https://github.com/ClickHouse/ClickHouse/pull/62208.

So far I have only tested simple S3 queries of the form select * from s3(vpce_endpoint)...

arthurpassos commented 5 months ago

Just tested the table engine, backups, incremental backups, and restore. Everything is working as expected.