apache / polaris

Apache Polaris, the interoperable, open source catalog for Apache Iceberg
https://polaris.apache.org/
Apache License 2.0

[BUG] GCS buckets with underscores are resolved as null #262

Closed by almazgalievisl 1 month ago

almazgalievisl commented 2 months ago

Is this a possible security vulnerability?

Describe the bug

Underscores are valid characters in a GCS bucket name. From the docs:

Bucket names can only contain lowercase letters, numeric characters, dashes (-), underscores (_), and dots (.). Spaces are not allowed. Names containing dots require verification.

The current approach parses the location with java.net.URI, which does not allow underscores in the host name of a URI. For a bucket such as test_bucket, getHost() therefore returns null, the bucket name is set to null, and the access boundaries are requested incorrectly.

To Reproduce

|  Welcome to JShell -- Version 11.0.22
|  For an introduction type: /help intro

jshell> String location = "gs://test_bucket/iceberg/data"
location ==> "gs://test_bucket/iceberg/data"

jshell> URI uri = URI.create(location);
uri ==> gs://test_bucket/iceberg/data

jshell> uri.getHost()
$3 ==> null

Actual Behavior

The bucket name is set to null.

Expected Behavior

The bucket name is set to test_bucket.
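
For illustration, here is a minimal stdlib-only sketch of how the expected behavior could be obtained. It relies on java.net.URI keeping the raw authority component even when it cannot parse it as a host name, so getAuthority() still returns the bucket. The class name GcsBucketName and the helper bucketOf are hypothetical and not part of Polaris, and this is not necessarily how the fix is implemented.

```java
import java.net.URI;

public class GcsBucketName {

    /**
     * Hypothetical helper: extracts the bucket from a gs:// location.
     * Assumes the location has no user-info or port in its authority.
     */
    public static String bucketOf(String location) {
        URI uri = URI.create(location);
        String host = uri.getHost();
        // getHost() is null when the authority is not a valid host name
        // (for example, when it contains '_'). getAuthority() still returns
        // the raw authority component in that case.
        return host != null ? host : uri.getAuthority();
    }

    public static void main(String[] args) {
        // Bucket with an underscore: getHost() alone would give null here.
        System.out.println(bucketOf("gs://test_bucket/iceberg/data"));
        // Bucket with a dash: getHost() already works for this one.
        System.out.println(bucketOf("gs://test-bucket/iceberg/data"));
    }
}
```

The fallback keeps the existing URI-based parsing for the object path (getPath()) and only changes how the bucket is read, which may be less invasive than switching parsers, at the cost of not validating the bucket name.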

Additional context

I guess the BlobId class from com.google.cloud.storage can be used instead:

import com.google.cloud.storage.BlobId;
...
BlobId blob = BlobId.fromGsUtilUri(location);
String bucket = blob.getBucket();
String path = blob.getName();
...

System information

Object storage: GCS

eric-maynard commented 2 months ago

Thanks for filing this! I opened a small PR which seems to resolve the issue.