databendlabs / databend

Data, Analytics & AI. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

bug: COPY INTO GCS location seems to duplicate path #16304

Open rad-pat opened 1 month ago

rad-pat commented 1 month ago

Version

v1.2.618-nightly

What's Wrong?

When issuing a COPY INTO command with a GCS location, the path is duplicated in the resulting GCS object path.

How to Reproduce?

CREATE table t1 (c1 int null);
INSERT INTO t1 values (1), (2), (3);

COPY INTO 'gcs://bucket/tables/t1'
CONNECTION = (
    CREDENTIAL = '<snip>'
)
FROM default.t1
FILE_FORMAT = (TYPE = PARQUET);

Look in GCS and see that the resulting path is bucket/tables/t1/tables/t1

rad-pat commented 1 month ago

It seems that adding a trailing slash to the end of the path makes it behave correctly. I can include the slash, but since COPY INTO always exports one or many Parquet files to the location, shouldn't the location always be assumed to be a path? Or, at least, shouldn't /tables/ be treated as the path and t1 as the file name (for the one or many files)?

Works correctly:

CREATE table t1 (c1 int null);
INSERT INTO t1 values (1), (2), (3);

COPY INTO 'gcs://bucket/tables/t1/'
CONNECTION = (
    CREDENTIAL = '<snip>'
)
FROM default.t1
FILE_FORMAT = (TYPE = PARQUET);
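
For anyone generating COPY INTO statements from client code on an affected version, the trailing-slash workaround above could be applied before the statement is built. This is a minimal sketch only; `ensure_trailing_slash` is a hypothetical helper and not part of Databend or any of its drivers.

```python
def ensure_trailing_slash(location: str) -> str:
    """Append a trailing slash if the location does not already end with one."""
    # Hypothetical client-side helper; Databend itself is not involved here.
    return location if location.endswith("/") else location + "/"


if __name__ == "__main__":
    # A location without the slash gets one; a location with it is unchanged.
    print(ensure_trailing_slash("gcs://bucket/tables/t1"))
    print(ensure_trailing_slash("gcs://bucket/tables/t1/"))
```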

youngsofun commented 1 month ago

@rad-pat Thank you. It is a bug.

rad-pat commented 1 month ago

@youngsofun, I presume this is now fixed by #16321?

Was this affecting internal storage if GCS is used, or would that have remained unaffected?

youngsofun commented 1 month ago

It should be fixed now. Please give it a try.

rad-pat commented 1 month ago

Yes, it seems fixed for COPY INTO, thanks. I just wondered whether there was any effect on the Parquet files stored by the system while this bug was present?

youngsofun commented 1 month ago

The behavior of the bug is as follows:

If your location string does not end with a /, copying into bucket/<path> will result in bucket/<path>/<path>/<file_name_containing_uuid> instead of bucket/<path>/<file_name_containing_uuid>

While it's unfortunate to make this mistake, I don't think it's a major issue in practice, especially if you are only using it for unloading. The additional <path>/ can be considered part of the randomly generated path created by Databend.
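
The behavior described above can be modeled in a few lines. This is only an illustration of the observed object keys, not Databend's actual implementation; the function names and the example file name are hypothetical.

```python
def buggy_object_key(path: str, file_name: str) -> str:
    """Model the reported behavior: a path without a trailing slash is repeated."""
    if path.endswith("/"):
        # With a trailing slash the file lands directly under the path.
        return f"{path}{file_name}"
    # Without it, the path appears twice before the file name.
    return f"{path}/{path}/{file_name}"


def expected_object_key(path: str, file_name: str) -> str:
    """Model the intended behavior: the file lands under the path either way."""
    return f"{path.rstrip('/')}/{file_name}"


if __name__ == "__main__":
    # Hypothetical file name standing in for <file_name_containing_uuid>.
    name = "data_example.parquet"
    print(buggy_object_key("tables/t1", name))
    print(buggy_object_key("tables/t1/", name))
    print(expected_object_key("tables/t1", name))
```

Files written while the bug was present would sit under the longer key, so anything reading unloaded data by prefix may need to account for both layouts.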