bug: COPY INTO GCS location seems to duplicate path #16304

databendlabs / databend

Open rad-pat opened 1 month ago

rad-pat commented 1 month ago

Search before asking

[X] I had searched in the issues and found no similar issues.

Version

v1.2.618-nightly

What's Wrong?

When issuing a COPY INTO command with a GCS location, the path is duplicated in the resulting location in GCS.

How to Reproduce?

CREATE TABLE t1 (c1 INT NULL);
INSERT INTO t1 VALUES (1), (2), (3);

COPY INTO 'gcs://bucket/tables/t1'
CONNECTION = (
    CREDENTIAL = '<snip>'
)
FROM default.t1
FILE_FORMAT = (TYPE = PARQUET);

Look in GCS and see that the path is bucket/tables/t1/tables/t1

Are you willing to submit PR?

[ ] Yes I am willing to submit a PR!

rad-pat commented 1 month ago

So it seems that including a trailing slash at the end of the path makes it behave correctly. I can include the slash, but since the command always exports one or many Parquet files to the location, shouldn't it be assumed that the location is always a path? Or at least that /tables/ is the path and t1 is the file (or perhaps a prefix for the one or many files)?

Works correctly:

CREATE TABLE t1 (c1 INT NULL);
INSERT INTO t1 VALUES (1), (2), (3);

COPY INTO 'gcs://bucket/tables/t1/'
CONNECTION = (
    CREDENTIAL = '<snip>'
)
FROM default.t1
FILE_FORMAT = (TYPE = PARQUET);
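
For anyone generating these statements from a script on an affected version, the workaround above amounts to making sure the location ends with a slash before it goes into the COPY INTO statement. A minimal sketch in Python, where `with_trailing_slash` is a hypothetical helper name and not part of any Databend client:

```python
def with_trailing_slash(location: str) -> str:
    """Append a trailing slash to the location if it does not already end with one."""
    # Hypothetical helper: only normalizes the string, it does not talk to GCS.
    return location if location.endswith("/") else location + "/"


# Both forms end up as the location that worked in the example above.
print(with_trailing_slash("gcs://bucket/tables/t1"))
print(with_trailing_slash("gcs://bucket/tables/t1/"))
```

The helper leaves a location that already ends with a slash unchanged, so it is safe to apply unconditionally.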

youngsofun commented 1 month ago

@rad-pat thank you. It is a bug.

rad-pat commented 1 month ago

@youngsofun, I presume this is fixed now with #16321?

Was this affecting internal storage if GCS is used, or would that have remained unaffected?

youngsofun commented 1 month ago

It should be fixed now; please give it a try.

rad-pat commented 1 month ago

Yes, it seems fixed for COPY INTO, thanks. I just wondered whether there was any effect on the Parquet files stored by the system while this bug was present?

youngsofun commented 1 month ago

The behavior of the bug is as follows:

If your location string does not end with a /, copying into bucket/<path> will result in bucket/<path>/<path>/<file_name_containing_uuid> instead of bucket/<path>/<file_name_containing_uuid>.

While it’s unfortunate to make this mistake, I don’t think it’s a major issue in practice, especially if you are only using it for unloading. The additional <path>/ can be considered part of the randomly generated path created by Databend.
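
To make the difference concrete, the behavior described above can be modeled in a few lines of Python. This is only an illustration of the resulting object keys as reported in this thread, not Databend's actual code; the function name `unload_key` and the file name are made up for the example.

```python
import uuid


def unload_key(path: str, file_name: str, buggy: bool) -> str:
    """Model the object key (relative to the bucket) for one unloaded file.

    Illustrative only: mirrors the behavior described in this thread,
    not the real implementation.
    """
    if path.endswith("/"):
        # With a trailing slash the file lands directly under the path,
        # with or without the bug.
        return f"{path}{file_name}"
    if buggy:
        # Reported behavior: the path appears twice in the key.
        return f"{path}/{path}/{file_name}"
    # Expected behavior: the file lands directly under the path.
    return f"{path}/{file_name}"


# Hypothetical file name; the real names contain a UUID generated by Databend.
name = f"data_{uuid.uuid4().hex}.parquet"

print(unload_key("tables/t1", name, buggy=True))    # tables/t1/tables/t1/<file>
print(unload_key("tables/t1", name, buggy=False))   # tables/t1/<file>
print(unload_key("tables/t1/", name, buggy=True))   # tables/t1/<file>
```

With the location from the original report, the buggy branch yields bucket/tables/t1/tables/t1/<file>, which matches what was seen in GCS.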