facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.52k stars 1.16k forks source link

File name incompatibility for bucketed tables between Presto and Velox #4445

Open isadikov opened 1 year ago

isadikov commented 1 year ago

Bug description

When testing Glue catalog with Velox, I found that bucketed tables are broken in Prestissimo/Velox due to file name compatibility issues with Presto. The error is due to differences in file name format which are probably fine for regular and partitioned reads/writes but cause problems for bucketing.

Presto generates the following file names for bucketed tables: 000005_0_20230326_203358_00001_knt7i. Velox generates different names: 20230326_195144_00002_wi48b.1.0.2_0_b2d34dc8-05b1-4a09-9ece-e9632a04b7b8.

Whenever the bucketing is enabled, computePartitionUpdatesForMissingBuckets method needs to compute the missing buckets to pad the amount of existing buckets (assuming your data does not fall into all of the buckets - for this case, read on). Because Velox file names are different, the total number of buckets ends up being existing + missing > total bucket count and fails the assertion.

Note that you can generate data that would create all of the buckets or simply set hive.create-empty-bucket-files to false. This makes CREATE AS SELECT and CREATE + INSERT work for bucketed tables; however, you would encounter an issue reading such tables: File '20230328_202231_00002_kj8v5.1.0.0_0_c968fed0-0756-4235-9946- 517b0b4a62b0' does not match the standard naming pattern.

Here are the exceptions thrown depending on the value of hive.create-empty-bucket-files.

With hive.create-empty-bucket-files set to true (default) - I had to improve the assertion message to show the actual difference:

com.google.common.base.VerifyException: Bucket count mismatch, expected bucket count 7 but found existing file names 
[20230328_201513_00002_45b64.1.0.1_0_416eceb1-21bd-46b3-8dc6-c5d0964843f3] and missing file names 
[000000_0_20230328_201513_00002_45b64, 000001_0_20230328_201513_00002_45b64, 
000002_0_20230328_201513_00002_45b64, 000003_0_20230328_201513_00002_45b64, 
000004_0_20230328_201513_00002_45b64, 000005_0_20230328_201513_00002_45b64, 
000006_0_20230328_201513_00002_45b64]
    at com.google.common.base.Verify.verify(Verify.java:443)
    at com.facebook.presto.hive.HiveMetadata.computeFileNamesForMissingBuckets(HiveMetadata.java:1868)
    at com.facebook.presto.hive.HiveMetadata.computePartitionUpdatesForMissingBuckets(HiveMetadata.java:1826)
    at com.facebook.presto.hive.HiveMetadata.finishInsertInternal(HiveMetadata.java:2002)
    at com.facebook.presto.hive.HiveMetadata.finishInsert(HiveMetadata.java:1981)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.finishInsert(ClassLoaderSafeConnectorMetadata.java:452)
    at com.facebook.presto.metadata.MetadataManager.finishInsert(MetadataManager.java:858)

With hive.create-empty-bucket-files set to false:

Caused by: java.lang.RuntimeException: Hive table 'tmp_presto_glue_test_test_schemab69fd14160a5447f832499a950fb865c.
tmp_presto_glue_test_test_table638ca13efc524ec3ae076adb3a9f9629' is corrupt. File '20230328_202231_00002_kj8v5.1.0.0_0_c968fed0-0756-4235-9946-
517b0b4a62b0' does not match the standard naming pattern, and the number of files in the directory (1) does not match the 
declared bucket count (7) for partition: orderdate=1996-12-09
    at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
    at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:726)
    at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:694)
    at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
    ... 32 more

The root cause appears to be divergence between Velox file name generation and HiveWriterFactory.computeBucketedFileName. Those should probably be synchronised with Velox being a library and receiving the final file name from Presto.

You can find examples of tests that reproduce the error in https://github.com/prestodb/presto/pull/19155/files#diff-7e4397f8e1a0d93026a6cd25a11dd7e15e3c704e99ec8b7ab9eb14111f13a035R364 (work in progress).

System information

Reproducible on the latest master. The build that I was testing locally with is as follows:

Velox System Info v0.0.2 Commit: 8b883b0b04ae63181650a2fd8a97ffe798269a39 CMake Version: 3.26.1 System: Darwin-21.6.0 Arch: arm64 C++ Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ C++ Compiler Version: 14.0.0.14000029 C Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc C Compiler Version: 14.0.0.14000029 CMake Prefix Path: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk/usr;/Users/ivan/homebrew;/usr/local;/usr;/;/Users/ivan/homebrew/Cellar/cmake/3.26.1;/usr/local;/usr/X11R6;/usr/pkg;/opt;/sw;/opt/local

Relevant logs

No response

mbasmanova commented 1 year ago

CC: @gggrace14