When testing Glue catalog with Velox, I found that bucketed tables are broken in Prestissimo/Velox due to file name compatibility issues with Presto. The error is due to differences in file name format which are probably fine for regular and partitioned reads/writes but cause problems for bucketing.
Presto generates the following file names for bucketed tables: 000005_0_20230326_203358_00001_knt7i.
Velox generates different names: 20230326_195144_00002_wi48b.1.0.2_0_b2d34dc8-05b1-4a09-9ece-e9632a04b7b8.
Whenever the bucketing is enabled, computePartitionUpdatesForMissingBuckets method needs to compute the missing buckets to pad the amount of existing buckets (assuming your data does not fall into all of the buckets - for this case, read on). Because Velox file names are different, the total number of buckets ends up being existing + missing > total bucket count and fails the assertion.
Note that you can generate data that would create all of the buckets or simply set hive.create-empty-bucket-files to false. This makes CREATE AS SELECT and CREATE + INSERT work for bucketed tables; however, you would encounter an issue reading such tables: File '20230328_202231_00002_kj8v5.1.0.0_0_c968fed0-0756-4235-9946- 517b0b4a62b0' does not match the standard naming pattern.
Here are the exceptions thrown depending on the value of hive.create-empty-bucket-files.
With hive.create-empty-bucket-files set to true (default) - I had to improve the assertion message to show the actual difference:
com.google.common.base.VerifyException: Bucket count mismatch, expected bucket count 7 but found existing file names
[20230328_201513_00002_45b64.1.0.1_0_416eceb1-21bd-46b3-8dc6-c5d0964843f3] and missing file names
[000000_0_20230328_201513_00002_45b64, 000001_0_20230328_201513_00002_45b64,
000002_0_20230328_201513_00002_45b64, 000003_0_20230328_201513_00002_45b64,
000004_0_20230328_201513_00002_45b64, 000005_0_20230328_201513_00002_45b64,
000006_0_20230328_201513_00002_45b64]
at com.google.common.base.Verify.verify(Verify.java:443)
at com.facebook.presto.hive.HiveMetadata.computeFileNamesForMissingBuckets(HiveMetadata.java:1868)
at com.facebook.presto.hive.HiveMetadata.computePartitionUpdatesForMissingBuckets(HiveMetadata.java:1826)
at com.facebook.presto.hive.HiveMetadata.finishInsertInternal(HiveMetadata.java:2002)
at com.facebook.presto.hive.HiveMetadata.finishInsert(HiveMetadata.java:1981)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.finishInsert(ClassLoaderSafeConnectorMetadata.java:452)
at com.facebook.presto.metadata.MetadataManager.finishInsert(MetadataManager.java:858)
With hive.create-empty-bucket-files set to false:
Caused by: java.lang.RuntimeException: Hive table 'tmp_presto_glue_test_test_schemab69fd14160a5447f832499a950fb865c.
tmp_presto_glue_test_test_table638ca13efc524ec3ae076adb3a9f9629' is corrupt. File '20230328_202231_00002_kj8v5.1.0.0_0_c968fed0-0756-4235-9946-
517b0b4a62b0' does not match the standard naming pattern, and the number of files in the directory (1) does not match the
declared bucket count (7) for partition: orderdate=1996-12-09
at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:726)
at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:694)
at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
... 32 more
Bug description
When testing Glue catalog with Velox, I found that bucketed tables are broken in Prestissimo/Velox due to file name compatibility issues with Presto. The error is due to differences in file name format which are probably fine for regular and partitioned reads/writes but cause problems for bucketing.
Presto generates the following file names for bucketed tables:
000005_0_20230326_203358_00001_knt7i
. Velox generates different names:20230326_195144_00002_wi48b.1.0.2_0_b2d34dc8-05b1-4a09-9ece-e9632a04b7b8
.Whenever the bucketing is enabled, computePartitionUpdatesForMissingBuckets method needs to compute the missing buckets to pad the amount of existing buckets (assuming your data does not fall into all of the buckets - for this case, read on). Because Velox file names are different, the total number of buckets ends up being
existing + missing > total bucket count
and fails the assertion.Note that you can generate data that would create all of the buckets or simply set
hive.create-empty-bucket-files
tofalse
. This makes CREATE AS SELECT and CREATE + INSERT work for bucketed tables; however, you would encounter an issue reading such tables:File '20230328_202231_00002_kj8v5.1.0.0_0_c968fed0-0756-4235-9946- 517b0b4a62b0' does not match the standard naming pattern
.Here are the exceptions thrown depending on the value of
hive.create-empty-bucket-files
.With
hive.create-empty-bucket-files
set totrue
(default) - I had to improve the assertion message to show the actual difference:With
hive.create-empty-bucket-files
set tofalse
:The root cause appears to be divergence between Velox file name generation and HiveWriterFactory.computeBucketedFileName. Those should probably be synchronised with Velox being a library and receiving the final file name from Presto.
You can find examples of tests that reproduce the error in https://github.com/prestodb/presto/pull/19155/files#diff-7e4397f8e1a0d93026a6cd25a11dd7e15e3c704e99ec8b7ab9eb14111f13a035R364 (work in progress).
System information
Reproducible on the latest master. The build that I was testing locally with is as follows:
Velox System Info v0.0.2 Commit: 8b883b0b04ae63181650a2fd8a97ffe798269a39 CMake Version: 3.26.1 System: Darwin-21.6.0 Arch: arm64 C++ Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ C++ Compiler Version: 14.0.0.14000029 C Compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc C Compiler Version: 14.0.0.14000029 CMake Prefix Path: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk/usr;/Users/ivan/homebrew;/usr/local;/usr;/;/Users/ivan/homebrew/Cellar/cmake/3.26.1;/usr/local;/usr/X11R6;/usr/pkg;/opt;/sw;/opt/local
Relevant logs
No response