airlift / aircompressor

A port of Snappy, LZO, LZ4, and Zstandard to Java
Apache License 2.0
549 stars, 110 forks

Hive fails opening split for zst compressed files #173

Open benrifkind opened 11 months ago

benrifkind commented 11 months ago

This issue is copied from https://github.com/trinodb/trino/issues/17792 since I believe this repo is where zstd de/compression is handled.

I have a Hive table built on top of zst compressed data. On a Trino 419 cluster I get the following error when trying to read this table from Trino. That version of Trino uses aircompressor 0.23.

Query 20230607_172621_00003_5hpz7 failed: Error opening Hive split s3://path/to/file.csv.access.log.zst (offset=0, length=1544108): Window size too large (not yet supported): offset=3084

We are currently running Trino 405, where this query executes without an issue. It also ran without an issue on the earlier versions of Trino/Presto we have used. Trino 405 uses aircompressor 0.21.

Did something change between aircompressor 0.21 and 0.24 that might have caused this? And is there anything I can do to get past this error? Thanks in advance for your help!

Full stack trace

io.trino.spi.TrinoException: Error opening Hive split s3://path/to/file.csv.access.log.zst (offset=0, length=1544108): Window size too large (not yet supported): offset=3084
    at io.trino.plugin.hive.line.LinePageSourceFactory.createPageSource(LinePageSourceFactory.java:179)
    at io.trino.plugin.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:218)
    at io.trino.plugin.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:156)
    at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:48)
    at io.trino.split.PageSourceManager.createPageSource(PageSourceManager.java:61)
    at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:298)
    at io.trino.operator.Driver.processInternal(Driver.java:402)
    at io.trino.operator.Driver.lambda$process$8(Driver.java:305)
    at io.trino.operator.Driver.tryWithLock(Driver.java:701)
    at io.trino.operator.Driver.process(Driver.java:297)
    at io.trino.operator.Driver.processForDuration(Driver.java:268)
    at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:888)
    at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:187)
    at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:556)
    at io.trino.$gen.Trino_18f7842____20230607_162211_2.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.airlift.compress.MalformedInputException: Window size too large (not yet supported): offset=3084
    at io.airlift.compress.zstd.Util.verify(Util.java:45)
    at io.airlift.compress.zstd.ZstdFrameDecompressor.decodeCompressedBlock(ZstdFrameDecompressor.java:303)
    at io.airlift.compress.zstd.ZstdIncrementalFrameDecompressor.partialDecompress(ZstdIncrementalFrameDecompressor.java:236)
    at io.airlift.compress.zstd.ZstdInputStream.read(ZstdInputStream.java:89)
    at io.airlift.compress.zstd.ZstdHadoopInputStream.read(ZstdHadoopInputStream.java:53)
    at com.google.common.io.CountingInputStream.read(CountingInputStream.java:64)
    at java.base/java.io.InputStream.readNBytes(InputStream.java:506)
    at io.trino.hive.formats.line.text.TextLineReader.fillBuffer(TextLineReader.java:248)
    at io.trino.hive.formats.line.text.TextLineReader.<init>(TextLineReader.java:67)
    at io.trino.hive.formats.line.text.TextLineReaderFactory.createLineReader(TextLineReaderFactory.java:77)
    at io.trino.plugin.hive.line.LinePageSourceFactory.createPageSource(LinePageSourceFactory.java:171)
    ... 17 more
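
Diagnostic sketch

The message refers to the window size that the zstd frame declares. To see which value an affected file asks for, the sketch below decodes the Window_Descriptor byte of the first frame header, following the layout in the Zstandard frame format (RFC 8878). The class name ZstdWindowSize and its methods are hypothetical and not part of aircompressor. The limit that aircompressor enforces is not stated in this issue, so the sketch only reports the declared value. It also assumes the file starts with a regular zstd frame and inspects only that first frame, while the error above was raised at offset 3084.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of aircompressor.
public class ZstdWindowSize {
    // Zstandard frame magic number; stored little-endian in the file (28 B5 2F FD).
    private static final int MAGIC = 0xFD2FB528;

    /**
     * Returns the window size in bytes declared by the first frame header,
     * or -1 if the frame is single-segment (no Window_Descriptor is present;
     * the window then equals the frame content size).
     */
    static long windowSize(byte[] header) {
        if (header.length < 6) {
            throw new IllegalArgumentException("need at least 6 header bytes");
        }
        int magic = (header[0] & 0xFF)
                | (header[1] & 0xFF) << 8
                | (header[2] & 0xFF) << 16
                | (header[3] & 0xFF) << 24;
        if (magic != MAGIC) {
            throw new IllegalArgumentException("not a zstd frame");
        }
        int frameHeaderDescriptor = header[4] & 0xFF;
        // Bit 5 of the Frame_Header_Descriptor is the Single_Segment_flag.
        if ((frameHeaderDescriptor & 0x20) != 0) {
            return -1;
        }
        int windowDescriptor = header[5] & 0xFF;
        int exponent = windowDescriptor >>> 3;   // upper 5 bits
        int mantissa = windowDescriptor & 0x07;  // lower 3 bits
        long windowBase = 1L << (10 + exponent);
        long windowAdd = (windowBase / 8) * mantissa;
        return windowBase + windowAdd;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZstdWindowSize.java <file.zst>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            long size = windowSize(in.readNBytes(6));
            System.out.println(size < 0
                    ? "single-segment frame (window equals content size)"
                    : "declared window size: " + size + " bytes");
        }
    }
}
```

With Java 11 or later this can be run against a local copy of the file as `java ZstdWindowSize.java file.csv.access.log.zst`. Comparing the reported value for a file that fails with one that reads fine may help narrow down whether the declared window size is what differs.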