apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

One compression block is not allowed to be bigger than 8M #1727

Closed: ljyss9 closed this issue 9 months ago

ljyss9 commented 10 months ago

The compression block header has only 3 bytes to store the compressed size, so if the length is more than 8M it will overflow.
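For context, the ORC specification encodes each compression chunk header as three little-endian bytes holding the length times two plus an "is original" bit, so the length field has only 23 bits. The sketch below is illustrative code, not taken from the ORC codebase; it shows the encoding and why a chunk of 8 MiB or more cannot be represented.

```cpp
// Illustrative sketch, not ORC library code: the 3-byte chunk header stores
// (length << 1) | isOriginal in little-endian order, so the length field is
// limited to 23 bits (8,388,607 bytes).
#include <array>
#include <cstdint>
#include <iostream>
#include <stdexcept>

constexpr uint64_t kMaxChunkLength = (1u << 23) - 1;

std::array<uint8_t, 3> encodeChunkHeader(uint64_t length, bool isOriginal) {
  if (length > kMaxChunkLength) {
    // Without a check like this, the value is silently truncated to 3 bytes,
    // which is exactly the corruption reported in this issue.
    throw std::length_error("compression chunk does not fit in a 3-byte header");
  }
  const uint64_t value = (length << 1) | (isOriginal ? 1u : 0u);
  return {static_cast<uint8_t>(value & 0xff),
          static_cast<uint8_t>((value >> 8) & 0xff),
          static_cast<uint8_t>((value >> 16) & 0xff)};
}

int main() {
  // Example from the ORC spec: a 100,000 byte compressed chunk -> 0x40 0x0d 0x03.
  const auto h = encodeChunkHeader(100000, /*isOriginal=*/false);
  std::cout << std::hex << int(h[0]) << ' ' << int(h[1]) << ' ' << int(h[2]) << '\n';
  // encodeChunkHeader(8 * 1024 * 1024, false) would throw: 8 MiB needs 24 bits.
  return 0;
}
```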

dongjoon-hyun commented 10 months ago

Thank you for reporting. Could you provide your reproducible procedure, @ljyss9 ?

ljyss9 commented 10 months ago

Just using the csv-import tool can reproduce it: set the compression chunk size to 64M and use input data that is hard to compress. Once one block is larger than 8M after compression, the ORC data will be corrupted.
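For anyone trying to reproduce this, here is a hedged sketch that uses the ORC C++ writer API directly instead of csv-import. The option and type names reflect my reading of the C++ API and should be treated as assumptions; on affected versions the writer is expected to finish without error while producing an unreadable file.

```cpp
// Hedged repro sketch: write hard-to-compress data with a 64 MiB compression
// block size, which exceeds the 8 MiB limit of the 3-byte chunk header.
#include <orc/OrcFile.hh>

#include <memory>
#include <random>
#include <string>
#include <vector>

int main() {
  orc::WriterOptions options;
  options.setCompression(orc::CompressionKind_ZLIB);
  options.setCompressionBlockSize(64 * 1024 * 1024);  // 64 MiB chunk size

  std::unique_ptr<orc::Type> type(
      orc::Type::buildTypeFromString("struct<x:string>"));
  std::unique_ptr<orc::OutputStream> out = orc::writeLocalFile("repro.orc");
  std::unique_ptr<orc::Writer> writer =
      orc::createWriter(*type, out.get(), options);

  // Random bytes compress poorly, so a compressed chunk stays close to the
  // raw chunk size and can exceed 8 MiB.
  std::mt19937 rng(42);
  std::uniform_int_distribution<int> dist(0, 255);
  const uint64_t rows = 1024;
  std::vector<std::string> values(rows, std::string(100 * 1024, '\0'));
  for (auto& v : values)
    for (auto& c : v) c = static_cast<char>(dist(rng));

  std::unique_ptr<orc::ColumnVectorBatch> batch = writer->createRowBatch(rows);
  auto* root = dynamic_cast<orc::StructVectorBatch*>(batch.get());
  auto* col = dynamic_cast<orc::StringVectorBatch*>(root->fields[0]);
  for (uint64_t i = 0; i < rows; ++i) {
    col->data[i] = const_cast<char*>(values[i].data());
    col->length[i] = static_cast<int64_t>(values[i].size());
  }
  root->numElements = rows;
  col->numElements = rows;
  writer->add(*batch);
  writer->close();  // succeeds on affected versions, but the file is corrupt
  return 0;
}
```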

wgtmac commented 10 months ago

That's definitely something we need to fix.

dongjoon-hyun commented 10 months ago

To @ljyss9

  1. This is not a reproducible procedure.

    use input data that is hard to compress

  2. Do you mean that you hit an orc-tool failure? Or did orc-tool succeed but the generated file is corrupted?

    the ORC data will be corrupted.

deshanxiao commented 10 months ago

@ljyss9 which codec did you use? If possible, can all codecs reproduce this problem?

ljyss9 commented 9 months ago

@dongjoon-hyun The orc-tool succeeded but the generated file is corrupted. This picture is copied from the official website; what is the behavior if the compressed block length is bigger than what 3 bytes can hold? [image: compression chunk header description from the ORC specification]
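To spell out the failure mode: assuming the writer simply casts the header value down to three bytes (an assumption about the affected code path, not a quote of it), an 8 MiB chunk wraps around and the header decodes to a tiny length, so the write succeeds while readers mis-parse everything that follows.

```cpp
// Illustration of the wrap-around (assumes plain truncation to 3 bytes).
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t chunkLength = 8 * 1024 * 1024;  // 8,388,608 bytes
  const uint64_t value = chunkLength << 1;       // needs 24 bits plus the flag
  const uint8_t b0 = static_cast<uint8_t>(value & 0xff);
  const uint8_t b1 = static_cast<uint8_t>((value >> 8) & 0xff);
  const uint8_t b2 = static_cast<uint8_t>((value >> 16) & 0xff);
  const uint64_t decoded =
      (uint64_t(b0) | (uint64_t(b1) << 8) | (uint64_t(b2) << 16)) >> 1;
  std::cout << "written length: " << chunkLength
            << ", length a reader sees: " << decoded << '\n';  // prints 0
  return 0;
}
```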

ljyss9 commented 9 months ago

@deshanxiao all codecs can reproduce this problem

cxzl25 commented 9 months ago

In ORC-238 (version 1.5.0), an assert on the buffer size was added, so there should be no problem with the buffer size exceeding 8M.

Can you provide the version and command you used? @ljyss9

https://github.com/apache/orc/blob/803e273cd5cd7eff3cb34e8595f595a278bc8823/java/core/src/java/org/apache/orc/impl/OutStream.java#L201-L213

deshanxiao commented 9 months ago

I think I understand @ljyss9's issue.

Although, as @cxzl25 mentioned, we have the relevant restriction on the Java side, csv-import (the C++ side) still has the issue, which means there is no such limitation on the C++ side.

As for the condition that the data must be difficult to compress: that is just so the original data is kept in the block, because ORC will not store the compressed data once it finds that compression is not cost-effective.

Therefore, the final solution is that once the C++ side finds that the compression chunk size exceeds the 8M limit, it should refuse to write any data, just like the Java side does.
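For illustration, a minimal sketch of such a guard (hypothetical helper name and message, not the actual ORC-1602 patch), mirroring the check the Java OutStream performs:

```cpp
// Sketch: reject compression block sizes that cannot be represented in the
// 3-byte chunk header (23-bit length field).
#include <cstdint>
#include <sstream>
#include <stdexcept>

constexpr uint64_t kMaxCompressionBlockSize = (1u << 23) - 1;  // 8,388,607 bytes

void checkCompressionBlockSize(uint64_t blockSize) {
  if (blockSize > kMaxCompressionBlockSize) {
    std::ostringstream msg;
    msg << "Invalid compression block size: " << blockSize
        << " (must be at most " << kMaxCompressionBlockSize << " bytes)";
    throw std::invalid_argument(msg.str());
  }
}
```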

wgtmac commented 9 months ago

@deshanxiao Thanks for the update! Could you please provide a link to the protection code on the Java side? I think the C++ code should do the same thing.

deshanxiao commented 9 months ago

Hi @wgtmac, actually @cxzl25 has already provided the code on the Java side:

https://github.com/apache/orc/blob/8f22732bfe336ece7d7bcfade3ca6d200d276f1c/java/core/src/java/org/apache/orc/impl/OutStream.java#L201-L213

wgtmac commented 9 months ago

cc @ffacs

ffacs commented 9 months ago

Thank you, @ljyss9. I will make a patch later that forbids an invalid compression block size in the writer options.

dongjoon-hyun commented 9 months ago

For the record, this is resolved via ORC-1602 (Apache ORC 1.9.3 and 2.0.0) for now.

dongjoon-hyun commented 9 months ago

Now, this is backported to branch-1.8 and 1.7 too.