Closed ljyss9 closed 9 months ago
Thank you for reporting. Could you provide your reproducible procedure, @ljyss9 ?
Just use csv-import tool can reproduce it,set compression chunk size to 64M, and input data is hard to compress,once one block after compress is larger than 8M,will cause orc data wrong.
That's definitely something we need to fix.
To @ljyss9
This is not a reproducible procedure.
input data is hard to compress
Do you mean that you hit orc-tool failure? Or, orc-tool
succeeded but the generated file is corrupted?
will cause orc data wrong.
@ljyss9 which codec did you use? If possible, can all codecs reproduce this problem?
@dongjoon-hyun orc-tool succeeded but the generated file is corrupted This picture is copy from official website, what the behavior is if the compressed block length is greater than 3 bytes.
@deshanxiao all codecs can reproduce this problem
In ORC-238 (version 1.5.0), an assert for buffer size is added, and there should be no problem with buffer size exceeding 8M.
Can you provide the version and command you used? @ljyss9
I think I understand @ljyss9‘’s issue.
Although like @cxzl25 mentioned, we have relevant restrictions on the java side. But csv-import (c++ side) still has the issue which means there is no limitation on the c++ side.
As for the condition that data must be difficult to compress. It is just to keep the original data in the block because ORC will not save compressed data once it is found that compression is not cost-effective.
Therefore, the final solution is that once the C++ side finds that the compression chunk size exceeds 8M value, it will refuse to write any data like Java side.
@deshanxiao Thanks for the update! Could you please provide a link to the protection code on the Java side? I think the C++ code should do the same thing.
Hi @wgtmac Actually @cxzl25 has provided the code on the Java side:
cc @ffacs
Thank you, @ljyss9. I will make a patch to forbid invalid compression block size in writer options later.
For the record, this is resolved via ORC-1602 (Apache ORC 1.9.3 and 2.0.0) for now.
Now, this is backported to branch-1.8 and 1.7 too.
compress block header size is 3 byte to store compressed size,if length more than 8M,will overflow