apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Write truncated parquet footer #3069

Open Zand100 opened 3 days ago

Zand100 commented 3 days ago

Describe the bug, including details regarding any error messages, version, and platform.

Occasionally a file is written without its last byte, so it ends in PAR instead of the PAR1 magic (.PAR rather than .PAR1 in the hexdump output below). Reading such a file throws an EOFException.

$ hexdump -C good.snappy.parquet| tail -n 10
004fff70  6b 2e 6c 65 67 61 63 79  44 61 74 65 54 69 6d 65  |k.legacyDateTime|
004fff80  18 00 00 18 4a 70 61 72  71 75 65 74 2d 6d 72 20  |....Jparquet-mr |
004fff90  76 65 72 73 69 6f 6e 20  31 2e 31 32 2e 33 20 28  |version 1.12.3 (|
004fffa0  62 75 69 6c 64 20 66 38  64 63 65 64 31 38 32 63  |build f8dced182c|
004fffb0  34 63 31 66 62 64 65 63  36 63 63 62 33 31 38 35  |4c1fbdec6ccb3185|
004fffc0  35 33 37 62 35 61 30 31  65 36 65 64 36 62 29 19  |537b5a01e6ed6b).|
004fffd0  dc 1c 00 00 1c 00 00 1c  00 00 1c 00 00 1c 00 00  |................|
004fffe0  1c 00 00 1c 00 00 1c 00  00 1c 00 00 1c 00 00 1c  |................|
004ffff0  00 00 1c 00 00 1c 00 00  00 e7 0f 00 00 50 41 52  |.............PAR|
00500000
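For anyone who wants to scan their own output for the same symptom, here is a minimal sketch of a tail check. It relies on the Parquet file layout, in which a file ends with a 4-byte little-endian footer length followed by the 4-byte magic (PAR1, or PARE when the footer is encrypted). The function name check_parquet_tail and its messages are made up for illustration and are not part of parquet-java. Note that the dump above stops at offset 0x00500000, which is 5,242,880 bytes, so the truncated file is exactly 5 MiB long; that may or may not be a coincidence.

```python
import os
import struct

MAGIC = b"PAR1"            # trailing magic of a file with a plaintext footer
ENCRYPTED_MAGIC = b"PARE"  # trailing magic of a file with an encrypted footer


def check_parquet_tail(path):
    """Return (ok, message) describing the last 8 bytes of a Parquet file.

    Hypothetical helper: it only inspects the tail and does not parse
    the footer metadata itself.
    """
    size = os.path.getsize(path)
    # Smallest possible layout: 4-byte head magic + 4-byte length + 4-byte tail magic.
    if size < 12:
        return False, f"file is only {size} bytes, too short to be Parquet"

    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)

    magic = tail[4:]
    if magic in (MAGIC, ENCRYPTED_MAGIC):
        (footer_len,) = struct.unpack("<I", tail[:4])
        if footer_len + 12 > size:
            return False, f"footer length {footer_len} does not fit in {size} bytes"
        return True, f"ok: magic {magic!r}, footer length {footer_len}"

    # The symptom from this report: the final byte of the magic is missing.
    if tail[5:] == MAGIC[:3]:
        return False, "tail ends in PAR: the last byte of the magic is missing"

    return False, f"unexpected tail {tail!r}"
```

Running this right after a job finishes (for example, check_parquet_tail("good.snappy.parquet")) would flag a truncated file before a downstream reader hits the EOFException. It does not explain why the byte goes missing.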

This may be related: we see the issue only on GCP, not on AWS. On GCP we do random disk seeks, whereas on AWS we do sequential disk seeks.

We can rerun a job that wrote a corrupt Parquet file and it succeeds the second time, so the failure appears to be nondeterministic.

This is on version 1.14.3.

Component(s)

No response