
Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

PARQUET-2461: Upgrade ZSTD-JNI to 1.5.6-2 #1326

Closed cxzl25 closed 5 months ago

cxzl25 commented 5 months ago

Apr 4, 2024 https://github.com/luben/zstd-jni/releases/tag/v1.5.6-2

Mar 28, 2024 https://github.com/luben/zstd-jni/releases/tag/v1.5.6-1

Dec 1, 2023 https://github.com/luben/zstd-jni/releases/tag/v1.5.5-11

PR checklist:

- Jira
- Tests
- Commits
- Style
- Documentation

wgtmac commented 5 months ago

@gszadovszky @Fokko

We see build errors for 1.14.0-SNAPSHOT against Apache Spark branch-3.5:

[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list
[error] java.lang.RuntimeException: found version conflict(s) in library dependencies; some are suspected to be binary incompatible:
[error] 
[error]     * com.github.luben:zstd-jni:1.5.6-2 (strict) is selected over {1.5.5-4}
[error]         +- org.apache.parquet:parquet-hadoop:1.14.0-SNAPSHOT  (depends on 1.5.6-2)
[error]         +- org.apache.spark:spark-core_2.12:3.5.2-SNAPSHOT    (depends on 1.5.5-4)
[error] 

I'm not sure if this is a blocking issue for the release.
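For downstream builds that hit this conflict before the versions are aligned, one possible workaround (not part of this PR, and only viable if the project can tolerate the newer binary) is to pin zstd-jni explicitly in the downstream sbt build. A hedged sketch:

```scala
// build.sbt (downstream project) -- hypothetical workaround sketch, not from this PR.

// Force a single zstd-jni version so sbt's version-conflict check passes.
// 1.5.6-2 is the version pulled in transitively by parquet-hadoop 1.14.0-SNAPSHOT.
dependencyOverrides += "com.github.luben" % "zstd-jni" % "1.5.6-2"

// Alternatively (sbt 1.5+), downgrade the strict conflict to a warning
// for this one library by declaring its version scheme as "always":
libraryDependencySchemes += "com.github.luben" % "zstd-jni" % "always"
```

Either setting silences the `found version conflict(s) in library dependencies` error shown above; the real fix, of course, is for Spark and Parquet to converge on the same zstd-jni version.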

wgtmac commented 5 months ago

FYI: zstd-jni versions on different branches of Apache Spark as of 2024-04-29:

- master: 1.5.6-3
- branch-3.5: 1.5.5-4
- branch-3.4: 1.5.2-5

Fokko commented 5 months ago

@wgtmac Thanks for raising this. Looking at the history, there seem to be some important patches in there: https://github.com/luben/zstd-jni/commits/master/?before=c77a7658aeba94ccd3da52e61ce87d06e7292826+35

We can't backport this to the 3.5 branch anyway. So, we're probably good with keeping this in line with the Spark main branch. WDYT?

wgtmac commented 5 months ago

So we would just need to pair 1.14.0 with Apache Spark 4.0.0? I'm not sure that is a good idea. It would be better if Apache Spark 3.5.2 could adopt Parquet 1.14.0.

Fokko commented 5 months ago

I think Spark doesn't allow backporting dependency upgrades unless there are CVEs :(

wgtmac commented 5 months ago

I'm not sure that is the case. For Apache ORC, we use the following mapping to pair it with Apache Spark, and we always update Apache Spark to the latest minor version: