apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Java] Remove Java 8 support in Arrow v18 #38051

Closed danepitkin closed 2 months ago

danepitkin commented 11 months ago

Describe the enhancement requested

  1. Java 8 is holding back adoption of newer Java features, for example the Java Platform Module System (JPMS)[1], which was introduced in Java 9.
  2. Java 8 is preventing Arrow from using latest packages/dependencies in some places. See examples[2][3][4].
  3. Arrow Java is quite stable, so Java 8 users who aren't interested in upgrading Java versions should be fine pinning the Arrow dependency.
  4. Java 8 is on the decline, and is not the most used Java version in 2023[5].

[1] https://en.wikipedia.org/wiki/Java_Platform_Module_System
[2] https://github.com/apache/arrow/blob/main/dev/release/verify-release-candidate.sh#L571
[3] https://github.com/apache/arrow/pull/37723#discussion_r1330578945
[4] https://github.com/apache/arrow/pull/13072#issuecomment-1731904205
[5] https://newrelic.com/sites/default/files/2023-04/new-relic-2023-state-of-the-java-ecosystem-2023-04-20.pdf

Post-upgrade tasks

Component(s)

Java

kou commented 11 months ago

The discussion thread on dev@arrow.apache.org: https://lists.apache.org/thread/s07jx58yw4mkl54t3bkggnyg0sftcrr8

davisusanibar commented 11 months ago

In addition, the following dependencies are pinned for JDK8:

danepitkin commented 11 months ago

Apache Spark has dropped support for Java 8 and 11 on the main branch (targeting a 4.0 release) https://github.com/apache/spark/pull/43005

Edit: Spark 4.0 release timeframe is 2024-06[1]

[1]https://lists.apache.org/thread/xhkgj60j361gdpywoxxz7qspp2w80ry6

danepitkin commented 11 months ago

Netty 5.0 will remove support for Java 8 https://github.com/netty/netty/pull/10650

danepitkin commented 10 months ago

The current consensus on the Arrow mailing list[1] is to postpone Java 8 deprecation and to revisit it when Spark releases 4.0, which deprecates Java 8 (~2024-06).

[1] https://lists.apache.org/thread/kml53f81z1oskcf00xl7wlbcjssmn91g

danepitkin commented 10 months ago

Apache Derby continuously drops support for older JDK versions https://github.com/apache/arrow/pull/38813

kevingurney commented 5 months ago

My apologies!

I unpinned this issue by mistake: I thought I had accidentally pinned it only for myself. I have just repinned it.

danepitkin commented 5 months ago

Apache Iceberg is considering dropping Java 8 support https://lists.apache.org/thread/ntrk2thvsg9tdccwd4flsdz9gg743368

danepitkin commented 5 months ago

New mailing list discussion: https://lists.apache.org/thread/65vqpmrrtpshxo53572zcv91j1lb2y8g

thisisnic commented 4 months ago

Apologies, I also unpinned it thinking this was just my GitHub view :joy:

normanj-bitquill commented 3 months ago

I've looked into this and have some notes.

Java Modules

When compiling Java code in Java 9 or higher, you can use both the classpath and the module-path.

Maven with Java Modules

Maven may choose to use both the classpath and module-path.

Getting Started

A first step in migrating to Java 11 would be to remove (or hide) the module-info.java files. This would cause Maven to put everything on the classpath and would avoid build issues. We would not be distributing any module information, so consumers would have to either treat Arrow modules as automatic Java modules or put them on the classpath.

Without the module-info.java files, IntelliJ can easily resolve dependencies and is able to run unit tests.

Longer Term

Longer term, we should include proper module-info.java files in all Arrow modules. Not all of Arrow's dependencies have a module-info.java file, such as flatbuffers-java. It is not reliable to treat these as automatic Java modules during build, since that depends on the file name. We could either shade in the java classes or keep such dependencies on the classpath. If they are on the classpath, then we cannot declare any dependency on them in the module-info.java file and consumers may need extra flags when compiling/running projects depending on Arrow.
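To illustrate why the file name matters, here is a minimal sketch using the JDK's `java.lang.module.ModuleFinder`. It writes two JARs that contain only a manifest and asks the JDK which module name it derives for each. The class name `AutomaticModuleNameDemo`, the helper `describeJarNamed`, and both JAR file names are made up for the illustration; they are not taken from Arrow's build.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.lang.module.ModuleDescriptor;
import java.lang.module.ModuleFinder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical demo class, not part of Arrow.
final class AutomaticModuleNameDemo {

    /**
     * Writes a JAR that contains only a manifest (no module-info.class and no
     * Automatic-Module-Name attribute) under the given file name, then returns
     * the module descriptor the JDK derives for it.
     */
    static ModuleDescriptor describeJarNamed(String jarFileName) {
        try {
            Path dir = Files.createTempDirectory("automatic-module-demo");
            dir.toFile().deleteOnExit();
            Path jar = dir.resolve(jarFileName);

            Manifest manifest = new Manifest();
            manifest.getMainAttributes().putValue("Manifest-Version", "1.0");
            try (OutputStream out = Files.newOutputStream(jar);
                 JarOutputStream jarOut = new JarOutputStream(out, manifest)) {
                // Intentionally empty: the manifest alone is enough for this demo.
            }
            jar.toFile().deleteOnExit();

            // The finder treats a plain JAR as an automatic module.
            return ModuleFinder.of(jar).findAll().iterator().next().descriptor();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative file names: the same library under two different names.
        for (String name : new String[] {"flatbuffers-java-1.12.0.jar", "fb.jar"}) {
            ModuleDescriptor d = describeJarNamed(name);
            System.out.println(name + " -> " + d.name() + " (automatic: " + d.isAutomatic() + ")");
        }
    }
}
```

With the JDK's naming rules, the first file should resolve to the module name `flatbuffers.java` and the second to `fb`, so a `requires` clause written against one name breaks as soon as the JAR is renamed. A library can avoid this by setting the `Automatic-Module-Name` manifest attribute, which takes precedence over the file name.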

I recommend shading in legacy dependencies. This eases the burden for consumers of Arrow libraries. We would not expose packages from those libraries. Consumers can simply add Arrow libraries to the module path without needing flags to grant Arrow modules access to the UNNAMED module.

Some dependencies are obsolete, such as jsr305. We should migrate away from obsolete dependencies. The ThreadSafe annotation could be useful, but it is becoming increasingly unlikely that anyone would consume it.

laurentgo commented 3 months ago

Do you know why module-info.java files were added in the first place? It seems weird to have to remove them because arrow is moving to java 9+, and I guess it could be considered as a public api breakage?

I also haven't observed any change of behavior from "Maven" based on the presence or absence of module-info.java either. Maybe it's a plugin thing? Do you have pointers?

jduo commented 3 months ago

> Do you know why module-info.java files were added in the first place? It seems weird to have to remove them because arrow is moving to java 9+, and I guess it could be considered as a public api breakage?
>
> I also haven't observed any change of behavior from "Maven" based on the presence or absence of module-info.java either. Maybe it's a plugin thing? Do you have pointers?

The module-info.java files were added to support JPMS in Arrow 17.

When running Surefire and Failsafe, Maven will put JARs with a module-info.class file on the module-path instead of the classpath (when running on a JDK newer than 8). IIRC there's an option to force using the classpath instead.

laurentgo commented 3 months ago

> The module-info.java files were added to support JPMS in Arrow 17.

Arrow 16, you meant? Still, why was JPMS support needed? Other projects like Iceberg and Parquet do not provide JPMS support. The description of #13072 goes over some of the supposed benefits of JPMS, but names no concrete issue the project is trying to solve, and now we seem to be discussing (temporarily) removing JPMS support in order to move to Java 11? Something doesn't add up.

normanj-bitquill commented 3 months ago

@jduo There is no option to force using the classpath. You are probably thinking of "useModulePath", which can be true or false. When you target Java 9 or higher, that only controls what happens to dependencies that do not have a module-info.java file. Maven will always use the module-path for dependencies with a module-info.java file.
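To see which path a dependency actually ended up on, one option is to ask a class for its module at runtime. This is a minimal sketch using only the JDK's `Class.getModule()` API. The class name `ModulePlacementCheck` and the helper `placementOf` are hypothetical, and the check says nothing about how Maven made its decision, only what the result was.

```java
// Hypothetical helper, not part of Arrow or Maven.
final class ModulePlacementCheck {

    /**
     * Reports whether a class was loaded from the classpath (it then lives in
     * an unnamed module) or from a named module on the module-path.
     */
    static String placementOf(Class<?> type) {
        Module module = type.getModule();
        return module.isNamed()
                ? "named module " + module.getName()
                : "unnamed module (classpath)";
    }

    public static void main(String[] args) {
        // JDK classes always live in named modules.
        System.out.println("String -> " + placementOf(String.class));
        // A class started from the classpath lives in an unnamed module.
        System.out.println("this class -> " + placementOf(ModulePlacementCheck.class));
    }
}
```

Calling `placementOf` from a unit test on any class from the dependency in question would show whether the test run placed it on the module-path. Note that automatic modules are named modules too, so `isNamed()` alone does not prove that a real module-info.class was used; `module.getDescriptor().isAutomatic()` distinguishes the two cases.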

normanj-bitquill commented 3 months ago

This work is intended for Arrow 18. I was looking for a way to split up the work. I am not suggesting removing a feature from Arrow for Arrow 18.

There are issues with the current module-info.java files. They are making use of automatic module names, which are based off the name of the Jar file. This is not reliable, and also needs to be fixed.

Given the sensitivity here, it looks like everything must be solved in one commit.

laurentgo commented 3 months ago

> @jduo There is no option to force using the classpath. You are probably thinking of "useModulePath", which can be true or false. When you target Java 9 or higher, that only controls what happens to dependencies that do not have a module-info.java file. Maven will always use the module-path for dependencies with a module-info.java file.

But since code is tested with Java 11 and higher, doesn't it mean that this already works?

> There are issues with the current module-info.java files. They are making use of automatic module names, which are based off the name of the Jar file. This is not reliable, and also needs to be fixed.

It seems to be a separate issue from this one, isn't it?

normanj-bitquill commented 3 months ago

This didn't show up yet since the target version of Java is 1.8.

The Maven compiler plugin cares about what the target version of Java is. Currently Arrow targets Java 1.8, so all libraries are placed on the classpath (even if using JDK 11). When targeting Java 9 or higher, Maven compiler plugin will start to look for "module-info.java" files and decide on whether libraries belong in the classpath or module-path.

Use of automatic modules is a separate issue, but may get higher visibility once Java 11 is the minimum for Arrow. More users may start to make use of the JPMS modules.

Switching Arrow to Java 11 is not as simple as changing only the target version of Java. That will cause the Maven compiler plugin to use the module-path for most dependencies and expose issues with the existing module-info.java files. I suspect that the module-info.java files were only tested at runtime (with unit tests), not at compile time, since the target version of Java was always 1.8. Trying to verify this.

normanj-bitquill commented 3 months ago

I've looked into the CI builds using JDK 11. Those builds still target Java 1.8 when compiling Java code.

laurentgo commented 2 months ago

As the proof is in the pudding, I took a stab at dropping JDK 8 support and created a pull request

danepitkin commented 2 months ago

Issue resolved by pull request 43139 https://github.com/apache/arrow/pull/43139