NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] scala213 maven-source-plugin failed "Java heap space" intermittently with JDK17 #11225

Open pxLi opened 2 months ago

pxLi commented 2 months ago

Describe the bug
We first saw this in rapids_scala213_nightly-dev-github, run 216:

[2024-07-17T12:15:06.859Z] [INFO] --- maven-source-plugin:3.0.0:jar-no-fork (attach-source) @ rapids-4-spark-sql_2.13 ---
[2024-07-17T12:15:07.114Z] [INFO] Building jar: /home/jenkins/agent/workspace/jenkins-rapids_scala213_nightly-dev-github-216/scala2.13/sql-plugin/target/spark350/rapids-4-spark-sql_2.13-24.08.0-SNAPSHOT-sources.jar
[2024-07-17T12:15:07.114Z] [132.803s][warning][gc,alloc] pool-11-thread-7: Retried waiting for GCLocker too often allocating 1250002 words
[2024-07-17T12:15:07.114Z] [132.803s][warning][gc,alloc] pool-11-thread-6: Retried waiting for GCLocker too often allocating 1250002 words
...
[2024-07-17T12:15:13.633Z] [ERROR] Failed to execute goal org.apache.maven.plugins:maven-source-plugin:3.0.0:jar-no-fork (attach-source) 
on project rapids-4-spark-sql_2.13: Error creating source archive: 
Problem creating jar: Execution exception: Java heap space -> [Help 1]

It seems the recently merged change that switched the Scala 2.13 build to JDK 17 requires some JVM option adjustments.

If this is expected, we need to give the build a larger -Xmx or change the default GC strategy, either through MAVEN_OPTS or directly in the maven-source-plugin's JVM config.
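As a rough sketch of the MAVEN_OPTS route, the snippet below raises the heap limit and selects a different collector for the JVM that runs Maven itself. The 4g heap size and the choice of ParallelGC are assumptions for illustration only; neither value has been validated against this build.

```shell
# Illustrative values only: the 4g heap and ParallelGC are assumptions,
# not settings confirmed to fix this issue.
export MAVEN_OPTS="-Xmx4g -XX:+UseParallelGC"

# Print the value so it shows up in the CI log next to the java -version output.
echo "MAVEN_OPTS=$MAVEN_OPTS"

# Run the usual mvn build command from this same shell so it inherits MAVEN_OPTS.
```

Setting the options through the environment keeps the change in the CI job rather than in the pom, which makes it easy to revert if it does not help.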

Steps/Code to reproduce bug
Build the scala213 shims with Maven 3.6.3 and default options (the failure does not always reproduce).

[2024-07-17T10:09:58.780Z] + java -version
[2024-07-17T10:09:58.781Z] openjdk version "17.0.11" 2024-04-16
[2024-07-17T10:09:58.781Z] OpenJDK Runtime Environment (build 17.0.11+9-Ubuntu-120.04.2)
[2024-07-17T10:09:58.781Z] OpenJDK 64-Bit Server VM (build 17.0.11+9-Ubuntu-120.04.2, mixed mode, sharing)

Expected behavior
The build process should be reliable.

pxLi commented 2 months ago

also cc @razajafri @gerashegalov @NvTimLiu

NvTimLiu commented 2 months ago

We have not observed the issue since @gerashegalov separated the 'scala doc' goal from the main build process and started generating it in a new JVM process.

gerashegalov commented 2 months ago

Do we have any other known occurrences of "Retried waiting for GCLocker" in our build and test runs? We may need a separate issue if it is triggered by our code.
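One way to check for other occurrences, assuming the build and test logs have been downloaded into a local directory, is a recursive search for the warning text. The helper name `count_gclocker_warnings` and the directory argument are hypothetical; this is a minimal sketch, not an existing script in the repository.

```shell
# Hypothetical helper: print "file:count" for every log file under the given
# directory that contains the GCLocker warning at least once.
count_gclocker_warnings() {
  log_dir="${1:-.}"
  # -r recurses, -c counts matching lines per file, -F treats the pattern as a fixed string.
  # The second grep drops files with zero matches.
  grep -rcF "Retried waiting for GCLocker" "$log_dir" | grep -v ':0$'
}
```

Calling `count_gclocker_warnings path/to/logs` lists only the affected log files, which should make it easier to tell whether the warning is limited to the source-jar step or also appears in test runs.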