NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] Scala unit test crashed with SIGSEGV in RegularExpressionTranspilerSuite intermittently #8418

Open jlowe opened 1 year ago

jlowe commented 1 year ago

Last night's nightly dev build crashed with a SIGSEGV in RegularExpressionTranspilerSuite when running on Spark 3.2.3. The test logs indicate the crash most likely happened during the "compare CPU and GPU: regexp find fuzz test with limited chars" test. The hs_err pid file indicates it occurred while copying GPU data back to the host in preparation for comparing results.

razajafri commented 1 year ago

I see it happening on my pre-merge run for Spark 3.4.0, but there it occurs in "compare CPU and GPU: replace digits".

pxLi commented 1 year ago

This seems to be occurring more frequently than before:

[2023-07-21T06:56:23.473Z] #
[2023-07-21T06:56:23.473Z] # A fatal error has been detected by the Java Runtime Environment:
[2023-07-21T06:56:23.473Z] #
[2023-07-21T06:56:23.473Z] #  SIGSEGV (0xb) at pc=0x00007fd3f782a606, pid=1093836, tid=0x00007fd402310700
[2023-07-21T06:56:23.473Z] #
[2023-07-21T06:56:23.473Z] # JRE version: OpenJDK Runtime Environment (8.0_362-b09) (build 1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09)
[2023-07-21T06:56:23.473Z] # Java VM: OpenJDK 64-Bit Server VM (25.362-b09 mixed mode linux-amd64 compressed oops)
[2023-07-21T06:56:23.473Z] # Problematic frame:
[2023-07-21T06:56:23.473Z] # J 70664 C2 ai.rapids.cudf.ColumnView.copyToHost()Lai/rapids/cudf/HostColumnVector; (1738 bytes) @ 0x00007fd3f782a606 [0x00007fd3f7829940+0xcc6]
[2023-07-21T06:56:23.473Z] #
[2023-07-21T06:56:23.473Z] # Core dump written. Default location: /home/jenkins/agent/workspace/jenkins-rapids_premerge-github-7579-ci-2/tests/core or core.1093836
[2023-07-21T06:56:23.473Z] #
[2023-07-21T06:56:23.473Z] # If you would like to submit a bug report, please visit:
[2023-07-21T06:56:23.473Z] #   http://bugreport.java.com/bugreport/crash.jsp
[2023-07-21T06:56:23.473Z] #
[2023-07-21T06:56:23.473Z] 
[2023-07-21T06:56:23.473Z] ---------------  T H R E A D  ---------------
[2023-07-21T06:56:23.473Z] 
[2023-07-21T06:56:23.473Z] Current thread (0x00007fd3fc015800):  JavaThread "ScalaTest-main-running-RegularExpressionTranspilerSuite" [_thread_in_Java, id=1093839, stack(0x00007fd401f11000,0x00007fd402311000)]
[2023-07-21T06:56:23.473Z] 
[2023-07-21T06:56:23.473Z] siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000018
[2023-07-21T06:56:23.473Z] 
[2023-07-21T06:56:23.473Z] Registers:
[2023-07-21T06:56:23.473Z] RAX=0x0000000000000000, RBX=0x000000000000006d, RCX=0x000000000000000b, RDX=0x000000077ed7d048
[2023-07-21T06:56:23.473Z] RSP=0x00007fd40230c700, RBP=0x00000006c0636c20, RSI=0x000000077ed7d048, RDI=0x00000000d80c6e5c
[2023-07-21T06:56:23.473Z] R8 =0x00000000d831cc59, R9 =0x00000000d831cc6d, R10=0x0000000000000000, R11=0x0000000000000000
[2023-07-21T06:56:23.473Z] R12=0x0000000000000000, R13=0x0000000000000001, R14=0x0000000000000008, R15=0x00007fd3fc015800
[2023-07-21T06:56:23.474Z] RIP=0x00007fd3f782a606, EFLAGS=0x0000000000010206, CSGSFS=0x002b000000000033, ERR=0x0000000000000004
[2023-07-21T06:56:23.474Z]   TRAPNO=0x000000000000000e
[2023-07-21T06:56:23.474Z] 
[2023-07-21T06:56:23.474Z] Top of Stack: (sp=0x00007fd40230c700)
[2023-07-21T06:56:23.474Z] 0x00007fd40230c700:   00000006d80c6d80 00000006c0636ac8
[2023-07-21T06:56:23.474Z] 0x00007fd40230c710:   00000006c18e7668 0000000000000000
[2023-07-21T06:56:23.474Z] 0x00007fd40230c720:   00000007c0011600 00000007c041d100
[2023-07-21T06:56:23.474Z] 0x00007fd40230c730:   00000006c1c15578 0000061600000616
[2023-07-21T06:56:23.474Z] 0x00007fd40230c740:   000000077ed7cfe8 000000077ed7cf30
[2023-07-21T06:56:23.474Z] 0x00007fd40230c750:   0000000000000000 d831cecdd8382aaf
[2023-07-21T06:56:23.474Z] 0x00007fd40230c760:   0000000000000000 000000077ed7bae0
[2023-07-21T06:56:23.474Z] 0x00007fd40230c770:   00000000000003e8 00000006c18e7668
[2023-07-21T06:56:23.474Z] 0x00007fd40230c780:   d83ad047d83e87e4 00000006c1f43f20
[2023-07-21T06:56:23.474Z] 0x00007fd40230c790:   00000017e8001000 00000003d83676a6
[2023-07-21T06:56:23.474Z] 0x00007fd40230c7a0:   000000060000001e 000000077ed7bae0
[2023-07-21T06:56:23.474Z] 0x00007fd40230c7b0:   00007fd40230c7f0 00007fd40230c7e0
[2023-07-21T06:56:23.474Z] 0x00007fd40230c7c0:   00007fd2554d0938 00007fd25c5395d8
[2023-07-21T06:56:23.474Z] 0x00007fd40230c7d0:   000000077ed7cf08 00007fd3f7829960
[2023-07-21T06:56:23.474Z] 0x00007fd40230c7e0:   00000006c1c15578 0000000000000202
[2023-07-21T06:56:23.474Z] 0x00007fd40230c7f0:   000000077ed7bae0 00007fd3f781618c
[2023-07-21T06:56:23.474Z] 0x00007fd40230c800:   00007fd40230c870 0000000000000000
[2023-07-21T06:56:23.474Z] 0x00007fd40230c810:   00000000d83e87e4 00007fd3f25a0530
[2023-07-21T06:56:23.474Z] 0x00007fd40230c820:   000000077ed7bae0 000002ecd83e87e4
[2023-07-21T06:56:23.474Z] 0x00007fd40230c830:   00000047d81963c7 00000006c1f43f20
[2023-07-21T06:56:23.474Z] 0x00007fd40230c840:   00000006c0cb1e38 00000006c3fe52c8
[2023-07-21T06:56:23.474Z] 0x00007fd40230c850:   000000077ed7bb30 0000000000000000
[2023-07-21T06:56:23.474Z] 0x00007fd40230c860:   00000006c012cda8 00000000efdaf5c9
[2023-07-21T06:56:23.474Z] 0x00007fd40230c870:   000000077ed77178 00007fd3f7813ebc
[2023-07-21T06:56:23.474Z] 0x00007fd40230c880:   00007fd40230c8d0 00007fd3f07fd228
[2023-07-21T06:56:23.474Z] 0x00007fd40230c890:   000000077ed7ba90 eaad5e462c17a100
[2023-07-21T06:56:23.474Z] 0x00007fd40230c8a0:   00007fd40230c8f0 00007fd265ed8878
[2023-07-21T06:56:23.474Z] 0x00007fd40230c8b0:   00007fd3fc015800 00007fd40230fa90
[2023-07-21T06:56:23.474Z] 0x00007fd40230c8c0:   000000077ed77178 00007fd3f77cf70c
[2023-07-21T06:56:23.474Z] 0x00007fd40230c8d0:   00007fd3fc015800 0000000000000000
[2023-07-21T06:56:23.474Z] 0x00007fd40230c8e0:   000000077ed77178 00007fd3f780142c
[2023-07-21T06:56:23.474Z] 0x00007fd40230c8f0:   000000077ed7bae0 000000077ed19c90 
[2023-07-21T06:56:23.474Z] 
[2023-07-21T06:56:23.474Z] Instructions: (pc=0x00007fd3f782a606)
[2023-07-21T06:56:23.474Z] 0x00007fd3f782a5e6:   25 05 00 00 45 33 d2 83 fb 42 0f 86 37 1b 00 00
[2023-07-21T06:56:23.474Z] 0x00007fd3f782a5f6:   4c 8b 54 24 50 4c 89 54 24 18 43 c6 44 c4 52 01
[2023-07-21T06:56:23.474Z] 0x00007fd3f782a606:   4d 8b 4a 18 41 ba 70 c8 09 f8 49 c1 e2 03 4c 89
[2023-07-21T06:56:23.474Z] 0x00007fd3f782a616:   94 24 80 00 00 00 49 ba b0 ae b9 c0 06 00 00 00 
[2023-07-21T06:56:23.474Z] 
[2023-07-21T06:56:23.474Z] Register to memory mapping:
[2023-07-21T06:56:23.474Z] 
[2023-07-21T06:56:23.474Z] RAX=0x0000000000000000 is an unknown value
[2023-07-21T06:56:23.474Z] RBX=0x000000000000006d is an unknown value
[2023-07-21T06:56:23.474Z] RCX=0x000000000000000b is an unknown value
[2023-07-21T06:56:23.474Z] RDX=0x000000077ed7d048 is an oop
[2023-07-21T06:56:23.474Z] org.mockito.internal.util.concurrent.WeakConcurrentMap$LatentKey 
[2023-07-21T06:56:23.474Z]  - klass: 'org/mockito/internal/util/concurrent/WeakConcurrentMap$LatentKey'
[2023-07-21T06:56:23.474Z] RSP=0x00007fd40230c700 is pointing into the stack for thread: 0x00007fd3fc015800
[2023-07-21T06:56:23.474Z] RBP=0x00000006c0636c20 is an oop
[2023-07-21T06:56:23.474Z] org.mockito.internal.creation.bytebuddy.MockMethodAdvice 
[2023-07-21T06:56:23.474Z]  - klass: 'org/mockito/internal/creation/bytebuddy/MockMethodAdvice'
[2023-07-21T06:56:23.474Z] RSI=0x000000077ed7d048 is an oop
[2023-07-21T06:56:23.474Z] org.mockito.internal.util.concurrent.WeakConcurrentMap$LatentKey 
[2023-07-21T06:56:23.474Z]  - klass: 'org/mockito/internal/util/concurrent/WeakConcurrentMap$LatentKey'
[2023-07-21T06:56:23.474Z] RDI=0x00000000d80c6e5c is an unknown value
[2023-07-21T06:56:23.474Z] R8 =0x00000000d831cc59 is an unknown value
[2023-07-21T06:56:23.474Z] R9 =0x00000000d831cc6d is an unknown value
[2023-07-21T06:56:23.474Z] R10=0x0000000000000000 is an unknown value
[2023-07-21T06:56:23.474Z] R11=0x0000000000000000 is an unknown value
[2023-07-21T06:56:23.474Z] R12=0x0000000000000000 is an unknown value
[2023-07-21T06:56:23.474Z] R13=0x0000000000000001 is an unknown value
[2023-07-21T06:56:23.474Z] R14=0x0000000000000008 is an unknown value
[2023-07-21T06:56:23.474Z] R15=0x00007fd3fc015800 is a thread
[2023-07-21T06:56:23.474Z] 
[2023-07-21T06:56:23.474Z] 
[2023-07-21T06:56:23.474Z] Stack: [0x00007fd401f11000,0x00007fd402311000],  sp=0x00007fd40230c700,  free space=4077k
[2023-07-21T06:56:23.474Z] Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
[2023-07-21T06:56:23.474Z] J 70664 C2 ai.rapids.cudf.ColumnView.copyToHost()Lai/rapids/cudf/HostColumnVector; (1738 bytes) @ 0x00007fd3f782a606 [0x00007fd3f7829940+0xcc6]
[2023-07-21T06:56:23.474Z] J 70656 C1 com.nvidia.spark.rapids.RegularExpressionTranspilerSuite$$Lambda$13416.apply(Ljava/lang/Object;)Ljava/lang/Object; (12 bytes) @ 0x00007fd3f781618c [0x00007fd3f7815ea0+0x2ec]
[2023-07-21T06:56:23.474Z] J 17927 C2 com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object; (78 bytes) @ 0x00007fd3edfb1a54 [0x00007fd3edfb1780+0x2d4]
[2023-07-21T06:56:23.474Z] J 70648 C1 com.nvidia.spark.rapids.RegularExpressionTranspilerSuite.$anonfun$gpuContains$1(Ljava/lang/String;[ZLai/rapids/cudf/ColumnVector;)V (49 bytes) @ 0x00007fd3f781117c [0x00007fd3f7810820+0x95c]
[2023-07-21T06:56:23.474Z] J 70647 C1 com.nvidia.spark.rapids.RegularExpressionTranspilerSuite$$Lambda$13408.apply(Ljava/lang/Object;)Ljava/lang/Object; (16 bytes) @ 0x00007fd3f781210c [0x00007fd3f7811f20+0x1ec]
[2023-07-21T06:56:23.474Z] J 17927 C2 com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object; (78 bytes) @ 0x00007fd3edfb1a54 [0x00007fd3edfb1780+0x2d4]
[2023-07-21T06:56:23.474Z] J 70638 C1 com.nvidia.spark.rapids.RegularExpressionTranspilerSuite.gpuContains(Ljava/lang/String;Lscala/collection/Seq;)[Z (62 bytes) @ 0x00007fd3f780981c [0x00007fd3f7809060+0x7bc]
[2023-07-21T06:56:23.474Z] J 70654 C1 com.nvidia.spark.rapids.RegularExpressionTranspilerSuite.$anonfun$assertCpuGpuMatchesRegexpFind$2(Lcom/nvidia/spark/rapids/RegularExpressionTranspilerSuite;Lscala/collection/Seq;Lscala/Tuple2;)V (275 bytes) @ 0x00007fd3f781a8cc [0x00007fd3f781a240+0x68c]
[2023-07-21T06:56:23.474Z] J 70652 C1 com.nvidia.spark.rapids.RegularExpressionTranspilerSuite$$Lambda$13413.apply(Ljava/lang/Object;)Ljava/lang/Object; (16 bytes) @ 0x00007fd3f781594c [0x00007fd3f7815760+0x1ec]
[2023-07-21T06:56:23.474Z] J 12668 C2 scala.collection.TraversableLike$WithFilter$$Lambda$80.apply(Ljava/lang/Object;)Ljava/lang/Object; (13 bytes) @ 0x00007fd3ee3241ec [0x00007fd3ee323bc0+0x62c]
[2023-07-21T06:56:23.474Z] J 18321 C2 scala.collection.mutable.ArrayBuffer.foreach(Lscala/Function1;)V (6 bytes) @ 0x00007fd3ede9f428 [0x00007fd3ede9f3a0+0x88]
[2023-07-21T06:56:23.474Z] J 34962 C2 scala.collection.TraversableLike$WithFilter.foreach(Lscala/Function1;)V (17 bytes) @ 0x00007fd3f220992c [0x00007fd3f2209860+0xcc]
[2023-07-21T06:56:23.474Z] j  com.nvidia.spark.rapids.RegularExpressionTranspilerSuite.assertCpuGpuMatchesRegexpFind(Lscala/collection/Seq;Lscala/collection/Seq;)V+36
[2023-07-21T06:56:23.475Z] j  com.nvidia.spark.rapids.RegularExpressionTranspilerSuite.doFuzzTest(Lscala/Option;Lcom/nvidia/spark/rapids/RegexMode;)V+357
[2023-07-21T06:56:23.475Z] j  com.nvidia.spark.rapids.RegularExpressionTranspilerSuite.$anonfun$new$93(Lcom/nvidia/spark/rapids/RegularExpressionTranspilerSuite;)V+19
[2023-07-21T06:56:23.475Z] j  com.nvidia.spark.rapids.RegularExpressionTranspilerSuite$$Lambda$1815.apply$mcV$sp()V+4
[2023-07-21T06:56:23.475Z] J 24113 C2 scala.runtime.java8.JFunction0$mcV$sp.apply()Ljava/lang/Object; (10 bytes) @ 0x00007fd3f00eb19c [0x00007fd3f00eb160+0x3c]
[2023-07-21T06:56:23.475Z] J 61816 C1 org.scalatest.OutcomeOf.outcomeOf(Lscala/Function0;)Lorg/scalatest/Outcome; (138 bytes) @ 0x00007fd3f629250c [0x00007fd3f6292400+0x10c]
[2023-07-21T06:56:23.475Z] J 61812 C1 org.scalatest.Transformer.apply()Ljava/lang/Object; (5 bytes) @ 0x00007fd3f408285c [0x00007fd3f4082660+0x1fc]
[2023-07-21T06:56:23.475Z] J 62490 C1 org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply()Lorg/scalatest/Outcome; (19 bytes) @ 0x00007fd3f64abc84 [0x00007fd3f64aba60+0x224]
[2023-07-21T06:56:23.475Z] J 59622 C1 org.scalatest.funsuite.AnyFunSuite.withFixture(Lorg/scalatest/TestSuite$NoArgTest;)Lorg/scalatest/Outcome; (6 bytes) @ 0x00007fd3f140e96c [0x00007fd3f140e7e0+0x18c]
[2023-07-21T06:56:23.475Z] J 62485 C1 org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(Lorg/scalatest/SuperEngine$TestLeaf;Lorg/scalatest/Args;Ljava/lang/String;)Lorg/scalatest/Outcome; (35 bytes) @ 0x00007fd3f64a3ac4 [0x00007fd3f64a3860+0x264]
[2023-07-21T06:56:23.475Z] J 59611 C1 org.scalatest.funsuite.AnyFunSuiteLike$$Lambda$1987.apply(Ljava/lang/Object;)Ljava/lang/Object; (20 bytes) @ 0x00007fd3ef0227dc [0x00007fd3ef022620+0x1bc]
[2023-07-21T06:56:23.475Z] J 59591 C1 org.scalatest.SuperEngine.runTestImpl(Lorg/scalatest/Suite;Ljava/lang/String;Lorg/scalatest/Args;ZLscala/Function1;)Lorg/scalatest/Status; (1462 bytes) @ 0x00007fd3f1680144 [0x00007fd3f167b5c0+0x4b84]
[2023-07-21T06:56:23.475Z] J 58719 C1 org.scalatest.funsuite.AnyFunSuite.runTest(Ljava/lang/String;Lorg/scalatest/Args;)Lorg/scalatest/Status; (7 bytes) @ 0x00007fd3ef445e84 [0x00007fd3ef4459a0+0x4e4]
[2023-07-21T06:56:23.475Z] J 57904 C1 org.scalatest.funsuite.AnyFunSuiteLike$$Lambda$1981.apply(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (16 bytes) @ 0x00007fd3f100786c [0x00007fd3f1007560+0x30c]
[2023-07-21T06:56:23.475Z] J 57848 C1 org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Lorg/scalatest/SuperEngine;Lorg/scalatest/Args;Lorg/scalatest/Suite;Lorg/scalatest/SuperEngine$Branch;Lscala/collection/mutable/ListBuffer;Lscala/Function2;ZLorg/scalatest/SuperEngine$Node;)Ljava/lang/Object; (580 bytes) @ 0x00007fd3f100b95c [0x00007fd3f10080c0+0x389c]
[2023-07-21T06:56:23.475Z] J 57847 C1 org.scalatest.SuperEngine$$Lambda$1982.apply(Ljava/lang/Object;)Ljava/lang/Object; (36 bytes) @ 0x00007fd3ef1c062c [0x00007fd3ef1c0480+0x1ac]
[2023-07-21T06:56:23.475Z] J 17218 C2 scala.collection.immutable.List.foreach(Lscala/Function1;)V (32 bytes) @ 0x00007fd3eedbeb6c [0x00007fd3eedbeac0+0xac]
[2023-07-21T06:56:23.475Z] j  org.scalatest.SuperEngine.traverseSubNodes$1(Lorg/scalatest/SuperEngine$Branch;Lorg/scalatest/Args;Lorg/scalatest/Suite;Lscala/collection/mutable/ListBuffer;Lscala/Function2;Z)V+22
[2023-07-21T06:56:23.475Z] j  org.scalatest.SuperEngine.runTestsInBranch(Lorg/scalatest/Suite;Lorg/scalatest/SuperEngine$Branch;Lorg/scalatest/Args;ZLscala/Function2;)Lorg/scalatest/Status;+191
[2023-07-21T06:56:23.475Z] j  org.scalatest.SuperEngine.runTestsImpl(Lorg/scalatest/Suite;Lscala/Option;Lorg/scalatest/Args;Lorg/scalatest/Informer;ZLscala/Function2;)Lorg/scalatest/Status;+389
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuiteLike.runTests(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+22
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuiteLike.runTests$(Lorg/scalatest/funsuite/AnyFunSuiteLike;Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuite.runTests(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.run(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+214
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.run$(Lorg/scalatest/Suite;Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(Lorg/scalatest/funsuite/AnyFunSuiteLike;Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuiteLike$$Lambda$1975.apply(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+12
[2023-07-21T06:56:23.475Z] j  org.scalatest.SuperEngine.runImpl(Lorg/scalatest/Suite;Lscala/Option;Lorg/scalatest/Args;Lscala/Function2;)Lorg/scalatest/Status;+354
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuiteLike.run(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+15
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuiteLike.run$(Lorg/scalatest/funsuite/AnyFunSuiteLike;Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.funsuite.AnyFunSuite.run(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.callExecuteOnSuite$1(Lorg/scalatest/Suite;Lorg/scalatest/Args;Lorg/scalatest/Reporter;)Lorg/scalatest/Status;+185
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.$anonfun$runNestedSuites$1(Lorg/scalatest/Args;Lscala/collection/mutable/ListBuffer;Lorg/scalatest/Reporter;Lorg/scalatest/Suite;)Ljava/lang/Object;+16
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite$$Lambda$1973.apply(Ljava/lang/Object;)Ljava/lang/Object;+16
[2023-07-21T06:56:23.475Z] J 6908 C1 scala.collection.IndexedSeqOptimized.foreach(Lscala/Function1;)V (36 bytes) @ 0x00007fd3ed69a5a4 [0x00007fd3ed69a320+0x284]
[2023-07-21T06:56:23.475Z] J 7234 C1 scala.collection.mutable.ArrayOps$ofRef.foreach(Lscala/Function1;)V (6 bytes) @ 0x00007fd3ee127c3c [0x00007fd3ee127b80+0xbc]
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.runNestedSuites(Lorg/scalatest/Args;)Lorg/scalatest/Status;+157
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.runNestedSuites$(Lorg/scalatest/Suite;Lorg/scalatest/Args;)Lorg/scalatest/Status;+2
[2023-07-21T06:56:23.475Z] j  org.scalatest.tools.DiscoverySuite.runNestedSuites(Lorg/scalatest/Args;)Lorg/scalatest/Status;+2
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.run(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+170
[2023-07-21T06:56:23.475Z] j  org.scalatest.Suite.run$(Lorg/scalatest/Suite;Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.tools.DiscoverySuite.run(Lscala/Option;Lorg/scalatest/Args;)Lorg/scalatest/Status;+3
[2023-07-21T06:56:23.475Z] j  org.scalatest.tools.SuiteRunner.run()V+188
[2023-07-21T06:56:23.475Z] j  org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Lscala/collection/immutable/Set;Lscala/collection/immutable/Set;Lorg/scalatest/DispatchReporter;Lorg/scalatest/Stopper;Lorg/scalatest/ConfigMap;Lscala/runtime/ObjectRef;Lscala/collection/immutable/Set;Lorg/scalatest/tools/SuiteConfig;)V+162
[2023-07-21T06:56:23.475Z] j  org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Lscala/collection/immutable/Set;Lscala/collection/immutable/Set;Lorg/scalatest/DispatchReporter;Lorg/scalatest/Stopper;Lorg/scalatest/ConfigMap;Lscala/runtime/ObjectRef;Lscala/collection/immutable/Set;Lorg/scalatest/tools/SuiteConfig;)Ljava/lang/Object;+12
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$$$Lambda$1971.apply(Ljava/lang/Object;)Ljava/lang/Object;+32
[2023-07-21T06:56:23.476Z] J 7966 C1 scala.collection.immutable.List.foreach(Lscala/Function1;)V (32 bytes) @ 0x00007fd3ed773404 [0x00007fd3ed773240+0x1c4]
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Lorg/scalatest/DispatchReporter;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lorg/scalatest/Stopper;Lscala/collection/immutable/Set;Lscala/collection/immutable/Set;Lorg/scalatest/ConfigMap;ZLscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Ljava/lang/ClassLoader;Lorg/scalatest/tools/RunDoneListener;ILorg/scalatest/tools/ConcurrentConfig;Lscala/Option;Lscala/collection/immutable/Set;Lorg/scalatest/time/Span;)V+1456
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/Set;Lscala/collection/immutable/Set;Lorg/scalatest/ConfigMap;ZLscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lorg/scalatest/tools/ConcurrentConfig;Lscala/Option;Lscala/collection/immutable/Set;Lorg/scalatest/time/Span;Ljava/lang/ClassLoader;Lorg/scalatest/DispatchReporter;)V+49
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/Set;Lscala/collection/immutable/Set;Lorg/scalatest/ConfigMap;ZLscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lscala/collection/immutable/List;Lorg/scalatest/tools/ConcurrentConfig;Lscala/Option;Lscala/collection/immutable/Set;Lorg/scalatest/time/Span;Ljava/lang/ClassLoader;Lorg/scalatest/DispatchReporter;)Ljava/lang/Object;+32
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$$$Lambda$61.apply(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+72
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Lscala/collection/immutable/List;Lorg/scalatest/tools/ReporterConfigurations;Lscala/Option;Lscala/Option;ZJJLscala/Function2;)V+44
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter([Ljava/lang/String;Z)Z+2023
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner$.main([Ljava/lang/String;)V+154
[2023-07-21T06:56:23.476Z] j  org.scalatest.tools.Runner.main([Ljava/lang/String;)V+4
[2023-07-21T06:56:23.476Z] v  ~StubRoutines::call_stub
[2023-07-21T06:56:23.476Z] V  [libjvm.so+0x6a5da5]
[2023-07-21T06:56:23.476Z] V  [libjvm.so+0x72735d]
[2023-07-21T06:56:23.476Z] V  [libjvm.so+0x729f2e]
[2023-07-21T06:56:23.476Z] C  [libjli.so+0x4802]
[2023-07-21T06:56:23.476Z] C  [libjli.so+0x8dc1]
[2023-07-21T06:56:23.476Z] C  [libpthread.so.0+0x8609]  start_thread+0xd9
[2023-07-21T06:56:23.476Z] 
[2023-07-21T06:56:23.476Z] 
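
For comparing occurrences like the ones in this thread, the header fields of each report can be pulled out with a short script. The following is a minimal sketch, not part of spark-rapids or the JDK: the function name `summarize_hs_err` and the regular expressions are assumptions based only on the log layout pasted above (an optional Jenkins timestamp prefix, the `SIGSEGV ... at pc=` line, the line after `# Problematic frame:`, and the `siginfo` line).

```python
import re
from typing import Dict, Optional

# Jenkins prefixes each line with e.g. "[2023-07-21T06:56:23.473Z] "; strip it if present.
_TS_PREFIX = re.compile(r"^\[\d{4}-\d{2}-\d{2}T[\d:.]+Z\]\s?")
# "#  SIGSEGV (0xb) at pc=0x..., pid=1093836, tid=0x..."
_SIGNAL = re.compile(r"#\s+(SIG\w+) \(0x[0-9a-f]+\) at pc=(0x[0-9a-f]+), pid=(\d+)")
# "# J 70664 C2 ai.rapids.cudf.ColumnView.copyToHost()L...; (1738 bytes) @ ..."
_FRAME = re.compile(r"#\s+[JjVvC]\s+(?:\d+\s+(?:C1|C2)\s+)?(\S+)")
# "siginfo: si_signo: 11 (SIGSEGV), ..., si_addr: 0x0000000000000018"
_SI_ADDR = re.compile(r"si_addr: (0x[0-9a-f]+)")


def summarize_hs_err(text: str) -> Dict[str, Optional[str]]:
    """Extract signal, pc, pid, problematic frame and si_addr from an hs_err-style log.

    Hypothetical helper: fields that are not found stay None.
    """
    summary: Dict[str, Optional[str]] = {
        "signal": None, "pc": None, "pid": None, "frame": None, "si_addr": None,
    }
    lines = [_TS_PREFIX.sub("", ln) for ln in text.splitlines()]
    for i, line in enumerate(lines):
        m = _SIGNAL.search(line)
        if m and summary["signal"] is None:
            summary["signal"], summary["pc"], summary["pid"] = m.groups()
        # The frame itself is on the line following the "Problematic frame" marker.
        if line.startswith("# Problematic frame:") and i + 1 < len(lines):
            m = _FRAME.match(lines[i + 1])
            if m:
                # Drop the JVM method descriptor, keep "package.Class.method".
                summary["frame"] = m.group(1).split("(")[0]
        m = _SI_ADDR.search(line)
        if m and summary["si_addr"] is None:
            summary["si_addr"] = m.group(1)
    return summary
```

Running it over each pasted log gives one small dictionary per crash, which makes it easier to see at a glance whether the problematic frame is the same across reports.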
tgravescs commented 1 year ago

Saw this yesterday in our nightly tests:

[2023-09-26T12:44:09.931Z] - test simple OOM split and retry
[2023-09-26T12:44:10.040Z] #
[2023-09-26T12:44:10.040Z] # A fatal error has been detected by the Java Runtime Environment:
[2023-09-26T12:44:10.040Z] #
[2023-09-26T12:44:10.040Z] #  SIGSEGV (0xb) at pc=0x00007fd3a7160369, pid=1996, tid=0x00007fd3b1190700
[2023-09-26T12:44:10.040Z] #
[2023-09-26T12:44:10.040Z] # JRE version: OpenJDK Runtime Environment (8.0_382-b05) (build 1.8.0_382-8u382-ga-1~20.04.1-b05)
[2023-09-26T12:44:10.040Z] # Java VM: OpenJDK 64-Bit Server VM (25.382-b05 mixed mode linux-amd64 compressed oops)
[2023-09-26T12:44:10.040Z] # Problematic frame:
[2023-09-26T12:44:10.040Z] # J 69312 C2 ai.rapids.cudf.ColumnView.copyToHost(Lai/rapids/cudf/HostMemoryAllocator;)Lai/rapids/cudf/HostColumnVector; (1913 bytes) @ 0x00007fd3a7160369 [0x00007fd3a715f6a0+0xcc9]
[2023-09-26T12:44:10.040Z] #
[2023-09-26T12:44:10.040Z] # Core dump written. Default location: /home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-938-w2-938/tests/core or core.1996
[2023-09-26T12:44:10.040Z] #

pxLi commented 3 months ago

We saw another occurrence in rapids_nightly-dev-github run 1169 (only the 330 UT failed):

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f40df860439, pid=144886, tid=0x00007f407adfb700
#
# JRE version: OpenJDK Runtime Environment (8.0_412-b08) (build 1.8.0_412-8u412-ga-1~20.04.1-b08)
# Java VM: OpenJDK 64-Bit Server VM (25.412-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 76154 C2 ai.rapids.cudf.ColumnView.copyToHost(Lai/rapids/cudf/HostMemoryAllocator;)Lai/rapids/cudf/HostColumnVector; (1913 bytes) @ 0x00007f40df860439 [0x00007f40df85f7c0+0xc79]
#
# Core dump written. Default location: /home/jenkins/agent/workspace/jenkins-rapids_nightly-dev-github-1169-w0-1169/tests/core or core.144886
Register to memory mapping:

RAX=0x0000000000000000 is an unknown value
RBX=0x0000000000000000 is an unknown value
RCX=0x0000000000000003 is an unknown value
RDX=0x00000007a40a4010 is an oop
org.mockito.internal.util.concurrent.WeakConcurrentMap$LatentKey 
 - klass: 'org/mockito/internal/util/concurrent/WeakConcurrentMap$LatentKey'
RSP=0x00007f407adf9ad0 is pointing into the stack for thread: 0x00007f4014009800
RBP=0x00000006c14c7100 is an oop
org.mockito.internal.creation.bytebuddy.MockMethodAdvice 
 - klass: 'org/mockito/internal/creation/bytebuddy/MockMethodAdvice'
RSI=0x00000007a40a4010 is an oop
org.mockito.internal.util.concurrent.WeakConcurrentMap$LatentKey 
 - klass: 'org/mockito/internal/util/concurrent/WeakConcurrentMap$LatentKey'
RDI=0x00000000d8253371 is an unknown value
R8 =0x00000006c13c4bb8 is an oop
java.util.RegularEnumSet 
 - klass: 'java/util/RegularEnumSet'
R9 =0x00000000d8278892 is an unknown value
R10=0x00000000d825fca8 is an unknown value
R11=0x0000000000000000 is an unknown value
R12=0x0000000000000000 is an unknown value
R13=0x0000000000000036 is an unknown value
R14=0x00007f407adf9ab0 is pointing into the stack for thread: 0x00007f4014009800
R15=0x00007f4014009800 is a thread

Stack: [0x00007f407a9fb000,0x00007f407adfc000],  sp=0x00007f407adf9ad0,  free space=4090k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J 76154 C2 ai.rapids.cudf.ColumnView.copyToHost(Lai/rapids/cudf/HostMemoryAllocator;)Lai/rapids/cudf/HostColumnVector; (1913 bytes) @ 0x00007f40df860439 [0x00007f40df85f7c0+0xc79]
J 72703 C1 ai.rapids.cudf.ColumnView.copyToHost()Lai/rapids/cudf/HostColumnVector; (121 bytes) @ 0x00007f40de927a0c [0x00007f40de927020+0x9ec]
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$6(Lai/rapids/cudf/ColumnVector;Lai/rapids/cudf/ColumnVector;)I+8
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$6$adapted(Lai/rapids/cudf/ColumnVector;Lai/rapids/cudf/ColumnVector;)Ljava/lang/Object;+6
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator$$Lambda$7297.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
J 18641 C2 com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object; (78 bytes) @ 0x00007f40d709695c [0x00007f40d70962a0+0x6bc]
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$mergeSortEnoughToOutput$2(Lcom/nvidia/spark/rapids/GpuOutOfCoreSortIterator;Lcom/nvidia/spark/rapids/SpillableColumnarBatch;Lcom/nvidia/spark/rapids/SpillableColumnarBatch;)Lscala/Tuple2;+91
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator$$Lambda$6322.apply(Ljava/lang/Object;)Ljava/lang/Object;+12
J 49116 C2 com.nvidia.spark.rapids.Arm$.closeOnExcept(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object; (100 bytes) @ 0x00007f40dc9c2a88 [0x00007f40dc9c2a20+0x68]
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.mergeSortEnoughToOutput()Lscala/Option;+140
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$next$5(Lcom/nvidia/spark/rapids/GpuOutOfCoreSortIterator;Lcom/nvidia/spark/rapids/NvtxWithMetrics;)Lorg/apache/spark/sql/vectorized/ColumnarBatch;+5
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator$$Lambda$6306.apply(Ljava/lang/Object;)Ljava/lang/Object;+8
J 18641 C2 com.nvidia.spark.rapids.Arm$.withResource(Ljava/lang/AutoCloseable;Lscala/Function1;)Ljava/lang/Object; (78 bytes) @ 0x00007f40d709695c [0x00007f40d70962a0+0x6bc]
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next()Lorg/apache/spark/sql/vectorized/ColumnarBatch;+195
j  com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next()Ljava/lang/Object;+5
j  org.apache.spark.sql.rapids.GpuFileFormatDataWriter.writeWithIterator(Lscala/collection/Iterator;)V+119
j  org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$executeTask$1(Lorg/apache/spark/sql/rapids/GpuFileFormatDataWriter;Lscala/collection/Iterator;)Lorg/apache/spark/sql/execution/datasources/WriteTaskResult;+6
j  org.apache.spark.sql.rapids.GpuFileFormatWriter$$$Lambda$8160.apply()Ljava/lang/Object;+8
j  org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Lscala/Function0;Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+4
j  org.apache.spark.sql.rapids.GpuFileFormatWriter$.executeTask(Lorg/apache/spark/sql/rapids/GpuWriteJobDescription;Ljava/lang/String;IIILorg/apache/spark/internal/io/FileCommitProtocol;Lscala/collection/Iterator;Lscala/Option;)Lorg/apache/spark/sql/execution/datasources/WriteTaskResult;+490
j  org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$write$14(Lorg/apache/spark/sql/rapids/GpuWriteJobDescription;Ljava/lang/String;Lorg/apache/spark/internal/io/FileCommitProtocol;Lscala/Option;Lorg/apache/spark/TaskContext;Lscala/collection/Iterator;)Lorg/apache/spark/sql/execution/datasources/WriteTaskResult;+62
j  org.apache.spark.sql.rapids.GpuFileFormatWriter$$$Lambda$8157.apply(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+24
J 61462 C2 org.apache.spark.scheduler.ResultTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object; (212 bytes) @ 0x00007f40d9448d24 [0x00007f40d94482c0+0xa64]
J 59735 C2 org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;ILscala/collection/immutable/Map;Lscala/Option;)Ljava/lang/Object; (519 bytes) @ 0x00007f40ddf68d5c [0x00007f40ddf65ba0+0x31bc]
J 59686 C2 org.apache.spark.executor.Executor$TaskRunner$$Lambda$5231.apply()Ljava/lang/Object; (12 bytes) @ 0x00007f40d5ee35d0 [0x00007f40d5ee3540+0x90]
J 43302 C2 org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object; (207 bytes) @ 0x00007f40d7ae0d08 [0x00007f40d7ae0cc0+0x48]
J 59567 C2 org.apache.spark.executor.Executor$TaskRunner.run()V (2942 bytes) @ 0x00007f40dde87550 [0x00007f40dde85640+0x1f10]
J 59454 C2 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @ 0x00007f40d655fe48 [0x00007f40d655fc20+0x228]
J 70615 C2 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 0x00007f40d5403d24 [0x00007f40d5403ce0+0x44]
J 67062 C2 java.lang.Thread.run()V (17 bytes) @ 0x00007f40d57f2cec [0x00007f40d57f2ca0+0x4c]
v  ~StubRoutines::call_stub
V  [libjvm.so+0x6a5c95]
V  [libjvm.so+0x6a33df]
V  [libjvm.so+0x6a39d4]
V  [libjvm.so+0x749970]
V  [libjvm.so+0xadd12f]
V  [libjvm.so+0xadd423]
V  [libjvm.so+0x97e8a0]

hs_err_pid144886.log

revans2 commented 2 months ago

I just hit this on a pre-merge test.

sameerz commented 2 months ago

Saw another instance of this failure in the nightly pipeline rapids_nightly-dev-github.