apache-spark-on-k8s / spark

Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
https://spark.apache.org/
Apache License 2.0

can't find snappy lib #504

Open luck02 opened 7 years ago

luck02 commented 7 years ago

I built off the 2.2 branch. It works great until it tries to write Parquet files.

Then it can't seem to initialize / find org.xerial.snappy.Snappy. I see snappy in the jars directory of the Spark distro. My endpoint is an s3a://bucket/... path, and I've included the following jars:

--jars local:/opt/jars/hadoop-aws-2.7.3.jar,local:/opt/jars/aws-java-sdk-1.7.4.jar,local:/opt/jars/joda-time-2.9.3.jar,local:/opt/jars/mysql-connector-java-5.1.35.jar \

I'm not sure whether this is something odd in the Spark distro or whether I'm doing something odd.

Thanks!

I can provide the full error log output if that'll help.

2017-09-22 22:51:57 INFO  DAGScheduler:54 - ResultStage 3 (parquet at NativeMethodAccessorImpl.java:0) failed in 4.996 s due to Job aborted due to stage failure: Task 8 in stage 3.0 failed 4 times, most recent failure: Lost task 8.3 in stage 3.0 (TID 434, 10.2.106.11, executor 7): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
foxish commented 7 years ago

Botched dependencies? Hmm..

@ash211 @mccheah seen this?

luck02 commented 7 years ago
garylucas@gary-lucas:~/dev/proj/infra/spark-2.2.0-k8s-0.4.0-SNAPSHOT-bin-2.7.3/jars|DAT-spark-on-k8s⚡
⇒  ls -alh | grep snappy
-rw-r--r--    1 garylucas  staff    48K Sep 16 20:54 snappy-0.2.jar
-rw-r--r--    1 garylucas  staff   1.0M Sep 16 20:54 snappy-java-1.1.2.6.jar

I think the snappy deps are there. I'll take a peek at the Dockerfiles. Edit: yep, they copy the whole jars directory.
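For what it's worth, one quick way to confirm the jars actually end up inside a built image (a sketch, reusing the executor image name from this job):

docker run --rm --entrypoint ls ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest /opt/spark/jars | grep snappy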

mccheah commented 7 years ago

Can you post the result of kubectl describe pod <driver> and kubectl describe pod <executor>?

luck02 commented 7 years ago

Driver output:

Name:       kmderived-1506122359203-driver
Namespace:  default
Node:       ip-10-253-50-77.ec2.internal/10.253.50.77
Start Time: Fri, 22 Sep 2017 16:19:25 -0700
Labels:     role=kissmetrics
        spark-app-selector=spark-e65d6a8c895f47eead96caad62edfc7d
        spark-role=driver
Annotations:    podpreset.admission.kubernetes.io/podpreset-kissmetrics-env=9039918
        spark-app-name=kmderived
Status:     Running
IP:     10.2.106.14
Containers:
  spark-kubernetes-driver:
    Container ID:   docker://872c49fbee70f53dc3a8163e89d0a80abe0a0fce6360223641359c3c7749b403
    Image:      ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest
    Image ID:       docker-pullable://ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics@sha256:41d872998dbbb28882f176e164add56fd22d40a76d27381bb5f243b5c9bf1aa3
    Port:       <none>
    State:      Running
      Started:      Fri, 22 Sep 2017 16:19:37 -0700
    Ready:      True
    Restart Count:  0
    Limits:
      memory:   15769Mi
    Requests:
      cpu:  8
      memory:   14Gi
    Environment:
      SPARK_DRIVER_MEMORY:              14G
      SPARK_DRIVER_CLASS:               org.apache.spark.deploy.PythonRunner
      SPARK_DRIVER_ARGS:                
      SPARK_MOUNTED_CLASSPATH:              /opt/jars/hadoop-aws-2.7.3.jar:/opt/jars/aws-java-sdk-1.7.4.jar:/opt/jars/joda-time-2.9.3.jar:/opt/jars/mysql-connector-java-5.1.35.jar:/opt/jars/snappy-java-1.1.4.jar
      PYSPARK_PRIMARY:                  /opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py
      PYSPARK_FILES:                    /opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py
      SPARK_JAVA_OPT_0:                 -Dspark.kubernetes.docker.image.pullPolicy=Always
      SPARK_JAVA_OPT_1:                 -Dspark.kubernetes.namespace=default
      SPARK_JAVA_OPT_2:                 -Dspark.app.id=spark-e65d6a8c895f47eead96caad62edfc7d
      SPARK_JAVA_OPT_3:                 -Dspark.kubernetes.driver.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest
      SPARK_JAVA_OPT_4:                 -Dspark.submit.deployMode=cluster
      SPARK_JAVA_OPT_5:                 -Dspark.driver.blockManager.port=7079
      SPARK_JAVA_OPT_6:                 -Dspark.app.name=kmderived
      SPARK_JAVA_OPT_7:                 -Dspark.driver.bindAddress=kmderived-1506122359203-driver-svc.default.svc.cluster.local
      SPARK_JAVA_OPT_8:                 -Dspark.driver.host=kmderived-1506122359203-driver-svc.default.svc.cluster.local
      SPARK_JAVA_OPT_9:                 -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account
      SPARK_JAVA_OPT_10:                -Dspark.driver.memory=14G
      SPARK_JAVA_OPT_11:                -Dspark.driver.port=7078
      SPARK_JAVA_OPT_12:                -Dspark.driver.cores=8
      SPARK_JAVA_OPT_13:                -Dspark.kubernetes.executor.labels=role=kissmetrics
      SPARK_JAVA_OPT_14:                -Dspark.executor.memory=20G
      SPARK_JAVA_OPT_15:                -Dspark.kubernetes.executor.podNamePrefix=kmderived-1506122359203
      SPARK_JAVA_OPT_16:                -Dspark.executor.instances=10
      SPARK_JAVA_OPT_17:                -Dspark.executor.cores=6
      SPARK_JAVA_OPT_18:                -Dspark.kubernetes.driver.labels=role=kissmetrics
      SPARK_JAVA_OPT_19:                -Dspark.files=/opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py,/opt/ub-data-kissmetrics/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py
      SPARK_JAVA_OPT_20:                -Dspark.master=k8s://http://127.0.0.1:8001
      SPARK_JAVA_OPT_21:                -Dspark.kubernetes.initcontainer.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-init-kissmetrics:latest
      SPARK_JAVA_OPT_22:                -Dspark.kubernetes.executor.docker.image=ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest
      SPARK_JAVA_OPT_23:                -Dspark.jars=/opt/jars/hadoop-aws-2.7.3.jar,/opt/jars/aws-java-sdk-1.7.4.jar,/opt/jars/joda-time-2.9.3.jar,/opt/jars/mysql-connector-java-5.1.35.jar,/opt/jars/snappy-java-1.1.4.jar
      SPARK_JAVA_OPT_24:                -Dspark.kubernetes.driver.pod.name=kmderived-1506122359203-driver
      REPORTSDB_HOSTNAME:               reports-db.unbounce.net
      REPORTSDB_PORT:                   3306
      REPORTSDB_KISSMETRICS_INBOUND_WAREHOUSE_NAME: kissmetrics_inbound
      KM_MANIFEST_FILE:                 s3a://manifest/index.csv
      REPORTSDB_KISSMETRICS_WAREHOUSE_NAME:     kissmetrics_warehouse
      MAX_FILES_TO_CONSUME:             250
      OUTPUT_LOCATION:                  s3a://output
      AWS_DATA_S3_ACCESS_KEY:               <set to the key 'AWS_DATA_S3_ACCESS_KEY' in secret 'kissmetrics-k8s'>               Optional: false
      AWS_DATA_S3_SECRET_ACCESS_KEY:            <set to the key 'AWS_DATA_S3_SECRET_ACCESS_KEY' in secret 'kissmetrics-k8s'>            Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL:  <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>  Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL:   <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME: <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME' in secret 'kissmetrics-k8s'> Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD: <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD' in secret 'kissmetrics-k8s'> Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME:   <set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD:   <set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL:   <set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_READ_JDBCURL: <set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL:    <set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL' in secret 'kissmetrics-k8s'>    Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from spark-k8s-service-account-token-fn4qg (ro)
Conditions:
  Type      Status
  Initialized   True 
  Ready     True 
  PodScheduled  True 
Volumes:
  spark-k8s-service-account-token-fn4qg:
    Type:   Secret (a volume populated by a Secret)
    SecretName: spark-k8s-service-account-token-fn4qg
    Optional:   false
QoS Class:  Burstable
Node-Selectors: <none>
Tolerations:    <none>
Events:
  FirstSeen LastSeen    Count   From                    SubObjectPath                   Type        Reason          Message
  --------- --------    -----   ----                    -------------                   --------    ------          -------
  47s       47s     1   default-scheduler                                   Normal      Scheduled       Successfully assigned kmderived-1506122359203-driver to ip-10-253-50-77.ec2.internal
  47s       47s     1   kubelet, ip-10-253-50-77.ec2.internal                           Normal      SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "spark-k8s-service-account-token-fn4qg" 
  46s       46s     1   kubelet, ip-10-253-50-77.ec2.internal   spec.containers{spark-kubernetes-driver}    Normal      Pulling         pulling image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest"
  35s       35s     1   kubelet, ip-10-253-50-77.ec2.internal   spec.containers{spark-kubernetes-driver}    Normal      Pulled          Successfully pulled image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-driver-py-kissmetrics:latest"
  35s       35s     1   kubelet, ip-10-253-50-77.ec2.internal   spec.containers{spark-kubernetes-driver}    Normal      Created         Created container
  35s       35s     1   kubelet, ip-10-253-50-77.ec2.internal   spec.containers{spark-kubernetes-driver}    Normal      Started         Started container

Executor output:

Name:       kmderived-1506122359203-exec-1
Namespace:  default
Node:       ip-10-253-46-101.ec2.internal/10.253.46.101
Start Time: Fri, 22 Sep 2017 16:19:41 -0700
Labels:     role=kissmetrics
        spark-app-selector=spark-e65d6a8c895f47eead96caad62edfc7d
        spark-exec-id=1
        spark-role=executor
Annotations:    podpreset.admission.kubernetes.io/podpreset-kissmetrics-env=9039918
Status:     Running
IP:     10.2.36.12
Controlled By:  Pod/kmderived-1506122359203-driver
Containers:
  executor:
    Container ID:   docker://67035573284ccc7cb11808afbc66df0ea28923222e8d1a77326d40669a410f4c
    Image:      ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest
    Image ID:       docker-pullable://ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics@sha256:9c1515b8c5150717835d5405e5f052163ec453ed56d865dae354e11dd83b542f
    Ports:      10000/TCP, 7079/TCP
    State:      Running
      Started:      Fri, 22 Sep 2017 16:19:42 -0700
    Ready:      True
    Restart Count:  0
    Limits:
      memory:   22Gi
    Requests:
      cpu:  6
      memory:   20Gi
    Environment:
      SPARK_EXECUTOR_PORT:              10000
      SPARK_DRIVER_URL:                 spark://CoarseGrainedScheduler@kmderived-1506122359203-driver-svc.default.svc.cluster.local:7078
      SPARK_EXECUTOR_CORES:             6
      SPARK_EXECUTOR_MEMORY:                20G
      SPARK_APPLICATION_ID:             spark-e65d6a8c895f47eead96caad62edfc7d
      SPARK_EXECUTOR_ID:                1
      SPARK_MOUNTED_CLASSPATH:              /var/spark-data/spark-jars/*
      SPARK_EXECUTOR_POD_IP:                 (v1:status.podIP)
      REPORTSDB_HOSTNAME:               reports-db.unbounce.net
      REPORTSDB_PORT:                   3306
      REPORTSDB_KISSMETRICS_INBOUND_WAREHOUSE_NAME: kissmetrics_inbound
      KM_MANIFEST_FILE:                 s3a://manfiest/index.csv
      REPORTSDB_KISSMETRICS_WAREHOUSE_NAME:     kissmetrics_warehouse
      MAX_FILES_TO_CONSUME:             250
      OUTPUT_LOCATION:                  s3a://output
      AWS_DATA_S3_ACCESS_KEY:               <set to the key 'AWS_DATA_S3_ACCESS_KEY' in secret 'kissmetrics-k8s'>               Optional: false
      AWS_DATA_S3_SECRET_ACCESS_KEY:            <set to the key 'AWS_DATA_S3_SECRET_ACCESS_KEY' in secret 'kissmetrics-k8s'>            Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL:  <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>  Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL:   <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_READ_JDBCURL' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME: <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_USERNAME' in secret 'kissmetrics-k8s'> Optional: false
      REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD: <set to the key 'REPORTSDB_KISSMETRICS_INBOUND_WRITE_PASSWORD' in secret 'kissmetrics-k8s'> Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME:   <set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_USERNAME' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD:   <set to the key 'REPORTSDB_KISSMETRICS_WAREHOUSE_WRITE_PASSWORD' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL:   <set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_KISSMETRICS_WAREHOUSE_READ_JDBCURL: <set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_WRITE_JDBCURL' in secret 'kissmetrics-k8s'>   Optional: false
      REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL:    <set to the key 'REPORTSDB_PROD_APPLICATIONDBUSER_READ_JDBCURL' in secret 'kissmetrics-k8s'>    Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-00b84 (ro)
Conditions:
  Type      Status
  Initialized   True 
  Ready     True 
  PodScheduled  True 
Volumes:
  default-token-00b84:
    Type:   Secret (a volume populated by a Secret)
    SecretName: default-token-00b84
    Optional:   false
QoS Class:  Burstable
Node-Selectors: <none>
Tolerations:    <none>
Events:
  FirstSeen LastSeen    Count   From                    SubObjectPath           Type        Reason          Message
  --------- --------    -----   ----                    -------------           --------    ------          -------
  50s       50s     1   default-scheduler                           Normal      Scheduled       Successfully assigned kmderived-1506122359203-exec-1 to ip-10-253-46-101.ec2.internal
  50s       50s     1   kubelet, ip-10-253-46-101.ec2.internal                  Normal      SuccessfulMountVolume   MountVolume.SetUp succeeded for volume "default-token-00b84" 
  49s       49s     1   kubelet, ip-10-253-46-101.ec2.internal  spec.containers{executor}   Normal      Pulling         pulling image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest"
  49s       49s     1   kubelet, ip-10-253-46-101.ec2.internal  spec.containers{executor}   Normal      Pulled          Successfully pulled image "ecr_id.dkr.ecr.us-east-1.amazonaws.com/spark-executor-py-kissmetrics:latest"
  49s       49s     1   kubelet, ip-10-253-46-101.ec2.internal  spec.containers{executor}   Normal      Created         Created container
  49s       49s     1   kubelet, ip-10-253-46-101.ec2.internal  spec.containers{executor}   Normal      Started         Started container
ash211 commented 7 years ago

Hmm, I haven't seen this, and we've been reading/writing Parquet with Spark on k8s. Have you confirmed that the class org.xerial.snappy.Snappy is in that snappy-java jar?
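One quick way to check, from the jars/ directory of the distribution (a sketch; the grep patterns are just illustrative):

unzip -l snappy-java-1.1.2.6.jar | grep 'org/xerial/snappy/Snappy.class'
# or, with a JDK on the PATH:
jar tf snappy-java-1.1.2.6.jar | grep 'Snappy.class'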

luck02 commented 7 years ago

@ash211 you're right, it's not there.

META-INF/MANIFEST.MF META-INF/maven/org.iq80.snappy/snappy/pom.properties META-INF/maven/org.iq80.snappy/snappy/pom.xml org/iq80/snappy/BufferRecycler.class org/iq80/snappy/CorruptionException.class org/iq80/snappy/Crc32C.class org/iq80/snappy/HadoopSnappyCodec$SnappyCompressionInputStream.class org/iq80/snappy/HadoopSnappyCodec$SnappyCompressionOutputStream.class org/iq80/snappy/HadoopSnappyCodec.class org/iq80/snappy/Main.class org/iq80/snappy/Memory.class org/iq80/snappy/SlowMemory.class org/iq80/snappy/Snappy.class org/iq80/snappy/SnappyCompressor.class org/iq80/snappy/SnappyDecompressor.class org/iq80/snappy/SnappyInputStream.class org/iq80/snappy/SnappyInternalUtils.class org/iq80/snappy/SnappyOutputStream.class org/iq80/snappy/UnsafeMemory.class

Missing / broken snappy jar.

I built this off the tip of the 2.2 branch:

garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  git rev-parse HEAD
8a0f485d83f84a9b556f7634fed8cb1b65e12159

So either my build command was incorrect, or there's some issue in the branch.

Here was my build command:

garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  git rev-parse HEAD
8a0f485d83f84a9b556f7634fed8cb1b65e12159
garylucas@gary-lucas:~/dev/spark|branch-2.2-kubernetes⚡
⇒  ./dev/make-distribution.sh --pip --tgz -Pmesos -Pyarn -Pkinesis-asl -Phive -Phive-thriftserver -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.7.3

I got that from: https://github.com/apache-spark-on-k8s/spark/issues/438

The reason I'm building my own is listed here: https://github.com/apache-spark-on-k8s/spark/issues/494

The other issue I'm experiencing on this build is https://github.com/apache-spark-on-k8s/spark/issues/505, which I'm currently trying to debug.

luck02 commented 7 years ago

Sorry, I didn't realize there were two snappy jars in there and inspected the wrong one (i.e. snappy-0.2.jar).

This appears to be correct: snappy-java-1.1.2.6.jar

META-INF/MANIFEST.MF org/xerial/snappy/OSInfo.class org/xerial/snappy/PureJavaCrc32C.class org/xerial/snappy/Snappy.class org/xerial/snappy/SnappyBundleActivator.class org/xerial/snappy/SnappyCodec.class org/xerial/snappy/SnappyError.class org/xerial/snappy/SnappyErrorCode.class org/xerial/snappy/SnappyException.class org/xerial/snappy/SnappyFramed.class org/xerial/snappy/SnappyFramedInputStream$FrameAction.class org/xerial/snappy/SnappyFramedInputStream$FrameData.class org/xerial/snappy/SnappyFramedInputStream$FrameMetaData.class org/xerial/snappy/SnappyFramedInputStream.class org/xerial/snappy/SnappyFramedOutputStream.class org/xerial/snappy/SnappyIOException.class org/xerial/snappy/SnappyInputStream.class org/xerial/snappy/SnappyLoader.class org/xerial/snappy/SnappyNative.class org/xerial/snappy/SnappyOutputStream.class org/xerial/snappy/VERSION org/xerial/snappy/buffer/BufferAllocator.class org/xerial/snappy/buffer/BufferAllocatorFactory.class org/xerial/snappy/buffer/CachedBufferAllocator$1.class org/xerial/snappy/buffer/CachedBufferAllocator.class org/xerial/snappy/buffer/DefaultBufferAllocator$1.class org/xerial/snappy/buffer/DefaultBufferAllocator.class org/xerial/snappy/native/AIX/ppc/libsnappyjava.a org/xerial/snappy/native/AIX/ppc64/libsnappyjava.a org/xerial/snappy/native/Linux/aarch64/libsnappyjava.so org/xerial/snappy/native/Linux/arm/libsnappyjava.so org/xerial/snappy/native/Linux/armhf/libsnappyjava.so org/xerial/snappy/native/Linux/ppc64/libsnappyjava.so org/xerial/snappy/native/Linux/ppc64le/libsnappyjava.so org/xerial/snappy/native/Linux/s390x/libsnappyjava.so org/xerial/snappy/native/Linux/x86/libsnappyjava.so org/xerial/snappy/native/Linux/x86_64/libsnappyjava.so org/xerial/snappy/native/Mac/x86/libsnappyjava.jnilib org/xerial/snappy/native/Mac/x86_64/libsnappyjava.jnilib org/xerial/snappy/native/README org/xerial/snappy/native/SunOS/sparc/libsnappyjava.so org/xerial/snappy/native/SunOS/x86/libsnappyjava.so org/xerial/snappy/native/SunOS/x86_64/libsnappyjava.so org/xerial/snappy/native/Windows/x86/snappyjava.dll org/xerial/snappy/native/Windows/x86_64/snappyjava.dll

luck02 commented 7 years ago

Looks like everything is present on the executors:

garylucas@gary-lucas:~/dev/ub-data-kissmetrics|DAT-spark-on-k8s⚡
⇒  kubectl exec -it kmderived-1506306351957-exec-3 /bin/bash
bash-4.3# ls
Compute__Derived_NonAppPageViewed.py  hadoop-aws-2.7.3.jar  mysql-connector-java-5.1.35.jar
aws-java-sdk-1.7.4.jar                joda-time-2.9.3.jar   snappy-java-1.1.2.6.jar
bash-4.3# cd /opt/spark/
RELEASE/  bin/  conf/  examples/  jars/  python/  sbin/  work-dir/
bash-4.3# cd /opt/spark/jars/
bash-4.3# ls -alh | grep snappy
-rw-r--r--   18 root  root   47.6K Sep 17 03:54 snappy-0.2.jar
-rw-r--r--   18 root  root    1.0M Sep 17 03:54 snappy-java-1.1.2.6.jar
bash-4.3#

luck02 commented 7 years ago

Huh, also, when I expose the Spark UI I see these classpath entries in the Environment tab:

/opt/spark/jars/snappy-0.2.jar | System Classpath
/opt/spark/jars/snappy-java-1.1.2.6.jar | System Classpath
luck02 commented 7 years ago

I upgraded to the recent build (https://github.com/apache-spark-on-k8s/spark/releases/tag/v2.2.0-kubernetes-0.4.0) in the hopes that my personal build was somehow hooped.

Same results:

3.535 s due to Job aborted due to stage failure: Task 28 in stage 3.0 failed 4 times, most recent failure: Lost task 28.3 in stage 3.0 (TID 363, 10.2.88.8, executor 13): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
    at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
    at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
    at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
    at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
    at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
    at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:109)
    at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
    at 

I'm going through the executor logs now, but that'll take some time. Wondering if anyone else has the same issue.

mccheah commented 7 years ago

What's the nature of the other libraries on the classpath? Are there any jars with shaded dependencies or large jars that contain library classes like Snappy?

mccheah commented 7 years ago

Can you try running a Java or Scala job instead and see if there's similar behavior?

luck02 commented 7 years ago

Good thoughts, here are my jars:

      --jars 
local:/opt/jars/hadoop-aws-2.7.3.jar,
local:/opt/jars/aws-java-sdk-1.7.4.jar,
local:/opt/jars/joda-time-2.9.3.jar,
local:/opt/jars/mysql-connector-java-5.1.35.jar \

I'll introspect them later this evening and also test with a simple raw Scala job.

luck02 commented 7 years ago

Ok, so no mention of snappy in those jars.

I submitted a simple Scala-based Spark job that consumes a Parquet file from S3 that I know is valid, samples it down to 10% of the original, and writes it back as a new Parquet file.

I included all the jars I've been using, mentioned in the comment above (though I'm not running any queries against MySQL or using joda-time).

No sign of the problems I'm seeing above in PySpark; read/write works.

Next experiment: port the sample job over to PySpark and repeat.

luck02 commented 7 years ago

So this just gets weirder and weirder.

Both test jobs (Scala and PySpark) work, but my actual original job still fails. At this stage my theories are:

  1. The mysql dependency (which is used in my job) is causing issues.
  2. I'm doing something weird with my script and overwriting something in the wrong namespace somewhere.

Here's my spark-submit script:

#!/usr/bin/env bash

PATH_TO_DRIVER="$PWD/jobs/kissmetrics_warehouse/Derived_NonAppPageViewed/Compute__Derived_NonAppPageViewed.py"

  ./infra/spark-2.2.0-k8s-0.4.0-bin-2.7.3/bin/spark-submit \
      --verbose \
      --deploy-mode cluster \
      --master k8s://http://127.0.0.1:8001 \
      --jars file://"$PWD/jars/hadoop-aws-2.7.3.jar",file://"$PWD/jars/aws-java-sdk-1.7.4.jar",file://"$PWD/jars/joda-time-2.9.3.jar",file://"$PWD/jars/mysql-connector-java-5.1.35.jar"\
      --kubernetes-namespace default \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-k8s-service-account \
      --conf spark.kubernetes.docker.image.pullPolicy=IfNotPresent \
      --conf spark.driver.memory=8G \
      --conf spark.driver.cores=4 \
      --conf spark.executor.instances=10 \
      --conf spark.executor.cores=4 \
      --conf spark.executor.memory=8G \
      --conf spark.app.name=pytest-parquet \
      --conf spark.kubernetes.driver.label.role=kissmetrics \
      --conf spark.kubernetes.executor.label.role=kissmetrics \
      --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.initcontainer.docker.image=kubespark/spark-init:v2.2.0-kubernetes-0.4.0 \
      --conf spark.kubernetes.resourceStagingServer.internal.uri=http://spark-resource-staging-service:10000 \
      --conf spark.kubernetes.resourceStagingServer.uri=http://rss-url-numbers.us-east-1.elb.amazonaws.com:10000 \
      file://$PATH_TO_DRIVER

The test scripts consumed a small-ish Parquet file, sampled it down, and produced a smaller file. This job is trying to write about 10 GB, it looks like; i.e. not tiny, but hardly huge.

Next I'm probably going to distill it down to a minimal problem and run that; right now it takes a while to iterate. I'll likely replace the current script with: 1) hit the MySQL db and get a bunch of rows, 2) do some simple aggregation and filtering, 3) write to Parquet.

That's essentially what I'm doing now, but a 5-10 minute turnaround to failure is a bit slow. On the bright side, I'm now using the base images and don't have to build and push to ECR every darn time.

The error still seems to boil down to:

Suppressed: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy

Here's a subset of the logs I'm seeing (the whole trace is huge):

Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: 2017-10-02 05:32:47 INFO  InternalParquetRecordWriter:160 - Flushing mem columnStore to file. allocated memory: 2172814
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor:   } ]
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor: }
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor: and corresponding Parquet message type:
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor: message spark_schema {
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-7_default_executor:   required binary Fact_PageViewedEventId (UTF8);
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: 2017-10-02 05:32:47 ERROR Utils:91 - Aborting task
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor:    at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-8_default_executor:    at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)

Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  SQLHadoopMapReduceCommitProtocol:54 - Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  FileOutputCommitter:108 - File Output Committer Algorithm version is 1
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  SQLHadoopMapReduceCommitProtocol:54 - Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  CodecConfig:95 - Compression: SNAPPY
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  CodecConfig:95 - Compression: SNAPPY
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:329 - Parquet block size to 134217728
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:330 - Parquet page size to 1048576
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:331 - Parquet dictionary page size to 1048576
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:332 - Dictionary is on
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:333 - Validation is off
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:334 - Writer version is: PARQUET_1_0
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:335 - Maximum row group padding size is 0 bytes
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:336 - Page size checking is: estimated
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:337 - Min row count for page size check is: 100
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetOutputFormat:338 - Max row count for page size check is: 10000
Oct 01 22:32:47 data-cluster-slow-test k8s.pytest-parquet-1506921643054-exec-5_default_executor: 2017-10-02 05:32:47 INFO  ParquetWriteSupport:54 - Initialized Parquet WriteSupport with Catalyst schema:

Oct 01 22:32:47 :   at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
Oct 01 22:32:47 :   at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
Oct 01 22:32:47 :   at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:109)
Oct 01 22:32:47 :   at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:203)
Oct 01 22:32:47 :   at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:467)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$8.apply(ParquetWriteSupport.scala:163)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$makeWriter$8.apply(ParquetWriteSupport.scala:162)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields$1.apply$mcV$sp(ParquetWriteSupport.scala:124)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$consumeField(ParquetWriteSupport.scala:437)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.org$apache$spark$sql$execution$datasources$parquet$ParquetWriteSupport$$writeFields(ParquetWriteSupport.scala:123)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$write$1.apply$mcV$sp(ParquetWriteSupport.scala:114)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:425)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:113)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:51)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.write(ParquetOutputWriter.scala:40)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:327)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
Oct 01 22:32:47 :   at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
Oct 01 22:32:47 :   ... 8 more
Oct 01 22:32:47 :   Suppressed: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 :       at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
Oct 01 22:32:47 :       at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
Oct 01 22:32:47 :       at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
Oct 01 22:32:47 :       at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
Oct 01 22:32:47 :       at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
Oct 01 22:32:47 :       at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
Oct 01 22:32:47 :       at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
Oct 01 22:32:47 :       at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
Oct 01 22:32:47 :       at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
Oct 01 22:32:47 :       at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
Oct 01 22:32:47 :       at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
Oct 01 22:32:47 :       at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
Oct 01 22:32:47 :       at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:337)
Oct 01 22:32:47 :       at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:264)
Oct 01 22:32:47 :       at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1384)
Oct 01 22:32:47 :       ... 9 more
Oct 01 22:32:47 : 2017-10-02 05:32:47 WARN  FileOutputCommitter:569 - Could not delete s3a://bucketname/kissmetrics_warehouse/dev/gary/out/Derived_NonAppPageViewed.parquet/_temporary/0/_temporary/attempt_20171002053247_0003_m_000025_1
Oct 01 22:32:47 : 2017-10-02 05:32:47 ERROR FileFormatWriter:70 - Job job_20171002053247_0003 aborted.
Oct 01 22:32:47 : 2017-10-02 05:32:47 WARN  Utils:87 - Suppressing exception in catch: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 : java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.codec.SnappyCompressor.compress(SnappyCompressor.java:67)
Oct 01 22:32:47 :   at org.apache.hadoop.io.compress.CompressorStream.compress(CompressorStream.java:81)
Oct 01 22:32:47 :   at org.apache.hadoop.io.compress.CompressorStream.finish(CompressorStream.java:92)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.CodecFactory$BytesCompressor.compress(CodecFactory.java:112)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:93)
Oct 01 22:32:47 :   at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:150)
Oct 01 22:32:47 :   at org.apache.parquet.column.impl.ColumnWriterV1.flush(ColumnWriterV1.java:238)
Oct 01 22:32:47 :   at org.apache.parquet.column.impl.ColumnWriteStoreV1.flush(ColumnWriteStoreV1.java:121)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:167)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:109)
Oct 01 22:32:47 :   at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:337)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:264)
Oct 01 22:32:47 :   at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1384)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
Oct 01 22:32:47 :   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
Oct 01 22:32:47 :   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
Oct 01 22:32:47 :   at org.apache.spark.scheduler.Task.run(Task.scala:108)
Oct 01 22:32:47 :   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
Oct 01 22:32:47 :   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Oct 01 22:32:47 :   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Oct 01 22:32:47 :   at java.lang.Thread.run(Thread.java:748)
luck02 commented 7 years ago

Does anyone know if Snappy is used for all parquet writes to S3? The test writes were quite a bit smaller.

mccheah commented 7 years ago

One should be able to inspect the written files to know if they're compressed with Snappy or not. My recollection is that they are always compressed.
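For example (a sketch; the paths are placeholders and it assumes the aws CLI and parquet-tools are available locally):

# Spark typically names snappy-compressed output like part-*.snappy.parquet,
# so a listing is often enough to tell:
aws s3 ls s3://<bucket>/<output-prefix>/
# For a definitive answer, pull one part file down and inspect its metadata;
# the column chunks will show SNAPPY if that codec was used:
aws s3 cp s3://<bucket>/<output-prefix>/<part-file> /tmp/part.parquet
parquet-tools meta /tmp/part.parquet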

luck02 commented 7 years ago

Just tested a write that works; I got rid of the sample(true, 0.1).

Looked in the executor logs:

Oct 02 23:01:43 data-cluster-slow-test k8s.pytest-parquet-1507010416342-exec-1_default_executor: 2017-10-03 06:01:05 INFO  CodecConfig:95 - Compression: SNAPPY

So this is just weird.

akakitani commented 7 years ago

@luck02 I ran into this as well and found this in the stack trace:

java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.2-6e299ad6-46dd-4ff3-aaac-112b4383ccf8-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.2-6e299ad6-46dd-4ff3-aaac-112b4383ccf8-libsnappyjava.so)
  at java.lang.ClassLoader$NativeLibrary.load(Native Method)
  at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
  at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
  at java.lang.Runtime.load0(Runtime.java:809)
  at java.lang.System.load(System.java:1086)
  at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:174)
  at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:152)
  at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
  ... 48 elided

It seems like this is happening because the Spark Docker images are based on Alpine, which uses musl libc. I tried adding the libc6-compat apk package to custom images for the driver/executor, and Snappy seems to work fine now.
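For anyone who wants to try the same workaround, a minimal sketch of such a custom image (the output registry/tag is a placeholder; the same change would apply to the executor and init-container images):

cat > Dockerfile.driver <<'EOF'
FROM kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0
# Alpine's musl lacks ld-linux-x86-64.so.2; libc6-compat provides the glibc
# compatibility shim that snappy-java's bundled libsnappyjava.so needs
RUN apk add --no-cache libc6-compat
EOF
docker build -f Dockerfile.driver -t <your-registry>/spark-driver-py:v2.2.0-kubernetes-0.4.0-libc6 .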

ifilonenko commented 7 years ago

Thank you for this investigation @akakitani !!!!!

luck02 commented 7 years ago

That's awesome, @akakitani; where did you find that? In the executor logs? I'll reproduce. The logs are super verbose, so it's hard to find useful information.

akakitani commented 7 years ago

This was in the driver logs actually, and it was when trying to read some Parquet files. It has worked so far for me on writes too 👍

ifilonenko commented 7 years ago

@mccheah thoughts on modifying the Docker images to include libc6-compat, if this is in reality a fix? Is this one of the extra packages you guys are using in your customized Docker images?

ash211 commented 7 years ago

We install https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.25-r0/glibc-2.25-r0.apk which might be similar?
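If that's the route taken, a rough sketch of baking that package into one of the stock Alpine images (untested here; the sgerrand releases normally also ship a signing key, which --allow-untrusted sidesteps):

cat > Dockerfile.glibc <<'EOF'
FROM kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0
# Pull the prebuilt glibc package and install it into the Alpine base
ADD https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.25-r0/glibc-2.25-r0.apk /tmp/glibc-2.25-r0.apk
RUN apk add --allow-untrusted /tmp/glibc-2.25-r0.apk
EOF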

luck02 commented 7 years ago

Found it in our trace as well:

Oct 04 21:38:00 data-cluster-slow-test k8s.pytest-parquet-1507176181001-exec-2_default_executor: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.2-2506cd9c-0dc3-49eb-958a-a032e8b7b784-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.2-2506cd9c-0dc3-49eb-958a-a032e8b7b784-libsnappyjava.so)
Oct 04 21:38:00 data-cluster-slow-test k8s.pytest-parquet-1507176181001-exec-2_default_executor:    at java.lang.ClassLoader$NativeLibrary.load(Native Method)

Ugh, it only appears once in about 20k lines of logs. Wow, needle in a haystack. I don't have time to complete the experiment this evening by adding the package to the images.
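(If it helps anyone else hunting for it, something like this narrows the verbose logs down; the pod name is a placeholder:)

kubectl logs <executor-pod> | grep -B 1 -A 5 UnsatisfiedLinkError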

Great catch, @akakitani! 🎉

echarles commented 6 years ago

A fix is proposed in #550