@dkazakevich thanks. A bit strange; are you sure the same code works for CSV and fails for Parquet? The exception in the log is "Could not initialize class sun.security.ssl.SSLSessionImpl", which is related to SSL and has nothing to do with Stocator. What Stocator version (and branch) are you using?
@dkazakevich a short update: I tested your code on Spark 2.3 and the same code works perfectly for me. So I need more details (see my previous message) in order to proceed with this.
@gilv Thank you for the response. It works for JSON but fails for Parquet. I haven't checked CSV; I will try it tomorrow. I tried the 1.0.21-ibm-sdk, 1.0.22-ibm-sdk and 1.0.23-Snapshot-ibm-sdk branches. Also, this works in Spark client/local mode, but fails in cluster mode with a master and executor(s). I use local minikube and Bluemix k8s clusters to submit this Spark job.
@dkazakevich I want to be sure I understand correctly. Is the following correct?
Do you use exactly the same code and the same path when accessing JSON and Parquet? SSL issues can't be related to file types, so I'm trying to figure out what else is different in your code.
@gilv, yes, you are correct. The behavior is the same:
- it also works fine using a local Spark cluster
- I get the same exception using either minikube or Bluemix k8s:
Caused by: java.io.IOException: saving output test-json-rnd-values/part-00001-67f2343b-5c52-43c1-8266-9c1d617aeb50-c000-attempt_20180813152227_0000_m_000001_3.snappy.parquet com.ibm.stocator.thirdparty.cos.AmazonClientException: Unable to complete transfer: Could not initialize class sun.security.ssl.SSLSessionImpl
  at com.ibm.stocator.fs.cos.COSOutputStream.close(COSOutputStream.java:173)
  at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
  at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
  at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:639)
  at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:117)
  at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:163)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:42)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:396)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  ... 8 more
yes, but there is one warning:
2018-08-13 15:20:10 WARN COSAPIClient:532 - Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 72facbe1-ec9a-4316-a0b9-1f13ac8b948c)
@dkazakevich @bahdzevich thanks. What JDK version are you using? I also wonder, what is the value of "fs.cos.service.endpoint"?
@gilv We are using JDK 1.8.0_172
fs.cos.service.endpoint=s3-api.dal-us-geo.objectstorage.softlayer.net
We are using the same code and the same COS path for Parquet and JSON; we just changed rows.write().mode(SaveMode.Overwrite).parquet(path) to rows.write().mode(SaveMode.Overwrite).json(path). It also looks strange to us, and we don't know what the source of the problem is.
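For clarity, here is a minimal, self-contained sketch of the comparison; the bucket/service name and paths are placeholders, not our actual configuration:

import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteComparison {
  def main(args: Array[String]): Unit = {
    // assumes the Stocator/COS configuration is supplied via spark-defaults or hadoopConfiguration
    val spark = SparkSession.builder.appName("cos-write-comparison").getOrCreate()
    import spark.implicits._

    // small sample DataFrame, similar to the two-record case described above
    val rows = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // works fine against the same bucket and endpoint
    rows.write.mode(SaveMode.Overwrite).json("cos://mybucket.service/test-json")

    // fails in cluster mode with the SSL exception shown above
    rows.write.mode(SaveMode.Overwrite).parquet("cos://mybucket.service/test-parquet")
  }
}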
@dkazakevich have you installed or do you need to install a cert on Kubernetes?
@paul-carron I haven't installed certs on the minikube or Bluemix k8s clusters. I also don't know whether a cert is needed to put data into COS.
I only put the provided k8s cluster config YAML file, with the cluster certificate-authority .pem file, into the local ~/.kube directory.
I also created a Spark k8s serviceaccount and clusterrolebinding and use them to run the Spark job as described here: https://spark.apache.org/docs/2.3.0/running-on-kubernetes.html#rbac
@dkazakevich can you read some existing Parquet file from COS on this cluster?
@gilv reading a Parquet file from COS and writing it back into COS works fine.
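A rough sketch of that round trip, with hypothetical paths (not our real bucket or object names):

import org.apache.spark.sql.SparkSession

object CosRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("cos-roundtrip").getOrCreate()
    // reading an existing Parquet object from COS succeeds...
    val existing = spark.read.parquet("cos://mybucket.service/existing-data.parquet")
    // ...and writing it back to COS as Parquet also succeeds on this cluster
    existing.write.parquet("cos://mybucket.service/existing-data-copy.parquet")
  }
}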
@dkazakevich so everything works, except when you create a two-record Parquet file in Spark and write it into COS? This sounds a bit strange, and it fails with SSL. I wonder if you can experiment with some other ways to create Parquet files in Spark. Try the Scala shell and write something like:
import spark.implicits._
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.format("parquet").save("cos://yourbucket.myCOS/data.parquet")
Will this work?
@gilv Initially we faced the problem while reading data from DB2 and writing it into COS. The two-record example is a simplified version that reproduces the same problem we have with the DB2 source (a rough sketch of that flow is below).
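For reference, a sketch of that original flow; the DB2 JDBC URL, table name, credentials and cos:// destination are hypothetical placeholders, not our real values:

import org.apache.spark.sql.{SaveMode, SparkSession}

object Db2ToCos {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("db2-to-cos").getOrCreate()

    // hypothetical DB2 source, read over JDBC
    val rows = spark.read
      .format("jdbc")
      .option("url", "jdbc:db2://db2-host:50000/SAMPLE")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "MYSCHEMA.MYTABLE")
      .option("user", "***")
      .option("password", "***")
      .load()

    // writing the result as Parquet to COS is where the failure shows up in cluster mode
    rows.write.mode(SaveMode.Overwrite).parquet("cos://mybucket.service/db2-export")
  }
}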
I think the Scala shell example runs the Spark job in local/client mode, which executes everything on a single node. As I described above, we also don't have problems in Spark local/client mode. The problem occurs in Spark cluster mode, which runs the job using a master and one or more executors (multiple nodes).
@dkazakevich you can connect spark-shell to the existing cluster by using --master spark://master_host:master_port
@gilv Since we don't have a k8s Spark cluster with a running master and executors (which spark-shell requires), and we use the spark-submit tool, which automatically creates the master and the indicated number of executors for a Spark job (a Spark 2.3 feature for k8s), is it ok to create a Scala application, generate a .jar file for spark-submit, and run it for the above experiment?
@dkazakevich I think it's the same to create a jar with Scala and use spark-submit.
@gilv I tried the experiment and got the same exception:
import org.apache.spark.sql.SparkSession

object App {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder.appName("COS spark").getOrCreate()
    import sparkSession.sqlContext.implicits._
    // Stocator / COS configuration
    sparkSession.sparkContext.hadoopConfiguration.set("fs.stocator.scheme.list", "cos")
    sparkSession.sparkContext.hadoopConfiguration.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
    sparkSession.sparkContext.hadoopConfiguration.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
    sparkSession.sparkContext.hadoopConfiguration.set("fs.stocator.cos.scheme", "cos")
    sparkSession.sparkContext.hadoopConfiguration.set("fs.cos.service.iam.api.key", "***")
    sparkSession.sparkContext.hadoopConfiguration.set("fs.cos.service.iam.service.id", "***")
    sparkSession.sparkContext.hadoopConfiguration.set("fs.cos.service.endpoint", "s3-api.us-geo.objectstorage.softlayer.net")
    // build a small DataFrame and write it to COS as Parquet
    val squaresDF = sparkSession.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.format("parquet").save("cos://mpw-plants.service/data.parquet")
  }
}
Also, FYI: I'm not sure whether this could be the cause of the problem, but I tried to load data from some DB2 databases and write Parquet into COS, and found out that:
@dkazakevich if you can't write a simple Parquet file with this Scala example, then I guess it's not related to DB2. The exception you are getting is related to an SSL connection, yet you can read from COS and write JSON without hitting SSL issues.
@dkazakevich please also open a ticket with IBM Bluemix Support, and link them to this issue that you opened against Stocator. You should see Support in console.bluemix.net, where you can create a ticket.
@dkazakevich At the moment I don't believe this is an issue with the Java SDK, as I'm unable to recreate it in a standalone or cluster environment. My hunch at the minute is that it's some sort of cert issue between Spark and Kubernetes, although you haven't installed certs on your minikube cluster. Unfortunately I'm not familiar with Kubernetes, so I don't know what might be required. It might be worth having somebody with Kubernetes knowledge look at this.
Created a sample Spark 2.3 application that runs in a k8s cluster. The application creates a sample Dataset and tries to put it into Bluemix COS using Stocator:
But I'm getting an exception:
Writing the same data to a COS JSON file works fine: