CODAIT / stocator

Stocator is a high-performing connector to object storage for Apache Spark, achieving performance by leveraging object storage semantics.
Apache License 2.0

Running into issues when trying to install and use stocator with pyspark #224

Closed: ssdag111 closed this issue 4 years ago

ssdag111 commented 5 years ago

So I'm trying to use stocator with pyspark (for reading and writing files to a bucket on IBM Cloud Object Storage) but am running into some issues on a Linux system.

Versions for everything:

stocator: 1.0.36 (IBM-SDK)
pyspark: spark-2.4.4
Python: 3.7.4

I run pyspark with the following commands:

pyspark --jars stocator-1.0.36-SNAPSHOT-IBM-SDK.jar

or

pyspark --jars /_path to full location of_/stocator-1.0.36-SNAPSHOT-IBM-SDK.jar

I've added the following entries in spark-defaults.conf within the spark-2.4.4 folder in /opt/:

spark.driver.extraClassPath = /home/_path to_ /stocator-1.0.36-SNAPSHOT-ibm-sdk/target/stocator-1.0.36-SNAPSHOT-IBM-SDK.jar
spark.executor.extraClassPath = /home/_path to_/stocator-1.0.36-SNAPSHOT-ibm-sdk/target/stocator-1.0.36-SNAPSHOT-IBM-SDK.jar

I've also created a core-site.xml file in the same location:

<configuration>
<property>
<name>fs.cos.impl</name>
<value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
</property>
<property>
<name>fs.stocator.cos.impl</name>
<value>com.ibm.stocator.fs.cos.COSAPIClient</value>
</property>
<property>
<name>fs.cos.mycos.iam.api.key</name>
<value>MY API KEY</value>
</property>
<property>
<!-- Open https://cos-service.bluemix.net/endpoints and choose the relevant endpoint. You can also obtain this value from the "location" section of the IBM Cloud dashboard. -->
<name>fs.cos.mycos.endpoint</name>
<value>MY ENDPOINT</value>
</property>
<property>
<name>fs.cos.mycos.iam.service.id</name>
<value>ServiceId-XXXXXXXX</value>
</property>
</configuration>

As per the instructions here: https://developer.ibm.com/code/2018/08/16/installing-running-stocator-apache-spark-ibm-cloud-object-storage/

I want to use the "cos://" scheme as highlighted in the post above, with the test code given below:

rdd = sc.textFile('file:///home/path to text file.txt')
abj = "cos://name_of_bucket/1.txt"
rdd.saveAsTextFile(abj)

I get the following error:

Py4JJavaError: An error occurred while calling o45.saveAsTextFile.
: java.io.IOException: No FileSystem for scheme: cos
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:55)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1066)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I also tried "swift://" instead of "cos://" and I get:

Py4JJavaError: An error occurred while calling o101.saveAsTextFile.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:55)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1066)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    ... 42 more

For "swift2d://" I get:

Py4JJavaError: An error occurred while calling o158.saveAsTextFile. : java.io.IOException: No FileSystem for scheme: swift2d

For "s3://" and "s3d://" I get:

Py4JJavaError: An error occurred while calling o214.saveAsTextFile. : java.io.IOException: No FileSystem for scheme: s3 / s3d

Can someone please help me use stocator with pyspark?

gilv commented 5 years ago

@ssdag111 you need to use a pattern of the form cos://<yourbucket>.<service name>/<rest of path>. So based on the configuration you used, the pattern would be

abj = "cos://name_of_bucket.mycos/1.txt"

Also, the problem you have is not Stocator-related, but general usage of how to add an external jar to Spark. It seems the Stocator jar is not loaded for some reason. Please double-check this.
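
For illustration, a minimal sketch of how that pattern fits together from PySpark, assuming the Stocator jar is already on the classpath. The service name mycos and the credential values are the placeholders from the core-site.xml above; the two scheme-registration keys follow the Stocator README and are not in the original config:

# Sketch only: set the Stocator properties on the Hadoop configuration at
# runtime (equivalent to core-site.xml); all values here are placeholders.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.stocator.scheme.list", "cos")
hconf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
hconf.set("fs.stocator.cos.scheme", "cos")
hconf.set("fs.cos.mycos.iam.api.key", "MY API KEY")
hconf.set("fs.cos.mycos.endpoint", "MY ENDPOINT")
hconf.set("fs.cos.mycos.iam.service.id", "ServiceId-XXXXXXXX")

# <bucket>.<service name> addressing, as described above
abj = "cos://name_of_bucket.mycos/1.txt"
rdd.saveAsTextFile(abj)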

ssdag111 commented 5 years ago

Thank you for your response. Okay, I'll try out what you suggested here, but what is a sure-fire way of loading the jar? I cd into the directory where it is located and then run pyspark from there. I could also give the full path to the JAR file if needed.

ssdag111 commented 5 years ago

I just tried it out.. nothing..

abjj = "cos://test-bucket.mycos/1.txt"

leads to the same error as before.

How do I check whether the stocator jar is getting loaded? I mean, is there some way to check for this within Python itself (display versions, etc.)?
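
One rough way to check this from within PySpark, sketched here: ask the driver JVM for the Stocator filesystem class through py4j. If the jar never made it onto the driver classpath, the lookup fails:

# Sketch only: probe the driver JVM for the Stocator class via py4j.
# A py4j error here means the class (and hence the jar) is not visible.
try:
    sc._jvm.java.lang.Class.forName("com.ibm.stocator.fs.ObjectStoreFileSystem")
    print("Stocator jar is on the driver classpath")
except Exception as err:
    print("Stocator jar not found:", err)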

gilv commented 5 years ago

@ssdag111 I suggest you look at the Spark documentation for how to load jars. It's not a Stocator-related issue... pure Spark and how to load an external jar. Maybe your path to the jar is wrong, maybe you need to use some configuration in Spark to point to the jar, maybe some other command to load the jar, etc.

ssdag111 commented 5 years ago

I just added an external jar by stopping the Spark context and then restarting with a new one with the JAR loaded. Still nothing:

from pyspark import SparkConf, SparkContext

sc.stop()
conf = SparkConf().set("spark.jars", "/path to/stocator-1.0.36-SNAPSHOT-ibm-sdk/target/stocator-1.0.36-SNAPSHOT-IBM-SDK.jar")
sc = SparkContext(conf=conf)
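
A possible gotcha here: restarting the SparkContext inside an already-running pyspark session does not restart the driver JVM, so a jar added via spark.jars at that point may never reach the driver classpath. A rough sketch of passing the jar before the JVM starts, from a fresh Python process (the jar path is a placeholder):

# Sketch only: hand --jars to spark-submit before the driver JVM exists;
# PYSPARK_SUBMIT_ARGS must end with "pyspark-shell".
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/stocator-1.0.36-SNAPSHOT-IBM-SDK.jar pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext()  # launched with the jar visible to driver and executors
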
gilv commented 5 years ago

@ssdag111 your exception happens above Stocator... you need to verify all the steps to make sure you properly include the jar in Spark. Are you sure the jar is at the location you specified?
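
That last question can be answered with a one-line sanity check, sketched here with the placeholder path from earlier in the thread:

# Sketch only: confirm the jar file actually exists at the configured path.
import os
print(os.path.exists("/path to/stocator-1.0.36-SNAPSHOT-ibm-sdk/target/stocator-1.0.36-SNAPSHOT-IBM-SDK.jar"))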

ssdag111 commented 5 years ago

Yes... I even cd'd into the directory where it is located and then started Spark, but to no avail.

bambrozio commented 4 years ago

Hi, how was it solved? I'm facing the same issue.

bambrozio commented 4 years ago

Ok, I managed to address it. Here's a snippet as an example that works (in case someone runs into the same issue): https://github.com/bambrozio/snippets/blob/master/cloud/icos/pyspark-icos-stocator.py
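
Pulling the pieces of this thread together, a minimal end-to-end sketch of the shape such a script can take (service name, credentials, and paths are placeholders, and the linked snippet may differ in detail):

# Sketch only: load the Stocator jar before the JVM starts, register the
# cos:// scheme, then write through cos://<bucket>.<service>.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/stocator-1.0.36-SNAPSHOT-IBM-SDK.jar pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stocator-cos-demo").getOrCreate()

# Same Hadoop configuration keys as in the earlier sketch; placeholder values.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.stocator.scheme.list", "cos")
hconf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
hconf.set("fs.stocator.cos.scheme", "cos")
hconf.set("fs.cos.mycos.iam.api.key", "MY API KEY")
hconf.set("fs.cos.mycos.endpoint", "MY ENDPOINT")
hconf.set("fs.cos.mycos.iam.service.id", "ServiceId-XXXXXXXX")

rdd = spark.sparkContext.parallelize(["hello", "world"])
rdd.saveAsTextFile("cos://name_of_bucket.mycos/1.txt")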