gallantlab / cottoncandy

sugar for s3
http://gallantlab.github.io/cottoncandy/
BSD 2-Clause "Simplified" License

Issue while running cottoncandy in AWS EMR as part of spark job #72

Open DeepakSahoo-Reflektion opened 5 years ago

DeepakSahoo-Reflektion commented 5 years ago

I am trying to use cottoncandy to upload my numpy arrays to S3 from code that runs inside an AWS EMR Spark job, but at runtime I get the errors below:

OSError: [Errno [Errno 13] Permission denied: '/home/.config'] <function subimport at 0x7f87e0167320>: ('cottoncandy',)

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

OSError: [Errno 13] Permission denied: '/home/.config'

at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
anwarnunez commented 5 years ago

Hi, thanks for submitting the bug report!

Unfortunately, I am not familiar with AWS EMR at all. My guess is that the code is being executed by a user who does not have a $HOME. For this reason, cottoncandy is looking for the user configuration in an unusual place (/home/.config) and cannot find it. The configuration is normally located in $HOME/.config/cottoncandy. Do you know whether AWS EMR submits the Spark jobs under a system user? There are a couple of ways to solve this. The simplest might be to fall back to the default configuration whenever the user configuration cannot be found.
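A minimal sketch of that fallback idea. The paths, file names, and the `load_config` helper here are illustrative assumptions, not cottoncandy's actual config layout: the point is only that an unreadable user config (e.g. a bad $HOME) should degrade to the packaged defaults instead of raising OSError.

```python
import os
import configparser

def load_config(default_path, user_path=None):
    """Load a config, falling back to packaged defaults when the
    user config is missing or unreadable (e.g. no valid $HOME).
    Hypothetical helper; paths are illustrative only."""
    config = configparser.ConfigParser()
    config.read(default_path)  # defaults shipped with the package
    if user_path is None:
        user_path = os.path.join(
            os.path.expanduser('~'), '.config', 'cottoncandy', 'options.cfg')
    try:
        with open(user_path) as f:
            config.read_file(f)  # user overrides, if readable
    except OSError:
        pass  # no $HOME or unreadable path: keep package defaults
    return config
```

With this shape, a system user without a home directory would silently run on the defaults rather than crashing the Spark task.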

DeepakSahoo-Reflektion commented 5 years ago

I just fixed the issue by creating the /home/.config directory. But now I have another issue when running inside Spark: "pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects". Looks like the object can't be serialized.
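This error is reproducible outside Spark: any object that holds a raw thread lock refuses to pickle. The `S3Client` class below is a stand-in for a client object (such as a boto3-backed interface), not cottoncandy's actual class; it only demonstrates the mechanism behind the TypeError.

```python
import pickle
import threading

class S3Client:
    """Stand-in for a client object that holds a thread
    lock internally, like many network-client libraries."""
    def __init__(self):
        self._lock = threading.Lock()

client = S3Client()
try:
    pickle.dumps(client)
except TypeError as err:
    print(err)  # locks cannot be serialized
```

Spark hits this because it pickles the closure of each task, so any client object captured in that closure gets dragged into serialization.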

anwarnunez commented 5 years ago

Hello,

Great that you solved the issue by creating a /home/.config directory (though it's odd that the path is not /home/$USER/.config).

With respect to the PicklingError you ran into: is it because the object you're trying to pickle is not serializable? Or is it because the Spark framework somehow pickles the cottoncandy.interface?

I'm sorry, I don't have access to, nor am I familiar with, EMR or Spark. I'm happy to help as much as I can, though! If you find a solution, I'd be happy to include fixes.

anwarnunez commented 5 years ago

p.s. I found this: https://stackoverflow.com/questions/40674544/apache-spark-reads-for-s3-cant-pickle-thread-lock-objects

It suggests that the issue is the serialization of the cottoncandy.interface, and recommends using mapPartitions instead of flatMap.
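The pattern from that answer, sketched without a live Spark cluster: keep the unpicklable client out of the task's closure by constructing it inside the function that `mapPartitions` calls, so only the (picklable) function is shipped to the workers. `S3Client` and `upload_partition` are hypothetical stand-ins, and note that Spark actually uses cloudpickle rather than plain pickle, so this is only an approximation of its behavior.

```python
import pickle
import threading

class S3Client:
    """Stand-in for an interface object holding a thread lock."""
    def __init__(self):
        self._lock = threading.Lock()
    def upload(self, item):
        return ('uploaded', item)

def upload_partition(items):
    # Construct the client on the worker, inside the task, so it
    # is never captured in the serialized closure.
    client = S3Client()
    for item in items:
        yield client.upload(item)

# The top-level function pickles fine (by reference), even though
# the client it creates internally would not. With Spark it would
# be used as: rdd.mapPartitions(upload_partition)
task = pickle.loads(pickle.dumps(upload_partition))
print(list(task([1, 2])))
```

One client per partition also amortizes connection setup, instead of paying it once per record as a per-element `map` would.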