yelled1 opened this issue 3 years ago
@yelled1 - I removed the hard dependency on PySpark; hopefully this will solve the issue.
The hard PySpark dependency caused an issue on another project as well.
I just published ceja v0.3.0. It should be on PyPI.
Can you try again and let me know if the new version solves your issue?
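For reference, the idea is roughly this (a sketch assuming a setuptools-style build; the actual ceja packaging may differ):

```python
# Sketch: move pyspark out of the hard requirements into an optional extra,
# so `pip install ceja` no longer pulls it in. Illustrative only.
from setuptools import setup, find_packages

setup(
    name="ceja",
    version="0.3.0",
    packages=find_packages(),
    install_requires=["jellyfish"],           # still a hard runtime dependency
    extras_require={"pyspark": ["pyspark"]},  # opt-in: pip install 'ceja[pyspark]'
)
```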
@MrPowers, pyspark did not download, which is great & thanks a bunch, but I got the jellyfish error below.
Still, I do have it installed:
[hadoop@ip-172-31-83-44 ~]$ sudo /usr/bin/pip3 install jellyfish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Requirement already satisfied: jellyfish in /usr/local/lib/python3.7/site-packages
Also note that this is NOT an issue on WSL2, but it is on EMR (WSL2 was a reinstall & EMR was a fresh cluster).
I am using findspark.py on both before initiating Spark from vim.
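For context, the bootstrap is the usual findspark pattern (a generic sketch, not the exact session code):

```python
import findspark

findspark.init()  # locate the cluster's Spark install and put it on sys.path

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
```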
`import jellyfish` works.
Just confirmed that the same error happens under spark-submit.
>>> df_m.show()
21/03/28 15:11:48 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 30005, ip-172-31-86-169.ec2.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
**ModuleNotFoundError: No module named 'jellyfish'**
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
21/03/28 15:11:48 ERROR TaskSetManager: Task 0 in stage 7.0 failed 4 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 441, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
command = serializer._read_with_length(file)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_leng
th
return self.loads(obj)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
return pickle.loads(obj, encoding=encoding)
File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
__import__(name)
ModuleNotFoundError: No module named 'jellyfish'
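The key detail in the trace is that the import fails inside the Python worker on the executors, even though it succeeds on the driver. A quick way to confirm that split (a hypothetical probe, not output from the thread):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def probe(_rows):
    # Runs on an executor, not the driver, so it tests the worker environment.
    try:
        import jellyfish
        yield "ok: " + jellyfish.__file__
    except ImportError as exc:
        yield "missing: " + str(exc)

print(spark.sparkContext.parallelize([0], 1).mapPartitions(probe).collect())
```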
The below will work, but this means it only works for spark-submit, while the VSCode or vi REPL will NOT.
mkdir $HOME/lib
pip3 install ceja -t $HOME/lib/
cd $HOME/lib/
zip -r ~/include_py_modules.zip .
cd $HOME/
/usr/bin/nohup spark-submit --packages io.delta:delta-core_2.12:0.7.0 --py-files $HOME/include_py_modules.zip --driver-memory 8g --executor-memory 8g my_python_script.py > ~/output.log 2>&1 &
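In principle the same zip can also be shipped from inside a session with addPyFile, which would cover the REPL case too (an untested sketch; paths are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In-session counterpart of spark-submit --py-files: distribute the zipped
# dependencies to the executors before any UDF from them is used.
spark.sparkContext.addPyFile("/home/hadoop/include_py_modules.zip")

import ceja  # should now also resolve inside executor tasks
```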
Yea, perhaps vendoring Jellyfish is the best path forward to avoid the transitive dependency issue. Python packaging is difficult and even harder when Spark is added to the mix.
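Vendoring would look roughly like this (an illustrative layout, not actual ceja code):

```python
# Hypothetical package layout:
#   ceja/
#     __init__.py
#     _vendor/
#       __init__.py
#       jellyfish/   <- copied jellyfish source tree
#
# ceja's own imports then prefer an installed jellyfish and fall back to
# the bundled copy, so executors never need a separate install.
try:
    import jellyfish
except ImportError:
    from ceja._vendor import jellyfish
```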
When I `pip install ceja`, I automatically get pyspark-3.1.1.tar.gz (212.3 MB), which is a problem because it's the wrong version (I'm using 3.0.0 on both EMR & WSL). Even when I eliminate it, I still get errors on EMR. Can this behavior be stopped?
When I do the above & attempt to use it:
the jws_???_m columns are created, and I can see the `columns`, but `show` fails, and attempting the install fails.
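That pattern is consistent with Spark's lazy evaluation: adding the column only builds the plan on the driver, and the jellyfish import happens when an action finally runs the UDF on the executors. An illustrative sketch (assuming ceja's jaro_winkler_similarity; the data is made up):

```python
from pyspark.sql import SparkSession
import ceja

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("mary", "marry")], ["a", "b"])

# Lazy: no UDF executes here, so no jellyfish import is attempted yet.
df2 = df.withColumn("jws", ceja.jaro_winkler_similarity(df["a"], df["b"]))
print(df2.columns)  # works: schema metadata is resolved on the driver

# Action: executors now run the UDF and must import jellyfish, raising
# ModuleNotFoundError if it is missing from their environment.
df2.show()
```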