MrPowers / ceja

PySpark phonetic and string matching algorithms
MIT License

pip install downloads pyspark by default & fails to work #3

Open yelled1 opened 3 years ago

yelled1 commented 3 years ago

When I pip install ceja, I automatically get pyspark-3.1.1.tar.gz (212.3MB), which is a problem because it is the wrong version (I'm using 3.0.0 on both EMR & WSL). Even after I uninstall it, I still get errors on EMR. Can this behavior be stopped?

[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 install ceja
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting ceja
  Downloading https://files.pythonhosted.org/packages/c6/80/f372c62a83175f4c54229474f543aeca3344f4c64aab4bcfe7cf05f50cbf/ceja-0.2.0-py3-none-any.whl
Collecting pyspark>2.0.0 (from ceja)
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
    100% |████████████████████████████████| 212.3MB 6.3kB/s
Collecting jellyfish<0.9.0,>=0.8.2 (from ceja)
  Downloading https://files.pythonhosted.org/packages/04/3f/d03cb056f407ef181a45569255348457b1a0915fc4eb23daeceb930a68a4/jellyfish-0.8.2.tar.gz (134kB)
    100% |████████████████████████████████| 143kB 9.1MB/s
Collecting py4j==0.10.9 (from pyspark>2.0.0->ceja)
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
    100% |████████████████████████████████| 204kB 6.5MB/s
Installing collected packages: py4j, pyspark, jellyfish, ceja
  Running setup.py install for pyspark ... done
  Running setup.py install for jellyfish ... done
Successfully installed ceja-0.2.0 jellyfish-0.8.2 py4j-0.10.9 pyspark-3.1.1

[hadoop@ip-172-31-89-109 ~]$ sudo /usr/bin/pip3 uninstall pyspark
Proceed (y/n)? y
..(snip)..
  Successfully uninstalled pyspark-3.1.1

When I do the above and attempt to use the DataFrame:

>>> df_m.columns
['guid_consumer_hashed_df10', 'guid_customer_hashed_df10', 'guidr_m', 'jws_fnm_m', 'jws_lnm_m', 'gender_m', 'state_m', 'zip3_m', 'soundex_fnm_m', 'lev_gender_m', 'lev_state_m', 'lev_zip3_m', 'lev_soundex_fnm_m']

The jws_*_m columns are created with:

...     .withColumn(
...         "jws_fnm_m",
...         ceja.jaro_winkler_similarity(f.col("firstname_df10"), f.col("firstname_df4")),
...     )
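For context on what that UDF computes on the executors (and why they need jellyfish importable): Jaro-Winkler is the classic Jaro similarity boosted by a shared-prefix bonus. A minimal pure-Python sketch of the metric itself, not ceja's or jellyfish's actual code:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matched chars, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # max distance two chars may sit apart
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched chars that appear in a different order.
    t, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: Jaro plus a bonus for a common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

For example, `jaro_winkler("MARTHA", "MARHTA")` gives the textbook ~0.9611. ceja wraps the (faster) jellyfish implementation of this in a Spark UDF, which is exactly why each executor's Python must be able to `import jellyfish`.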

I can see the columns, but .show() fails:

>>> df_m.show()
21/03/26 06:01:50 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 40007, ip-172-31-80-99.ec2.internal, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616735627370_0001/container_1616735627370_0001_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'jellyfish'

Attempting to install it fails (note the typo below, jellifish instead of jellyfish, which is why pip finds no matching distribution):

$ sudo /usr/bin/pip3 install jellifish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Collecting jellifish
  Could not find a version that satisfies the requirement jellifish (from versions: )
No matching distribution found for jellifish
MrPowers commented 3 years ago

@yelled1 - I removed the hard dependency on PySpark, hopefully this will solve the issue.

The hard PySpark dependency caused an issue on another project as well.

I just published ceja v0.3.0. It should be on PyPI.

Can you try again and let me know if the new version solves your issue?
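For anyone curious what "removing the hard dependency" means in practice: the usual fix is to stop listing pyspark in the package's install requirements and import it defensively, since the cluster runtime (EMR here) already provides it. A hypothetical sketch of that soft-dependency pattern, not ceja's actual source:

```python
# Soft-dependency pattern (illustrative, not ceja's actual code): use the
# runtime-provided pyspark if present, but never force pip to download a
# second, possibly version-mismatched, copy at install time.
try:
    from pyspark.sql.functions import udf
    HAVE_PYSPARK = True
except ImportError:
    udf = None
    HAVE_PYSPARK = False


def require_pyspark() -> None:
    """Fail fast with a clear message when a Spark-backed feature is used."""
    if not HAVE_PYSPARK:
        raise ImportError(
            "pyspark is required for this feature; "
            "install it or run inside a Spark runtime"
        )
```

The library then imports cleanly everywhere, and only the Spark-backed functions raise if pyspark is genuinely absent.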

yelled1 commented 3 years ago

@MrPowers, pyspark no longer downloads, which is great & thanks a bunch, but I still get the jellyfish error below, even though I do have it installed:

[hadoop@ip-172-31-83-44 ~]$ sudo /usr/bin/pip3 install jellyfish
WARNING: Running pip install with root privileges is generally not a good idea. Try `pip3 install --user` instead.
Requirement already satisfied: jellyfish in /usr/local/lib/python3.7/site-packages

Also note that this is NOT an issue on WSL2, but it is on EMR (the WSL2 environment was a reinstall and the EMR cluster was fresh). I use findspark on both before initializing Spark, and import jellyfish works in the REPL. I just confirmed that the same error also happens under spark-submit.

>>> df_m.show()
21/03/28 15:11:48 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 30005, ip-172-31-86-169.ec2.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
**ModuleNotFoundError: No module named 'jellyfish'**

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:503)
        at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
        at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

21/03/28 15:11:48 ERROR TaskSetManager: Task 0 in stage 7.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 441, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 589, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 447, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 254, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/worker.py", line 74, in read_command
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/serializers.py", line 458, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1616943682272_0001/container_1616943682272_0001_01_000004/pyspark.zip/pyspark/cloudpickle.py", line 1110, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'jellyfish'
yelled1 commented 3 years ago

The workaround below works, but only for spark-submit; an interactive session (VSCode or vi REPL) will NOT pick up the zip.

mkdir $HOME/lib
pip3 install ceja -t $HOME/lib/
cd $HOME/lib/
zip -r  ~/include_py_modules.zip .
cd $HOME/

/usr/bin/nohup spark-submit --packages io.delta:delta-core_2.12:0.7.0 --py-files $HOME/include_py_modules.zip --driver-memory 8g --executor-memory 8g my_python_script.py > ~/output.log 2>&1 &
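The mkdir/pip/zip steps above can also be scripted; a stdlib-only sketch (the ~/lib and include_py_modules.zip names are just the ones from the workaround above):

```python
import os
import zipfile


def zip_py_modules(src_dir: str, out_zip: str) -> None:
    """Zip a pip --target directory so spark-submit --py-files can ship it."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to src_dir so that 'import jellyfish'
                # resolves against the root of the zip on the executors.
                zf.write(path, os.path.relpath(path, src_dir))
```

e.g. after `pip3 install ceja -t ~/lib/`, call `zip_py_modules(os.path.expanduser("~/lib"), os.path.expanduser("~/include_py_modules.zip"))` and pass the zip to `--py-files` as above.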
MrPowers commented 3 years ago

Yeah, perhaps vendoring jellyfish is the best path forward to avoid the transitive-dependency issue. Python packaging is difficult, and even harder when Spark is added to the mix.