commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License

Incompatible Architecture #34

Closed swetepete closed 1 year ago

swetepete commented 2 years ago

I am using a 2021 iMac with the Apple M1 chip and macOS Monterey 12.4.

So far, to set up PySpark, I have installed pyspark with pip3, cloned this repo and installed its requirements.txt, and downloaded Java from its homepage. I'm using Python 3.8.9.

I added the path of the pip3 installation of pyspark to SPARK_HOME in my .zshrc and sourced it:

% echo $SPARK_HOME
/Users/julius/Library/Python/3.8/lib/python/site-packages/pyspark
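
(In case it's relevant, that path can be located from Python itself; an illustrative snippet:)

# illustrative: locate the pip-installed pyspark package directory
import pyspark
print(pyspark.__path__[0])  # the directory SPARK_HOME should point to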

I then executed the following command:

$SPARK_HOME/bin/spark-submit ./server_count.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt servernames

I had to execute this from inside the cc-pyspark repo; otherwise the script server_count.py could not be found.

It returns this error message:

julius@Juliuss-iMac cc-pyspark % $SPARK_HOME/bin/spark-submit ./server_count.py \
        --num_output_partitions 1 --log_level WARN \
        ./input/test_warc.txt servernames
Traceback (most recent call last):
  File "/Users/julius/cc-pyspark/server_count.py", line 1, in <module>
    import ujson as json
ImportError: dlopen(/Users/julius/Library/Python/3.8/lib/python/site-packages/ujson.cpython-38-darwin.so, 0x0002): tried: '/Users/julius/Library/Python/3.8/lib/python/site-packages/ujson.cpython-38-darwin.so' (mach-o file, but is an incompatible architecture (have 'arm64', need 'x86_64'))
22/07/06 15:04:13 INFO ShutdownHookManager: Shutdown hook called
22/07/06 15:04:13 INFO ShutdownHookManager: Deleting directory /private/var/folders/xv/yzpjb77s2qg14px8dc7g4m_80000gn/T/spark-80c476e9-b5ba-4710-b292-e367dd387ece

There seems to be something wrong with my installation of "ujson": it was built for arm64, but PySpark apparently expects x86_64. Is that correct?

What is the simplest way to fix this issue? Should I try to run PySpark under some kind of x86 emulation like Rosetta? Has PySpark not been designed for the M1 chip?
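
In case it helps with diagnosis, here is a quick, illustrative check of which architecture the current interpreter runs under (standard library only):

# illustrative: which architecture is this Python process running as?
import platform
print(platform.machine())  # 'arm64' natively on an M1, 'x86_64' under Rosetta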

Is there a chance this is the fault of my Java installation? I took the first one offered; it seemed to say x86, but when I tested running PySpark on its own, it seemed to work fine.

Thanks very much

sebastian-nagel commented 2 years ago

Hi @swetepete,

PySpark also runs on ARM: we use it in production on a Hadoop 3.2 / 3.3 cluster based on Apache Bigtop, sometimes even on a mixed cluster (ARM and AMD64 machines).

The issue with ujson seems to be known; see the discussion on Stack Overflow or ultrajson#456.

Since ujson is an API-compatible but more performant replacement for the json module, you might work around the issue with:

try:
    import ujson as json  # faster drop-in replacement for the standard json module
except ImportError:
    import json  # fall back to the standard library if the compiled ujson cannot be loaded
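
With that fall-back in place the script still runs when the compiled ujson extension cannot be loaded, just with the slower standard-library json.
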
swetepete commented 2 years ago

Thank you. That, together with pip-installing psutil, allowed the command to execute successfully.

The linked bug is tagged as "completed" - should I open a new bug with ujson's developers, seeing as there may be some new compatibility issue they don't know about, or is this something PySpark might be able to address?

Thank you

sebastian-nagel commented 2 years ago

> open a new bug with ujson's developers

Nothing I can answer there. After a closer look: the issue was fixed in ujson 5.0 and upwards, so first make sure that the latest ujson version is installed and that the issue is still reproducible.
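
For example, an illustrative check of the installed version:

# illustrative: confirm which ujson is installed and where it was loaded from
import ujson
print(ujson.__version__)  # the fix is in 5.0 and upwards
print(ujson.__file__)     # shows which site-packages copy is actually in use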

> is this something PySpark might be able to address?

If you mean "cc-pyspark": yes, we could add the work-around using the json module as a fall-back. But that's not a nice fix: it makes the code less readable and less performant.
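
If we did, one way to limit the readability cost would be to keep the fall-back in a single helper module (a hypothetical sketch, not part of cc-pyspark):

# hypothetical helper module, e.g. cc_json.py (not part of cc-pyspark)
try:
    import ujson as json  # fast C implementation, when it loads
except ImportError:
    import json           # standard-library fall-back

# each job script would then use:  from cc_json import json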

sebastian-nagel commented 1 year ago

Closing - a work-around exists and the underlying issue in ujson is resolved.