Cannot run a Spark job directly with python

xiayank commented 7 years ago

I can run the Spark job with spark-submit, but running it directly with python fails with ModuleNotFoundError: No module named 'py4j'. Here is the log:

NIC@Yan-Mac  ~/Documents/504_BankEnd/DemoCode/week7_codelab1  python demo0.py demo1.txt
Traceback (most recent call last):
  File "demo0.py", line 2, in <module>
    from pyspark import SparkContext
  File "/usr/local/spark/python/pyspark/__init__.py", line 44, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/spark/python/pyspark/context.py", line 29, in <module>
    from py4j.protocol import Py4JError
ModuleNotFoundError: No module named 'py4j'

Here are my environment variables.

export PATH="/usr/local/git/bin:/sw/bin/:/usr/local/bin:/usr/local/:/usr/local/sbin:/usr/local/mysql/bin:$PATH"
export SPARK_HOME=/usr/local/spark/
export PATH="$SPARK_HOME/bin:$PATH"
export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

# added by Anaconda3 4.3.1 installer
export PATH="/Users/NIC/anaconda/bin:$PATH"

hackjutsu commented 7 years ago

@xiayank Could you check whether the path below is no longer valid?

$SPARK_HOME/python/lib/py4j-0.10.1-src.zip

xiayank commented 7 years ago

Yes, my version is indeed not that one. Solved.

xiayank commented 7 years ago

My version is py4j-0.10.4-src.zip. After changing the version in PYTHONPATH, it runs. Thanks, TA!
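For anyone hitting the same ModuleNotFoundError: a minimal sketch of one way to avoid hardcoding the py4j version, assuming the zip lives under $SPARK_HOME/python/lib as in the profile above. The helper name add_pyspark_to_path is made up for illustration and is not part of Spark.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put PySpark and its bundled py4j zip on sys.path; return the zip used."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    # Match whatever py4j version ships with this Spark install
    # (py4j-0.10.1, py4j-0.10.4, ...) instead of naming one explicitly.
    matches = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not matches:
        raise FileNotFoundError("no py4j-*-src.zip found under " + lib_dir)
    # If several zips are present, the lexicographically last one is taken.
    py4j_zip = matches[-1]
    for entry in (os.path.join(spark_home, "python"), py4j_zip):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return py4j_zip
```

It would be called as add_pyspark_to_path(os.environ["SPARK_HOME"]) before the line `from pyspark import SparkContext`.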

xiayank commented 7 years ago

@hackjutsu I installed nltk with sudo pip install -U nltk and got the error below. I searched for a long time and could not find a fix.

Collecting six (from nltk)
  Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
  Found existing installation: six 1.4.1
    DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
    Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 342, in run
    prefix=options.prefix_path,
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 778, in install
    requirement.uninstall(auto_confirm=True)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_install.py", line 754, in uninstall
    paths_to_remove.remove(auto_confirm)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_uninstall.py", line 115, in remove
    renames(path, new_path)
  File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/__init__.py", line 267, in renames
    shutil.move(old, new)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
    copy2(src, real_dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
    copystat(src, dst)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
    os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-xXGrka-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

I tried to uninstall it with sudo -H pip uninstall nltk and then reinstall, but it reports Cannot uninstall requirement nltk, not installed.

hackjutsu commented 7 years ago

I suggest running Python inside a virtualenv. The package you are installing conflicts with a Python package that ships with macOS.
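A quick way to check which interpreter a script really runs under, and whether it is inside a virtual environment. This is only a sketch; the function name in_virtualenv is made up for illustration.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtualenv or venv."""
    # Older releases of the virtualenv tool set sys.real_prefix; the venv
    # module instead makes sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```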

@xiayank Could you share the code you are testing with? Please simplify it so that others can quickly reproduce the problem.

hackjutsu commented 7 years ago

@xiayank I'll go over how to use a Python virtual environment at Thursday's CodeLab.

xiayank commented 7 years ago

@hackjutsu OK, thanks! I'll look into it on my own first.

xiayank commented 7 years ago

@hackjutsu I used python3 to install and download nltk, and that works. But when I run generate_word2vec_training_data.py, it throws an exception: TypeError: cannot use a string pattern on a bytes-like object. I guess the variable query is of type bytes. I tried query_tokens = cleanData(query.decode()) to convert it to a string, but I got the same error. I am new to Python. I guess this is because of the difference between Python 2 and 3. Here is the log info:

TypeError                                 Traceback (most recent call last)
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in <module>()
     30                     title = entry["title"].lower().encode('utf-8')
     31                     query = entry["query"].lower().encode('utf-8')
---> 32                     query_tokens = cleanData(query)
     33
     34

/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in cleanData(input)
     15 def cleanData(input) :
     16     #remove stop words
---> 17     list_of_tokens = [i.lower() for i in wordpunct_tokenize(input) if i.lower() not in stop_words ]
     18     return list_of_tokens
     19

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)
    127         # If our regexp matches tokens, use re.findall:
    128         else:
--> 129             return self._regexp.findall(text)
    130
    131     def span_tokenize(self, text):

TypeError: cannot use a string pattern on a bytes-like object
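A minimal sketch of the bytes versus str issue in the traceback above. In Python 3, .encode('utf-8') returns bytes, and a regular expression compiled from a str pattern refuses bytes input. The sketch assumes the pattern below is the one behind nltk's wordpunct_tokenize, and STOP_WORDS and clean_data are stand-ins for the script's stop_words and cleanData, not the real code.

```python
import re

# Assumed to match the pattern behind nltk's wordpunct_tokenize.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")
# Hypothetical stand-in for the script's stop_words.
STOP_WORDS = {"the", "a", "of"}


def clean_data(text):
    """Tokenize text and drop stop words; bytes input is decoded as UTF-8."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    tokens = TOKEN_PATTERN.findall(text)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


query = "The Price of a Laptop".lower()   # kept as str, no .encode('utf-8')
print(clean_data(query))
print(clean_data(query.encode("utf-8")))  # bytes input is decoded first
```

Under Python 3, dropping the .encode('utf-8') calls on title and query would keep both values as str, so no decode step is needed before tokenizing.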

xiayank commented 7 years ago

@hackjutsu After activating the Python virtual environment, running the script with `python` works. But when I submit it with spark-submit, the log shows the warning `UserWarning: Attempting to work in a virtualenv`, while the exception points to `/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)`, so it still seems to run with the system Python 3.6. Very strange.

(ENV)  NIC@Yan-Mac  ~/Documents/504_BankEnd/DemoCode/week7_codelab1  spark-submit --master "local[4]" generate_word2vec_training_data.py ads_0502.txt traning_data_0502.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py:706: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please install IPython inside the virtualenv.
  warn("Attempting to work in a virtualenv. If you encounter problems, please "
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in <module>()
     30                     title = entry["title"].lower().encode('utf-8')
     31                     query = entry["query"].lower().encode('utf-8')
---> 32                     query_tokens = cleanData(query)
     33
     34

/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in cleanData(input)
     15 def cleanData(input) :
     16     #remove stop words
---> 17     list_of_tokens = [i.lower() for i in wordpunct_tokenize(input) if i.lower() not in stop_words ]
     18     return list_of_tokens
     19

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)
    127         # If our regexp matches tokens, use re.findall:
    128         else:
--> 129             return self._regexp.findall(text)
    130
    131     def span_tokenize(self, text):

TypeError: cannot use a string pattern on a bytes-like object

hackjutsu commented 7 years ago

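That is because spark-submit uses the system Python by default. Virtualenv only puts the ENV's python directory at the front of PATH. If spark-submit does not pick Python based on PATH, it can still end up using the system Python.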

-- Update -- A possibly relevant post: http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/
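One thing that may be worth trying: PySpark reads the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to decide which interpreter to launch, and the profile earlier in this thread sets PYSPARK_DRIVER_PYTHON=ipython, which would be consistent with the IPython warning in the log. Below is a sketch that builds an environment pinning both variables to the current (virtualenv) interpreter. The helper name spark_submit_env is made up, and I have not verified this against the setup in this thread.

```python
import os
import sys


def spark_submit_env(python_executable=sys.executable, base_env=None):
    """Return a copy of the environment that pins PySpark to one interpreter."""
    env = dict(os.environ if base_env is None else base_env)
    # PYSPARK_PYTHON selects the interpreter for the workers, and for the
    # driver too unless PYSPARK_DRIVER_PYTHON overrides it.
    env["PYSPARK_PYTHON"] = python_executable
    env["PYSPARK_DRIVER_PYTHON"] = python_executable
    # The 'notebook' option from the profile only makes sense for ipython.
    env.pop("PYSPARK_DRIVER_PYTHON_OPTS", None)
    return env
```

From inside the activated ENV this could be passed along as subprocess.call(["spark-submit", "--master", "local[4]", "generate_word2vec_training_data.py", "ads_0502.txt", "traning_data_0502.txt"], env=spark_submit_env()). Exporting the same two variables in the shell before calling spark-submit should have the same effect.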

BitTigerInst / BitTiger-CS504-FAQ

Cannot run a Spark job directly with python #39