Open xiayank opened 7 years ago
@xiayank Could you check whether the following path is no longer valid?
$SPARK_HOME/python/lib/py4j-0.10.1-src.zip
Yes, my version is indeed different. Solved.
My version is py4j-0.10.4-src.zip. After changing the version in the path, it runs. Thanks, TA!
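For anyone else hitting a version mismatch, a small sketch like the following can list which py4j zip actually ships with the local Spark install, so the path does not have to be guessed. The helper name `bundled_py4j_zips` is made up for illustration, and it assumes the usual `$SPARK_HOME/python/lib` layout mentioned above.

```python
import glob
import os


def bundled_py4j_zips(spark_home):
    """Return the py4j source zips found under <spark_home>/python/lib, sorted."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    return sorted(glob.glob(pattern))


# Print whatever is bundled with the Spark install that SPARK_HOME points to.
# Prints nothing if SPARK_HOME is unset or the layout is different.
for path in bundled_py4j_zips(os.environ.get("SPARK_HOME", "")):
    print(path)
```

Whatever file name this prints is the one to use in `PYTHONPATH` instead of a hard-coded `py4j-0.10.1-src.zip`.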
@hackjutsu I ran `sudo pip install -U nltk` to install nltk and got the error below.
I searched for a long time and could not find a fix.
Collecting six (from nltk)
Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, nltk
Found existing installation: six 1.4.1
DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 342, in run
prefix=options.prefix_path,
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 778, in install
requirement.uninstall(auto_confirm=True)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_install.py", line 754, in uninstall
paths_to_remove.remove(auto_confirm)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_uninstall.py", line 115, in remove
renames(path, new_path)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/__init__.py", line 267, in renames
shutil.move(old, new)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
copy2(src, real_dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
copystat(src, dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-xXGrka-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'
I tried `sudo -H pip uninstall nltk` to uninstall and then reinstall it, but it reports `Cannot uninstall requirement nltk, not installed`.
I suggest running Python inside a virtualenv. The package being installed conflicts with a Python package that ships with macOS.
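As a rough illustration of the isolation idea, here is a sketch that creates an environment with the standard-library `venv` module. This is a stand-in: the thread is about the `virtualenv` tool, while `venv` is the Python 3 built-in equivalent, and the helper name `create_env` and the directory name `ENV` are only assumptions for the example.

```python
import os
import venv


def create_env(env_dir, with_pip=True):
    """Create an isolated environment in env_dir and return its interpreter path."""
    venv.create(env_dir, with_pip=with_pip)
    # The interpreter lives in Scripts\ on Windows and bin/ elsewhere.
    if os.name == "nt":
        return os.path.join(env_dir, "Scripts", "python.exe")
    return os.path.join(env_dir, "bin", "python")


# Example (not run here): python_path = create_env("ENV")
# Packages installed with that interpreter's pip stay inside ENV and
# never touch the six/nltk copies managed by macOS.
```

After creating the environment, activate it with `source ENV/bin/activate` and run `pip install nltk` without `sudo`.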
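@xiayank Could you share the code you are testing with? Simplify it so that others can quickly reproduce the problem you ran into.
@xiayank I'll go over how to use a Python virtual environment at Thursday's CodeLab.
@hackjutsu OK, thanks, TA. I'll look into it on my own first.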
@hackjutsu
I used python3 to install and download nltk, and that works. But when I run `generate_word2vec_training_data.py`, it throws `TypeError: cannot use a string pattern on a bytes-like object`. I guess the variable `query` is of type bytes. I tried `query_tokens = cleanData(query.decode())` to convert it to a string, but I got the same error.
I am new to Python. I guess this is caused by the differences between Python 2 and 3.
Here is the log:
TypeError Traceback (most recent call last)
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in <module>()
30 title = entry["title"].lower().encode('utf-8')
31 query = entry["query"].lower().encode('utf-8')
---> 32 query_tokens = cleanData(query)
33
34
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in cleanData(input)
15 def cleanData(input) :
16 #remove stop words
---> 17 list_of_tokens = [i.lower() for i in wordpunct_tokenize(input) if i.lower() not in stop_words ]
18 return list_of_tokens
19
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: cannot use a string pattern on a bytes-like object
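Judging by the traceback, the likely cause is the `.encode('utf-8')` on lines 30 and 31: in Python 3 it returns `bytes`, and a regex compiled from a `str` pattern cannot search `bytes`. The sketch below shows one possible fix, which is to keep the text as `str` all the way through. To stay runnable without NLTK data it uses a plain `re` pattern as a stand-in for `wordpunct_tokenize`, and `STOP_WORDS` and `clean_data` are hypothetical names, not the course code.

```python
import re

# Stand-in for nltk's wordpunct_tokenize so the sketch runs without NLTK.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical stop-word list; the real script takes its list from NLTK.
STOP_WORDS = {"the", "a", "for"}


def clean_data(text):
    """Tokenize text and drop stop words; accepts str, tolerates bytes."""
    if isinstance(text, bytes):
        # Python 3: decode bytes to str before applying a str regex.
        text = text.decode("utf-8")
    return [t.lower() for t in _WORDPUNCT.findall(text) if t.lower() not in STOP_WORDS]


entry = {"query": "The Best Shoes for Running"}
query = entry["query"].lower()  # no .encode('utf-8') in Python 3
print(clean_data(query))  # ['best', 'shoes', 'running']
```

If the output file needs UTF-8, it is usually simpler to encode once when writing (for example `open(path, "w", encoding="utf-8")`) rather than encoding each field before tokenizing.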
@hackjutsu
After activating the Python virtual environment, running the script with `python` works. But when I submit it with `spark-submit`, the log shows the warning `UserWarning: Attempting to work in a virtualenv`, while the path in the exception is
`/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)`, so it still seems to be running with the system Python 3.6. Very strange.
(ENV) NIC@Yan-Mac ~/Documents/504_BankEnd/DemoCode/week7_codelab1 spark-submit --master "local[4]" generate_word2vec_training_data.py ads_0502.txt traning_data_0502.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py:706: UserWarning: Attempting to work in a virtualenv. If you encounter problems, please install IPython inside the virtualenv.
warn("Attempting to work in a virtualenv. If you encounter problems, please "
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in <module>()
30 title = entry["title"].lower().encode('utf-8')
31 query = entry["query"].lower().encode('utf-8')
---> 32 query_tokens = cleanData(query)
33
34
/Users/NIC/Documents/504_BankEnd/DemoCode/week7_codelab1/generate_word2vec_training_data.py in cleanData(input)
15 def cleanData(input) :
16 #remove stop words
---> 17 list_of_tokens = [i.lower() for i in wordpunct_tokenize(input) if i.lower() not in stop_words ]
18 return list_of_tokens
19
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/tokenize/regexp.py in tokenize(self, text)
127 # If our regexp matches tokens, use re.findall:
128 else:
--> 129 return self._regexp.findall(text)
130
131 def span_tokenize(self, text):
TypeError: cannot use a string pattern on a bytes-like object
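That is because `spark-submit` uses the system Python by default. Virtualenv only puts the Python path inside `ENV` at the front of `PATH`. If `spark-submit` does not pick its Python based on `PATH`, it may still end up using the system Python.
-- Update -- A possibly useful reference: http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv/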
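One way to make the choice explicit is Spark's `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, which tell PySpark which interpreter to launch. The sketch below builds an environment that points both at whichever Python is currently running (the virtualenv's, if it is activated). The helper name `spark_env` is made up for this example, and whether this resolves the exact setup in this thread is an assumption.

```python
import os
import sys


def spark_env(base_env=None):
    """Return a copy of the environment with PySpark pinned to this interpreter."""
    env = dict(os.environ if base_env is None else base_env)
    # sys.executable is the virtualenv's python when the script runs inside it.
    env["PYSPARK_PYTHON"] = sys.executable
    env["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return env


# Example launch (not run here):
# import subprocess
# subprocess.run(
#     ["spark-submit", "--master", "local[4]", "generate_word2vec_training_data.py"],
#     env=spark_env(),
# )
```

The same effect can be had in the shell by exporting the two variables after activating the virtualenv and before calling `spark-submit`.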
I can run the Spark job with `spark-submit`, but running the script directly with `python` fails with `ModuleNotFoundError: No module named 'py4j'`.
Here is the log, and these are my environment variables.
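A plain `python` run does not get the paths that `spark-submit` sets up, so `py4j` has to be made importable by hand. A minimal sketch, assuming the standard `$SPARK_HOME/python/lib/py4j-*-src.zip` layout discussed earlier in the thread; the helper name `add_pyspark_to_sys_path` is hypothetical.

```python
import glob
import os
import sys


def add_pyspark_to_sys_path(spark_home):
    """Put Spark's python dir and its bundled py4j zip on sys.path.

    Returns the list of paths that were newly added.
    """
    python_dir = os.path.join(spark_home, "python")
    lib_dir = os.path.join(python_dir, "lib")
    zips = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not zips:
        raise FileNotFoundError("no py4j-*-src.zip under " + lib_dir)
    added = []
    for path in (python_dir, zips[-1]):
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


# Example (not run here), placed before `import pyspark`:
# add_pyspark_to_sys_path(os.environ["SPARK_HOME"])
```

Setting `PYTHONPATH` to the same two entries in the shell profile achieves the same thing without touching the script.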