fxsjy / jieba

"Jieba" Chinese text segmentation
MIT License

On a pyspark + hadoop cluster, loading a custom dictionary file always fails: the path of the custom dictionary cannot be found #975

Closed NYcleaner closed 1 year ago

NYcleaner commented 2 years ago

The log and traceback (reformatted for readability):

```
INFO SparkContext:54 - Added file hdfs://xxxx/user_dict_2022.txt at hdfs://xxxx/user_dict_2022.txt with timestamp 1661761280375
Utils:54 - Fetching hdfs://xxxx/user_dict_2022.txt to /data/data16/yarn/nm2/usercache/o_zzzz/appcache/application_1655780863565_yyyy/spark-f9b4a2ca-aeba-45d7-ae8c-f3a40ddbab15/userFiles-9da5a8ee-8220-41dd-bd77-73aee4e92042/fetchFileTemp9124107129410970331.tmp
Traceback (most recent call last):
  File "project1_jieba_train_online.py", line 137, in <module>
    jieba.load_userdict(user_dict_path)
  File "/data/data13/yarn/nm2/usercache/o_zzzz/appcache/application_1655780863565_yyyy/container_e4075_1655780863565_3544813_01_000001/py3/lib/python3.7/site-packages/jieba/__init__.py", line 398, in load_userdict
    f = open(f, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://xxxx/user_dict_2022.txt'
ERROR ApplicationMaster:70 - User application exited with status 1
```

The error message is shown above.

1. The absolute HDFS path of the custom dictionary file has already been added via the `--files` argument of spark-submit:

```
--files hdfs://xxxx/user_dict_2022.txt
```

2. The code in the .py file that loads the custom dictionary:

```python
jieba.initialize()

user_dict_path = 'hdfs://xxxx/user_dict_2022.txt'

ss.sparkContext.addFile(user_dict_path)

jieba.load_userdict(user_dict_path)

main(ss, jieba)
```

I looked through earlier issues and found none like this, so I'm asking for help here. Thanks.
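The traceback shows that `jieba.load_userdict` ultimately calls Python's plain `open()`, which cannot read an `hdfs://` URL. After `sparkContext.addFile()`, Spark ships the file to a local directory on each node, and `pyspark.SparkFiles.get()` returns that local path. A minimal sketch of the pattern (assuming the job runs under a SparkSession `ss` as in the snippet above; `local_dict_path` is a hypothetical helper name):

```python
import os

def local_dict_path(hdfs_url):
    """Map an hdfs:// URL to the node-local copy created by addFile().

    SparkFiles.get() expects only the bare file name, not the full URL.
    """
    from pyspark import SparkFiles  # assumption: running inside a PySpark job
    return SparkFiles.get(os.path.basename(hdfs_url))

# Hypothetical usage in the driver program:
#   ss.sparkContext.addFile("hdfs://xxxx/user_dict_2022.txt")
#   jieba.load_userdict(local_dict_path("hdfs://xxxx/user_dict_2022.txt"))
```

The key point is that the path handed to `jieba.load_userdict` must be a local filesystem path, not the HDFS URL that was passed to `addFile`.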

hvgdfx commented 1 year ago

Can jieba read an hdfs:// path directly? You should probably read it through the Hadoop FileSystem API instead.
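One way to follow this suggestion without dropping into the Hadoop FileSystem API is to read the dictionary through Spark itself and write a local temp file that plain `open()` (and therefore `jieba.load_userdict`) can read. A minimal sketch, assuming the dictionary is small enough to collect on the driver (`fetch_dict_to_local` is a hypothetical helper name):

```python
import os
import tempfile

def fetch_dict_to_local(sc, hdfs_url):
    """Read a small dictionary file off HDFS via Spark and write it to a
    local temp file, returning the local path."""
    lines = sc.textFile(hdfs_url).collect()  # small file: collect() is fine
    fd, local_path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    return local_path

# Hypothetical usage:
#   path = fetch_dict_to_local(ss.sparkContext, "hdfs://xxxx/user_dict_2022.txt")
#   jieba.load_userdict(path)
```

This only moves the bytes to the driver; if executors also need the dictionary, `addFile`/`SparkFiles` or `--archives` is the more idiomatic route.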

NYcleaner commented 1 year ago

The final solution: use `--archives hdfs://path/file.zip#alias`, then `jieba.load_userdict(alias)`.
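The accepted fix can be sketched as a submit command (paths, the archive name, and the alias `dict` below are placeholders; the zip is assumed to contain `user_dict_2022.txt`). YARN unpacks the archive into a directory named after the alias in each container's working directory:

```
spark-submit \
  --master yarn \
  --archives hdfs://xxxx/user_dict.zip#dict \
  project1_jieba_train_online.py
```

Inside the job, the dictionary is then loaded by its relative path. Note that the alias names the extraction *directory*, so depending on the zip layout the call is typically `jieba.load_userdict("dict/user_dict_2022.txt")` rather than the alias alone.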

hvgdfx commented 1 year ago

This is an automatic reply from QQ Mail. Hello, your message has been received; I will reply as soon as possible.