dataabc / weibo-search

Fetches Weibo search results; the search can be either a Weibo keyword search or a Weibo topic search.

Emoji cannot be encoded in gbk #440

Open Thisisnotgoingpublished opened 9 months ago

Thisisnotgoingpublished commented 9 months ago

Running the spider fails with: UnicodeEncodeError: 'gbk' codec can't encode character '\U0001f525' in position 400: illegal multibyte sequence

This is a Unicode encoding problem hit while handling the text: the 'gbk' codec has no representation for the character. '\U0001f525' is the 🔥 emoji.
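
The failure can be reproduced outside the spider. The following is a minimal sketch using only the standard library; `can_encode` is a hypothetical helper for illustration and is not part of weibo-search. It suggests that the string itself is fine and only the target codec matters.

```python
FIRE = "\U0001f525"  # the character named in the error message


def can_encode(text, encoding):
    """Return True if every character of text has a representation in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode(FIRE, "gbk"))    # gbk has no mapping for this emoji
print(can_encode(FIRE, "utf-8"))  # UTF-8 can represent any Unicode character
```

Ordinary Chinese text such as "微博" passes the gbk check, which is presumably why the error only shows up on some posts.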

Thisisnotgoingpublished commented 9 months ago

Install the emoji package first:

pip install emoji

Then add import emoji at the top of .\weibo\spiders\search.py.

Then comment out the second-to-last line, print(weibo), and replace it with the following:

text_to_demj = weibo.get('text', '')
clean_text = emoji.demojize(text_to_demj)
print(clean_text)
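
For comparison, here is a standard-library sketch of a similar idea. It swaps the emoji package for Python's built-in `namereplace` error handler, so it is a different technique from the one in the comment above; `to_gbk_safe` is a hypothetical name. The handler applies to any character gbk cannot represent, not only to emoji.

```python
def to_gbk_safe(text):
    """Replace characters that gbk cannot encode with \\N{NAME} escapes."""
    # Encode with the namereplace handler, then decode back to str so the
    # result contains only characters that gbk is able to represent.
    return text.encode("gbk", errors="namereplace").decode("gbk")


clean_text = to_gbk_safe("火\U0001f525")
print(clean_text)
```

Unlike `emoji.demojize`, this produces `\N{FIRE}` rather than `:fire:`, so the output format differs from what the emoji package would give.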

Thisisnotgoingpublished commented 9 months ago

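Alternatively... I don't know how to write this so that it produces the original output. I can't program. I hope the author takes a look. As a beginner, I think all text should be treated consistently as either UTF-8 or GBK; the current half-and-half handling isn't good.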
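
One way to treat output consistently as UTF-8 is to reconfigure the output stream itself. The sketch below assumes Python 3.7 or later (where text streams have `reconfigure`); `use_utf8` is a hypothetical helper, and the in-memory stream stands in for a console whose default encoding is gbk.

```python
import io


def use_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


# Stand-in for a gbk console: a text wrapper around an in-memory byte buffer
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="gbk")

use_utf8(stream)
stream.write("\U0001f525")  # would raise UnicodeEncodeError under gbk
stream.flush()
print(buffer.getvalue() == "\U0001f525".encode("utf-8"))
```

In the spider the call would be `use_utf8(sys.stdout)` before any printing. Setting the `PYTHONIOENCODING` environment variable to `utf-8` is a documented alternative; how the shell then re-decodes redirected output is a separate question not covered here.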

Thisisnotgoingpublished commented 9 months ago

That didn't work. I'm out of ideas; it still raises errors.

PS H:\weibo-search-master\weibo> scrapy crawl search -s JOBDIR=crawls/search >> ./a.txt
2023-12-13 11:27:43 [scrapy.core.scraper] ERROR: Spider error processing <GET https://s.weibo.com/weibo?q=<redacted>&typeall=1&suball=1&timescope=custom:2023-12-12-0:2023-12-13-0&page=1> (referer: https://s.weibo.com/weibo?q=<redacted>&typeall=1&suball=1&timescope=custom:2023-12-11-0:2023-12-14-0)
Traceback (most recent call last):
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\python.py", line 350, in __next__
    return next(self.data)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\python.py", line 350, in __next__
    return next(self.data)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "H:\weibo-search-master\weibo\spiders\search.py", line 154, in parse_by_day
    for weibo in self.parse_weibo(response):
  File "H:\weibo-search-master\weibo\spiders\search.py", line 542, in parse_weibo
    print(clean_text)
UnicodeEncodeError: 'gbk' codec can't encode character '\ufffc' in position 58: illegal multibyte sequence
2023-12-13 11:27:54 [scrapy.core.scraper] ERROR: Spider error processing <GET https://s.weibo.com/weibo?q=<redacted>&typeall=1&suball=1&timescope=custom:2023-12-11-0:2023-12-12-0&page=1> (referer: https://s.weibo.com/weibo?q=<redacted>&typeall=1&suball=1&timescope=custom:2023-12-11-0:2023-12-14-0)
Traceback (most recent call last):
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\python.py", line 350, in __next__
    return next(self.data)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\python.py", line 350, in __next__
    return next(self.data)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "H:\weibo-search-master\weibo\spiders\search.py", line 154, in parse_by_day
    for weibo in self.parse_weibo(response):
  File "H:\weibo-search-master\weibo\spiders\search.py", line 542, in parse_weibo
    print(clean_text)
UnicodeEncodeError: 'gbk' codec can't encode character '\ue662' in position 11: illegal multibyte sequence
2023-12-13 11:29:44 [scrapy.core.scraper] ERROR: Spider error processing <GET https://s.weibo.com/weibo?q=<redacted>&typeall=1&suball=1&timescope=custom:2023-12-11-0:2023-12-14-0> (referer: None)
Traceback (most recent call last):
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\python.py", line 350, in __next__
    return next(self.data)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\utils\python.py", line 350, in __next__
    return next(self.data)
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 352, in <genexpr>
    return (self._set_referer(r, response) for r in result or ())
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 27, in <genexpr>
    return (r for r in result or () if self._filter(r, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 31, in <genexpr>
    return (r for r in result or () if self._filter(r, response, spider))
  File "C:\Users\<redacted>\AppData\Local\Programs\Python\Python39\lib\site-packages\scrapy\core\spidermw.py", line 106, in process_sync
    for r in iterable:
  File "H:\weibo-search-master\weibo\spiders\search.py", line 110, in parse
    for weibo in self.parse_weibo(response):
  File "H:\weibo-search-master\weibo\spiders\search.py", line 542, in parse_weibo
    print(clean_text)
UnicodeEncodeError: 'gbk' codec can't encode character '\xb9' in position 35: illegal multibyte sequence

Thisisnotgoingpublished commented 9 months ago

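Forget all the code I wrote earlier. Simply comment out the second-to-last line and put the following below it. This skips every encoding error, so everything is fine for now, until the author comes back online and fixes it properly.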

                # print(weibo)
                try:
                    print(str(weibo))
                except UnicodeEncodeError as e:
                    # The output encoding (gbk) cannot represent a character; report it and continue
                    print("Error occurred while encoding:", e)
                yield {'weibo': weibo, 'keyword': keyword}
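
The characters in the log above ('\ufffc', '\ue662', '\xb9') are the object replacement character, a private-use code point, and superscript one. None of them is an emoji, which would explain why the demojize attempt still failed. The try/except workaround avoids the crash but drops the printed record. A minimal sketch that keeps the record by escaping whatever the stream cannot encode follows; `safe_print` is a hypothetical helper, not part of weibo-search, and the in-memory stream stands in for a gbk console.

```python
import io
import sys


def safe_print(obj, stream=None):
    """Print obj, escaping characters the stream's encoding cannot represent."""
    stream = stream if stream is not None else sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    text = str(obj)
    # backslashreplace turns an unencodable character into an escape such as \ufffc
    safe = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(safe, file=stream)


# Demonstration against an in-memory stream that encodes as gbk
raw = io.BytesIO()
gbk_stream = io.TextIOWrapper(raw, encoding="gbk")
safe_print("a\ufffcb\xb9c\ue662", gbk_stream)
gbk_stream.flush()
```

In search.py this would take the place of `print(str(weibo))` inside the loop; the `yield` line is unaffected, so the stored data keeps the original characters.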

dataabc commented 9 months ago

Thank you for the helpful feedback. I'm not able to debug this right now; I'll look into it when I have time. Thanks.