support for Chinese/different language/non-default encoding?

alirezamika / autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

MIT License

6.24k stars 654 forks

support for Chinese/different language/non-default encoding? #46

Closed amsteel closed 3 years ago

amsteel commented 3 years ago

build() returns an empty result for this URL:

https://www.ptwxz.com/html/11/11014/

from autoscraper import AutoScraper

scraper = AutoScraper()

url = 'https://www.ptwxz.com/html/11/11014/'
wanted_list = ['第一章']
scraper.build(url, wanted_list)
print(scraper.stack_list)

url1 = 'https://www.ptwxz.com/html/9/9108/'
print(scraper.get_result_similar(url1))

alirezamika commented 3 years ago

from autoscraper import AutoScraper
import requests

scraper = AutoScraper()

url = 'https://www.ptwxz.com/html/11/11014/'
wanted_list = ['第一章 牢狱之灾']
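r = requests.get(url)
# Use the encoding detected from the response body instead of the one assumed from the HTTP headers
r.encoding = r.apparent_encoding
# Pass the decoded HTML to build() so the Chinese text in wanted_list can be matched
result = scraper.build(url, html=r.text, wanted_list=wanted_list)
print(result)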
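
For anyone wondering why the original call came back empty, here is a minimal offline sketch of the likely cause. It assumes the page body is GBK-encoded and that the response header carries no charset, in which case requests falls back to ISO-8859-1 for text content. Neither assumption is confirmed in this thread, and the variable names are only illustrative.

```python
# Offline illustration: no network access, the bytes are built locally.
title = '第一章 牢狱之灾'

# Assumption: the server sends the page as GBK-encoded bytes.
raw = title.encode('gbk')

# Assumption: no charset in the Content-Type header, so the body is
# decoded as ISO-8859-1. This never raises, it just yields mojibake.
wrong = raw.decode('iso-8859-1')

# Decoding with the real encoding (what r.apparent_encoding is meant
# to detect) recovers the original text.
right = raw.decode('gbk')

print(title in wrong)  # False: the wanted text cannot be found
print(title in right)  # True
```

With the wrong decoding, the string in wanted_list never appears in the HTML, so there is nothing for build() to learn from. Setting r.encoding before reading r.text avoids that.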

amsteel commented 3 years ago

Thank you. It works, and apparently it was just an encoding issue. I just found a similar one among the closed issues. It may be worth adding something about this to the docs/wiki.

alirezamika commented 3 years ago

Fixed in v1.1.12