disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

Do we really need selenium to read kknews and read01? #82

Open pm5 opened 4 years ago

pm5 commented 4 years ago

I recently found that kknews.cc and read01.com articles can be read with text-mode browsers such as w3m (or lynx, but curiously not elinks), which suggests we probably don't need to run JavaScript to scrape their contents:

$ w3m https://read01.com/RMG5MJJ.html#.Xk1Ym1XLeo8
$ w3m https://kknews.cc/
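The same check can be sketched in Python using only the standard library: fetch the raw HTML without running any JavaScript and see whether the article text is already in it. This is a rough illustration, not ZeroScraper's code; the names `visible_text` and `fetch_html` and the `User-Agent` value are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a JavaScript-free client could see, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url, timeout=10):
    """Fetch a page as text. No JavaScript is executed.

    The User-Agent value is an arbitrary assumption, not something the
    sites are known to require.
    """
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `visible_text(fetch_html(url))` on an article URL and checking whether the article body appears in the result would give roughly the same signal as opening the page in w3m.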

Was it because of proxies or some other problem that we needed selenium to scrape these two sites in the first place?

toutiao.com remains JavaScript-only, so personally I think this issue is lower priority, since we can't get rid of selenium altogether anyway.

andreawwenyi commented 4 years ago

It was because the majority of kknews.cc and read01.com articles would first 302 to a "loading page" before reaching the actual content page, so a lot of the HTML files we collected are just the loading page.
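One way to handle this without a browser might be to detect the loading page and retry. The sketch below is only an illustration and is not how ZeroScraper works. It assumes that a later request returns the real content, which this thread does not confirm, and that a loading page can be recognised by being short or by containing a known marker string. The names `is_probable_loading_page` and `fetch_until_content`, the `min_length` threshold, and the `markers` parameter are all hypothetical.

```python
import time


def is_probable_loading_page(html, min_length=2000, markers=()):
    """Heuristic guess that `html` is a placeholder rather than an article.

    min_length and markers are hypothetical and would need tuning against
    real loading pages collected from each site.
    """
    return len(html) < min_length or any(marker in html for marker in markers)


def fetch_until_content(fetch, url, is_placeholder, attempts=3, delay=1.0,
                        sleep=time.sleep):
    """Call fetch(url) until the result is not a placeholder page.

    Returns the first non-placeholder HTML, or None if every attempt
    returned a placeholder. `fetch` is any callable mapping a URL to HTML.
    """
    for attempt in range(attempts):
        html = fetch(url)
        if not is_placeholder(html):
            return html
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt
    return None
```

Passing the fetch function in as an argument keeps the retry logic independent of the HTTP client, so it could wrap a plain HTTP fetch first and fall back to selenium only for URLs where it returns None.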

pm5 commented 4 years ago

I see. Maybe we can look into that again when we have some time.