disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

appledaily needs new mechanism #111

Closed andreawwenyi closed 4 years ago

andreawwenyi commented 4 years ago

Appledaily has changed how its website works. The list of articles is now loaded dynamically, so we cannot get any new articles because the page gets stuck like this:

[Screenshot: Screen Shot 2020-04-24 at 5 10 39 PM]

I have tried using selenium, but it makes the current login fail. So we need to look into whether there is a better way to resolve this problem.

One possible approach is to use selenium to collect article URLs, and then use another spider to log in and grab the content of the articles.

Or we might be able to use their API. An example endpoint is https://tw.appledaily.com/pf/api/v3/content/fetch/collections?query={"id":"xxx","website":"tw-appledaily"}&d=70&_website=tw-appledaily. The "id" parameter is required to use this API, so we need to figure out how to get that id.
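For reference, a minimal sketch of how such a request URL could be assembled in Python. The helper name build_collections_url is hypothetical, the parameter values are copied from the example URL above, and "xxx" is still a placeholder because we do not yet know how to obtain a real id:

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint copied from the example URL above.
API_BASE = "https://tw.appledaily.com/pf/api/v3/content/fetch/collections"


def build_collections_url(collection_id, website="tw-appledaily", d="70"):
    """Hypothetical helper: build the collections API URL for one id.

    The meaning of "d" is unknown; 70 is simply the value seen in the example.
    """
    # The API takes a compact JSON document in the "query" parameter.
    query = json.dumps({"id": collection_id, "website": website},
                       separators=(",", ":"))
    params = {"query": query, "d": d, "_website": website}
    # urlencode percent-encodes the JSON so the URL is safe to request.
    return f"{API_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_collections_url("xxx")  # "xxx" is a placeholder id
    print(url)
    # Round-trip check: the query parameter decodes back to the same JSON.
    print(json.loads(parse_qs(urlsplit(url).query)["query"][0]))
```

This only builds the URL; whether the endpoint returns article lists without logging in still has to be checked.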

andreawwenyi commented 4 years ago

Solution (in PR #112): we use BasicDiscoverSpider with selenium for discover, and make no changes to update. We adjusted the snapshot dates as follows:

  1. After discover, store the snapshot but set next_snapshot_at to the current time, because we do not log in during discover and so the first snapshot most likely would not contain any content.
  2. Since the first snapshot would be invalid, we adjust the total snapshot count to 5, where the first 4 are daily snapshots and the last one is taken 4 days after the fourth. This way we have the same number of valid snapshots as for other articles.
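
The schedule above can be sketched roughly as follows. This is not the code from the PR: the function signature and the constant TOTAL_SNAPSHOTS are hypothetical, and it assumes one reading of the rule, namely that the re-snapshot right after discover is due immediately, the following ones are one day apart, and the fifth comes 4 days after the fourth:

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: 5 snapshots in total for appledaily, the first one being
# the content-less snapshot stored by discover.
TOTAL_SNAPSHOTS = 5


def next_snapshot_at(snapshot_count: int, taken_at: datetime) -> Optional[datetime]:
    """Return when the next snapshot is due, or None if none is left.

    snapshot_count is the number of snapshots stored so far, including
    the one just taken at taken_at.
    """
    if snapshot_count >= TOTAL_SNAPSHOTS:
        return None  # all snapshots taken
    if snapshot_count == 1:
        # Discover did not log in, so re-snapshot right away via update.
        return taken_at
    if snapshot_count == TOTAL_SNAPSHOTS - 1:
        # The last snapshot is taken 4 days after the fourth.
        return taken_at + timedelta(days=4)
    return taken_at + timedelta(days=1)  # daily snapshots in between


if __name__ == "__main__":
    taken = datetime(2020, 4, 24, 17, 0)
    count = 1
    while taken is not None:
        print(count, taken)
        taken = next_snapshot_at(count, taken)
        count += 1
```

Under these assumptions an article discovered at time T gets snapshots at T (discover, likely empty), T, T+1 day, T+2 days and T+6 days, so four of them should contain content.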