disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

Error doing update on Dcard #83

Open pm5 opened 4 years ago

pm5 commented 4 years ago

Found an error in cronjob logs:

Traceback (most recent call last):
  File "./ns.py", line 124, in <module>
    main(args)
  File "./ns.py", line 96, in main
    update(args)
  File "./ns.py", line 80, in update
    newsSpiders.runner.update.run(runner, site["site_id"])
  File "/srv/web/newsSpiders/runner/update.py", line 67, in run
    site_id=site_id, current_time=current_time
  File "/srv/web/newsSpiders/runner/update.py", line 31, in get_posts_to_update
    for post in posts
  File "/srv/web/newsSpiders/runner/update.py", line 31, in <listcomp>
    for post in posts
  File "/srv/web/newsSpiders/runner/update.py", line 18, in get_last_comment_floor
    )["raw_data"]
TypeError: 'NoneType' object is not subscriptable

Looks like a problem getting the latest floor of Dcard comments. This happens while the spiders are being created, so an uncaught exception here aborts updates for all sites.
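One possible way to keep a single missing snapshot from aborting the whole run is sketched below. It is only an illustration, not the project's actual code: the `fetch_latest_snapshot` callable, the `article_id` key, and the assumed shape of `raw_data` (a dict with a `comments` list whose items carry a `floor`) are all hypothetical. Only the names `get_last_comment_floor`, `get_posts_to_update`, and `raw_data` come from the traceback.

```python
import logging
from typing import Callable, Iterable, List, Optional

logger = logging.getLogger(__name__)


def get_last_comment_floor(snapshot: Optional[dict]) -> Optional[int]:
    """Return the highest comment floor in a snapshot, or None if unavailable.

    Assumption: snapshot["raw_data"] is a dict holding a "comments" list,
    and each comment has an integer "floor". The real schema may differ.
    """
    if snapshot is None:
        # This is the case that raised the TypeError in the traceback.
        return None
    comments = (snapshot.get("raw_data") or {}).get("comments") or []
    floors = [c["floor"] for c in comments if "floor" in c]
    return max(floors) if floors else None


def get_posts_to_update(
    posts: Iterable[dict],
    fetch_latest_snapshot: Callable[[int], Optional[dict]],  # hypothetical lookup
) -> List[dict]:
    """Attach the last known comment floor to each post, tolerating gaps."""
    result = []
    for post in posts:
        snapshot = fetch_latest_snapshot(post["article_id"])
        if snapshot is None:
            # Log and continue instead of letting one post stop every site.
            logger.warning("No snapshot found for article %s", post["article_id"])
        result.append({**post, "last_comment_floor": get_last_comment_floor(snapshot)})
    return result
```

With this shape, a post whose snapshot is gone ends up with `last_comment_floor` set to `None`, and the caller can decide whether to re-crawl it from the first floor or skip it.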

andreawwenyi commented 4 years ago

@pm5 looks like some of our Dcard snapshots have disappeared...

andreawwenyi commented 4 years ago

According to the Article table, the last successful update was at Feb 28 13:51 TW time (Unix time 1582869089).
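For reference, the Unix timestamp can be checked against Taiwan time with a few lines of Python. This is a small sketch using a fixed UTC+8 offset (Taiwan does not observe daylight saving time); the function name `to_tw_time` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset used for Taiwan time.
TW = timezone(timedelta(hours=8))


def to_tw_time(unix_seconds: int) -> datetime:
    """Convert a Unix timestamp (seconds) to a timezone-aware Taiwan datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=TW)


print(to_tw_time(1582869089).isoformat())  # 2020-02-28T13:51:29+08:00
```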

SQL command: SELECT * FROM Article WHERE next_snapshot_at != 0 AND snapshot_count != 1 ORDER BY last_snapshot_at DESC LIMIT 1
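The query above can be tried out against a throwaway table, as in the sketch below. It uses an in-memory SQLite database purely for illustration; the production database engine, the `article_id` column, and the sample rows are assumptions, and only the three columns named in the query come from this thread.

```python
import sqlite3

QUERY = """
SELECT * FROM Article
WHERE next_snapshot_at != 0 AND snapshot_count != 1
ORDER BY last_snapshot_at DESC
LIMIT 1
"""


def last_updated_article(conn: sqlite3.Connection):
    """Return the most recently re-snapshotted article that is still scheduled."""
    conn.row_factory = sqlite3.Row  # allow access to columns by name
    return conn.execute(QUERY).fetchone()


# Hypothetical minimal schema and made-up rows, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Article ("
    "article_id INTEGER PRIMARY KEY, snapshot_count INTEGER, "
    "last_snapshot_at INTEGER, next_snapshot_at INTEGER)"
)
conn.executemany(
    "INSERT INTO Article VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1582900000, 1582986400),  # only the first snapshot: excluded
        (2, 3, 1582869089, 1582955489),  # updated and still scheduled
        (3, 2, 1582000000, 0),           # no further snapshot scheduled: excluded
        (4, 2, 1582800000, 1582886400),  # updated, but earlier than article 2
    ],
)

row = last_updated_article(conn)
print(row["article_id"], row["last_snapshot_at"])
```

The `snapshot_count != 1` filter excludes articles that have only been discovered, so the row returned reflects the last time the update step actually ran rather than the last discovery.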

I think we need to do a full review to see what happened and how many snapshots were lost.

pm5 commented 4 years ago

Note: we have a backup up to 2/17. Working on data recovery for 2/17-2/28.