IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo Group's email archives, photo galleries and file contents using the non-public API
MIT License

Infinite loop on 404 error #107

Closed: minimeh closed this issue 4 years ago

minimeh commented 4 years ago

Python 2.7.17 on Windows 10 Pro version 1903. Using current head:

git show
commit bcfb7a1cc4bbd59bcf9045b83210938b10e43fba (HEAD -> master, origin/master, origin/HEAD)
Author: Thomas Wood <grand.edgemaster@gmail.com>
Date: Wed Nov 6 01:37:50 2019 +0000

After 2000+ iterations of the following two lines where only the timestamp changed:

`2019-11-08 14:28:45.756 Pacific Standard Time ERROR YahooGroupsAPI Unknown 404 error for https://xa.yimg.com/kq/groups/QQ.u78DtedNG.KgO_Q--/or/POxuTHLvc9LYh9RyF2w-/name/open.php, giving up on this download

2019-11-08 14:28:45.756 Pacific Standard Time ERROR process_single_attachment ERROR downloading 'https://xa.yimg.com/kq/groups/QQ.u78DtedNG.KgO_Q--/or/POxuTHLvc9LYh9RyF2w-/name/open.php' variant or: 404 Client Error: Not Found for url: https://xa.yimg.com/kq/groups/QQ.u78DtedNG.KgO_Q--/or/POxuTHLvc9LYh9RyF2w-/name/open.php `

I finally noticed the infinite loop. I pressed Ctrl+C to break, and got the following:

`Traceback (most recent call last):
  File "[redacted]\yahoo-group-archiver\yahoo.py", line 709, in <module>
    archive_email(yga, message_subset=args.ids, start=args.start, stop=args.stop)
  File "[redacted]\yahoo-group-archiver\yahoo.py", line 152, in archive_email
    archive_message_content(yga, id, status)
  File "[redacted]\yahoo-group-archiver\yahoo.py", line 112, in archive_message_content
    process_single_attachment(yga, html_json['attachmentsInfo'])
  File "[redacted]\yahoo-group-archiver\yahoo.py", line 188, in process_single_attachment
    yga.download_file(photoinfo['displayURL'], f=f)
  File "[redacted]\yahoo-group-archiver\yahoogroupsapi.py", line 116, in download_file
    time.sleep(self.min_delay)
KeyboardInterrupt`

It seems the download is never actually given up on, even though the log says "giving up on this download" and it clearly should be.

foghawk commented 4 years ago

I also have this error on latest master (for the first attachment to this message).

https://github.com/IgnoredAmbience/yahoo-group-archiver/blob/bcfb7a1cc4bbd59bcf9045b83210938b10e43fba/yahoo.py#L190-L198

What's the rationale for commenting out the exclude here? (Or for continuing at the end of the for-loop, for that matter?) It wasn't obvious to me from the git history.

It seems worth factoring out this photoinfo pattern. archive_about has something similar that likewise never actually touches exclude (hence #104); archive_photos continues through the photos list instead of spinning or crashing, but AFAICT doesn't fall back to smaller sizes if the original 404s.
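For illustration, here is a rough sketch of what a factored-out helper could look like. It is not the project's actual code: `DownloadFailed`, `pick_best`, `download_with_fallback`, and the `variant` and `size` keys are hypothetical names made up for this example (only `displayURL` appears in the traceback above), and `fetch` stands in for whatever performs the download. The point is that each failed variant is added to `exclude`, so the loop either falls back to a smaller size or terminates once nothing is left.

```python
class DownloadFailed(Exception):
    """Hypothetical stand-in for the error raised when a download 404s."""


def pick_best(photo_infos, exclude):
    """Return the largest variant not yet excluded, or None if none remain."""
    candidates = [p for p in photo_infos if p["variant"] not in exclude]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p["size"])


def download_with_fallback(photo_infos, fetch):
    """Try variants from largest to smallest.

    Returns the name of the variant that downloaded, or None if all failed.
    """
    exclude = set()
    while True:
        info = pick_best(photo_infos, exclude)
        if info is None:
            # Every variant has failed: give up instead of spinning forever.
            return None
        try:
            fetch(info["displayURL"])
        except DownloadFailed:
            # Record the failure so this variant is never picked again.
            # Without this line the same URL would be retried indefinitely.
            exclude.add(info["variant"])
            continue
        return info["variant"]
```

Because `exclude` grows by one entry on every failure, the loop runs at most once per variant, so a 404 on the original costs one request rather than thousands.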

(ping #25)

IgnoredAmbience commented 4 years ago

I believe this should be fixed in master. Apologies.