IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars 45 forks

Timeout for /messages requests with large numbers of returned messages #15

Closed dossy closed 5 years ago

dossy commented 5 years ago

After getting through the login issues thanks to #2's guidance to use the new (undocumented) -ct and -cy parameters, the script now dies with the following error:

Traceback (most recent call last):
  File "yahoo.py", line 192, in <module>
    archive_email(yga, reattach=(not args.no_reattach), save=(not args.no_save))
  File "yahoo.py", line 29, in archive_email
    msg_json = yga.messages(count=count)
  File "/data2/yahoo-group-archiver/yahoogroupsapi.py", line 79, in get_json
    r = self.s.get(uri, params=opts, allow_redirects=False, timeout=10)
  File "/data2/yahoo-group-archiver/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/data2/yahoo-group-archiver/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/data2/yahoo-group-archiver/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/data2/yahoo-group-archiver/venv/local/lib/python2.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='groups.yahoo.com', port=443): Read timed out. (read timeout=10)

With a normal logged-in browser session, I can access the JSON API endpoint and get a response.

https://groups.yahoo.com/api/v1/groups/extremeprogramming/messages

But, the script fails with the error above.

I tried extending the timeout to 120 seconds, in case the request was just taking longer than expected, but it still eventually times out.

Suggestions?

cas206 commented 5 years ago

In the "About" tab, "Group Settings", is it a Public or Restricted group? I'm able to get this script to download from Public groups. I get the same Time Out message when trying to copy a Restricted group.

dossy commented 5 years ago

@cas206 public group.

Group Settings

  • This is a public group.
  • Attachments are not permitted.
  • Members cannot hide email address.
  • Listed in Yahoo Groups directory.
  • Membership does not require approval.
  • Messages from new members require approval.
  • All members can post messages.

dossy commented 5 years ago

@cas206 oh, interesting - the group info says "This is a public group." but when I actually go into the group's settings, I see:

[image: screenshot of the group's settings]

The difference between "Public" and "Custom" with this group's settings is "Non Members can post messages" is unchecked in our custom settings. Could this really be the reason why it's failing?

I'm going to temporarily set the group type to actual "Public" and see if the archiver succeeds.

cas206 commented 5 years ago

Ignore my comment. I had two Public groups work and 3 Restricted fail with Time out error. However, the fourth one I attempted was Restricted and it's downloading.

dossy commented 5 years ago

@cas206 I set the group to plain "Public" and get the timeout, still.

How old are the groups you're archiving? The one I'm working on was founded Dec 31, 1999. I wonder if this has something to do with it ...

dossy commented 5 years ago

I added some debugging pprints, and the URL it's timing out on is: https://groups.yahoo.com/api/v1/groups/extremeprogramming/messages?count=160472

I'm guessing that's so far back in the past that it's trying to access data that's no longer available ...

Kuipo commented 5 years ago

I'm also getting this error. The group is very old and may also be running into old content that's no longer available. @dossy I'm not sure if there's a solution to this.

dossy commented 5 years ago

I implemented pagination in the script, fetching 1,000 messages at a time ... the script is now running. If this works to pull all 160,472 messages out of my group, I'll submit a PR.
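A minimal sketch of the chunking idea described here, not the code from the branch. The thread does not show how the endpoint accepts an offset, so `fetch_page` is a hypothetical callable standing in for the real request, and `page_ranges` and `fetch_all` are illustrative names:

```python
def page_ranges(total, page_size=1000):
    """Yield (offset, count) pairs that cover `total` items in chunks."""
    for offset in range(0, total, page_size):
        # The last chunk may be shorter than page_size.
        yield offset, min(page_size, total - offset)


def fetch_all(fetch_page, total, page_size=1000):
    """Collect every item by calling fetch_page(offset, count) once per chunk.

    fetch_page is a hypothetical stand-in for the real API request;
    it must return a list of items for the requested chunk.
    """
    items = []
    for offset, count in page_ranges(total, page_size):
        items.extend(fetch_page(offset, count))
    return items
```

With 160,472 messages and a page size of 1,000, `page_ranges` yields 161 chunks, the last one covering 472 messages, so no single request has to return the whole archive.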

cas206 commented 5 years ago

The groups that don't work were founded in 1998 (46309 messages) and 1999 (88664 messages). The ones that worked were founded in 2000 (26351 messages) and 2001 (17564 messages).

Another that works though is 1999 (9280 messages).

dossy commented 5 years ago

@cas206 I pushed up my add-pagination branch to my fork, if you want to give it a try on the larger groups.

https://github.com/dossy/yahoo-group-archiver/tree/add-pagination

cas206 commented 5 years ago

Working now. Good work.

Kuipo commented 5 years ago

@dossy Excellent work. Mine is now working on all 19,000+ posts. This... may take a while. Thank you!

cmilanf commented 5 years ago

Tried your fork (thank you!) and I was able to download ~12000 messages. Afterwards it gave this error:

Traceback (most recent call last):
  File "./yahoo.py", line 200, in <module>
    archive_email(yga, reattach=(not args.no_reattach), save=(not args.no_save))
  File "./yahoo.py", line 44, in archive_email
    raw_json = yga.messages(id, 'raw')
  File "/YahooMigration2Google/yahoo-group-archiver/yahoogroupsapi.py", line 74, in get_json
    r = self.s.get(uri, params=opts, allow_redirects=False, timeout=10)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 533, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 630, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 521, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='groups.yahoo.com', port=443): Read timed out. (read timeout=10)

dossy commented 5 years ago

@cmilanf Try pulling from my branch again, I just added commit dossy@576e23ec that adds skipping of existing files, and setting arbitrary --start and --stop message IDs for fetching specific ranges. I use the --start to skip over messages that Y! can't fetch.
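A rough sketch of how skipping existing files and a message ID range could fit together. This is not the code from the commit: the `--start` and `--stop` flag names come from the comment above, while their defaults, the inclusive range, the `<id>.eml` file naming, and the `ids_to_fetch` and `parse_args` helpers are assumptions made for illustration:

```python
import argparse
import os


def ids_to_fetch(start, stop, out_dir):
    """Yield message IDs in [start, stop] that have no output file yet."""
    for msg_id in range(start, stop + 1):
        # Assumed naming scheme: one "<id>.eml" file per message.
        path = os.path.join(out_dir, "%d.eml" % msg_id)
        if not os.path.exists(path):
            yield msg_id


def parse_args(argv=None):
    """Parse the range flags; defaults here are illustrative assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--start", type=int, default=1,
                        help="first message ID to fetch")
    parser.add_argument("--stop", type=int, default=None,
                        help="last message ID to fetch")
    return parser.parse_args(argv)
```

Re-running after a failure then only requests the IDs that are still missing, and raising `--start` steps past an ID that keeps timing out.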

cmilanf commented 5 years ago

> @cmilanf Try pulling from my branch again, I just added commit dossy/yahoo-group-archiver@576e23e that adds skipping of existing files, and setting arbitrary --start and --stop message IDs for fetching specific ranges. I use the --start to skip over messages that Y! can't fetch.

Great! After pulling your last commit I was able to continue fetching right where the error left off. Now continuing :)

marczellm commented 5 years ago

@IgnoredAmbience @dossy could you please pull (request) the large group support back to the main repo?

currawong1 commented 5 years ago

I get an error with a large group (200k+ messages) as below. I tried dossy's fork, but it outputs the email as .eml files, when I was looking for JSON output to convert to other formats.

Traceback (most recent call last):
  File "./yahoo.py", line 634, in <module>
    archive_email(yga, save=(not args.no_save), html=args.html)
  File "./yahoo.py", line 58, in archive_email
    msg_json = yga.messages(count=count)
  File "yahoogroupsapi.py", line 99, in get_json
    r = self.s.get(uri, params=opts, allow_redirects=False, timeout=15)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='groups.yahoo.com', port=443): Read timed out. (read timeout=15)

IgnoredAmbience commented 5 years ago

I intend to fix this tomorrow; I ran out of time today, I'm afraid.