IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License

Error/ Crash on Download of Calendar #115

Closed DiagonalArg closed 4 years ago

DiagonalArg commented 4 years ago

In the issue I reported just before this one, I noted that one group download failed at the end with an error. The error is not in the log because I redirect only standard output, not standard error. Here is what was in my terminal at the end:

2019-11-20 00:51:28.886 PST INFO archive_photos Fetching photo 'Franken-throid' (1/1)
2019-11-20 00:51:30.204 PST ERROR archive_db Couldn't access Database functionality for this group
2019-11-20 00:51:31.125 PST INFO archive_links Written 47 links from  folder
2019-11-20 00:51:32.532 PST ERROR YahooGroupsAPI Unknown 401 error for https://calendar.yahoo.com/ws/v3/users/@@groups_84ae9fc5-426f-426d-b9e8-457cee681320/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy, giving up on this download
Traceback (most recent call last):
  File "yahoo.py", line 373, in archive_calendar
    yga.download_file(tmpUri)  # We expect a 403 or 401  here
  File "/home/user/Data/YGA-IA/yahoogroupsapi.py", line 134, in download_file
    r.raise_for_status()
  File "/home/user/.local/lib/python3.5/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://calendar.yahoo.com/ws/v3/users/@@groups_84ae9fc5-426f-426d-b9e8-457cee681320/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "yahoo.py", line 724, in <module>
    archive_calendar(yga)
  File "yahoo.py", line 378, in archive_calendar
    tmpJson = json.loads(e.response.content)['calendarError']
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
./dly.byid.sh: line 20: ut: command not found

As a result it failed to get polls or something (whatever comes after the calendar). So I suggest a gentler failure might be in order.

(I wish I could help. I don't do python.)

Lotus907efi commented 4 years ago

I get this same error when I try to run this utility.

2019-11-20 21:42:17.561 EST INFO urllib3.connectionpool Starting new HTTPS connection (1): groups.yahoo.com
2019-11-20 21:42:17.796 EST ERROR archive_email Couldn't access Messages functionality for this group
2019-11-20 21:42:18.050 EST ERROR archive_files Couldn't access Files functionality for this group
2019-11-20 21:42:18.344 EST ERROR archive_photos Couldn't access Photos functionality for this group
2019-11-20 21:42:18.597 EST ERROR archive_db Couldn't access Database functionality for this group
2019-11-20 21:42:18.851 EST ERROR archive_links Couldn't access Links functionality for this group
2019-11-20 21:42:19.301 EST INFO urllib3.connectionpool Starting new HTTPS connection (1): calendar.yahoo.com
2019-11-20 21:42:19.553 EST ERROR YahooGroupsAPI Unknown 401 error for https://calendar.yahoo.com/ws/v3/users/@@groups_445a44bc-e71d-4fbe-a9b6-d3e8601a0a7d/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy, giving up on this download
Traceback (most recent call last):
  File "../../src/yahoo-group-archiver/yahoo.py", line 373, in archive_calendar
    yga.download_file(tmpUri) # We expect a 403 or 401 here
  File "/media/cdrom/sad_backup/src/yahoo-group-archiver/yahoogroupsapi.py", line 134, in download_file
    r.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 773, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../src/yahoo-group-archiver/yahoo.py", line 724, in <module>
    archive_calendar(yga)
  File "../../src/yahoo-group-archiver/yahoo.py", line 378, in archive_calendar
    tmpJson = json.loads(e.response.content)['calendarError']
  File "/usr/lib/python3.4/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'

lennier1 commented 4 years ago

Does this happen intermittently, or every time for these groups? If every time, are there any public groups where it happens?

I added a try/except to the json loading, which should keep it from crashing, but I'm not sure why they're getting an unexpected format there.
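
The final `TypeError` in the tracebacks above is consistent with the Python 3.4 and 3.5 interpreters visible in the file paths: before Python 3.6, `json.loads` accepts only `str`, while `response.content` from requests is `bytes`. A minimal sketch of a more tolerant parser follows, assuming the error body is UTF-8 encoded JSON. `parse_calendar_error` is a hypothetical helper for illustration, not a function from the project.

```python
import json


def parse_calendar_error(content):
    """Hypothetical helper: return the 'calendarError' object, or None.

    Accepts either bytes (as in requests' response.content) or str, so it
    behaves the same on Python 3.5 and on later versions.
    """
    if isinstance(content, (bytes, bytearray)):
        # Assumption: the body is UTF-8; undecodable bytes are replaced
        # rather than raising, so a malformed body cannot crash the run.
        content = content.decode("utf-8", errors="replace")
    try:
        data = json.loads(content)
    except ValueError:
        # Covers json.JSONDecodeError, e.g. an HTML error page.
        return None
    if not isinstance(data, dict):
        return None
    return data.get("calendarError")
```

Returning `None` instead of raising lets the caller skip the calendar and continue with whatever is archived next.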

Lotus907efi commented 4 years ago

It turns out that when you copy the cookies out of a web browser, ampersands can get turned into &amp; in the text of the cookie.

So, for instance, something like '12345-789&yrte' becomes '12345-789&amp;yrte'

and I believe this is causing authentication problems. So you might want to add something to the README.md file to warn others about this problem.
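
A small sketch of how an HTML-escaped cookie value could be repaired before it is passed to the script, using the standard library's `html.unescape`. `unescape_cookie` is a hypothetical name, and whether a given cookie was actually escaped is an assumption to check by eye first.

```python
import html


def unescape_cookie(value):
    """Hypothetical helper: undo HTML entity escaping in a copied cookie value."""
    # html.unescape turns entities such as "&amp;" back into "&".
    return html.unescape(value)


print(unescape_cookie("12345-789&amp;yrte"))
```

A cookie that legitimately contains text resembling an HTML entity would also be altered, so this is only safe when the value is known to have been escaped.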

DiagonalArg commented 4 years ago

Ok, @lennier1, I'm still getting an error using your script, and it happens every time I run the group. (I have no "&" appearing in my cookies.) The command is:

python3 yahoo.py -ct "$TCOOKIE" -cy "$YCOOKIE" -c "$listname"

The error:

/home/user/.local/lib/python3.5/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
Traceback (most recent call last):
  File "yahoo.py", line 907, in <module>
    with Mkchdir(args.group, sanitize=False):
  File "yahoo.py", line 772, in __enter__
    os.chdir(self.d)
FileNotFoundError: [Errno 2] No such file or directory: ''

It then just freezes at that point and I have to ^C out of the command. I don't know if there are other groups where this happens. This one is closed. (For right now, at least, I'm focusing on the half dozen groups that we need, some very large, so I don't have a broad enough spectrum to see how common it is.)
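
The empty path in `FileNotFoundError: ... ''` suggests that the group name reached the script as an empty string, for example if `$listname` was unset in the shell. That is an assumption, not something the traceback confirms. A sketch of an early check that would report this more clearly; `validate_group_name` is a hypothetical helper, not part of the project:

```python
def validate_group_name(name):
    """Hypothetical check: reject a missing or blank group name early."""
    if name is None or not name.strip():
        raise ValueError(
            "Group name is empty; check the variable passed on the command line"
        )
    return name.strip()
```

Calling this on the parsed group argument before any directory is created would turn the `os.chdir('')` failure into a message that points at the command line.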

DiagonalArg commented 4 years ago

I'm being admonished that I'm not providing enough info for the bug. I don't see that I can add anything more than that I'm running on Ubuntu 16.04 and that the group is closed and hard to get into. There are people there worried about being outed. With the link to the full output here, I'm not sure what else I can add.

If there is some debug switch or other code you want me to run, I'm willing to do it.

marked commented 4 years ago

Include the output of:

python3 -V
pip3 freeze

If we don't figure it out this round, we'll have to capture the output you're getting, but I think we're narrowing it down.

Sorry if I made you feel admonished.

DiagonalArg commented 4 years ago

@lennier1 just asked on IRC, and I let him know that the calendar does not have anything on it (at least that I can see).

@marked - NP. No offense! Here's the output:

$ python3 -V
Python 3.5.2

$ pip3 freeze
aiohttp==3.4.0
apparmor==2.10.95
appdirs==1.4.3
apturl==0.5.2
async-timeout==1.2.1
attrs==18.1.0
Babel==2.5.0
beautifulsoup4==4.4.1
blinker==1.4
Brlapi==0.6.4
cairocffi==0.7.2
CairoSVG==1.0.19
certifi==2019.9.11
cffi==1.5.2
chardet==3.0.3
chrome-gnome-shell==0.0.0
colout==0.5
command-not-found==0.3
cryptography==1.2.3
defer==1.0.6
djvubind==1.2.1
fail2ban==0.9.3
feedparser==5.1.3
Flask==0.10.1
html5lib==0.999
httplib2==0.9.1
idna==2.7
idna-ssl==1.1.0
itsdangerous==0.24
Jinja2==2.8
language-selector==0.1
LibAppArmor==2.10.95
louis==2.6.4
lxml==3.5.0
Mako==1.0.3
MarkupSafe==0.23
maxminddb==1.4.1
mock==2.0.0
multidict==4.3.1
natsort==5.3.3
notify2==0.3
oauthlib==1.0.3
onionshare==2.1
pbr==3.0.1
pexpect==4.0.1
Pillow==3.1.2
ply==3.7
ptyprocess==0.5
pyasn1==0.1.9
pycparser==2.14
pycrypto==2.6.1
pycups==1.9.73
pycurl==7.43.0
Pygments==2.2.0
pygobject==3.20.0
pyinotify==0.9.6
PyJWT==1.3.0
pyOpenSSL==0.15.1
PyPDF2==1.25.1
PySocks==1.5.0
python-apt==1.1.0b1+ubuntu0.16.4.5
python-debian==0.1.27
python-systemd==231
pytz==2017.2
pyxdg==0.26
reportlab==3.3.0
requests==2.22.0
sessioninstaller==0.0.0
setproctitle==1.1.10
six==1.10.0
speedtest-cli==2.1.2
ssh-import-id==5.5
sshuttle==0.78.1
stem==1.4.1
stig==0.10.1a0
system-service==0.3
typing-extensions==3.7.2
ubuntu-drivers-common==0.0.0
ufw==0.35
unattended-upgrades==0.1
unity-scope-calculator==0.1
unity-scope-chromiumbookmarks==0.1
unity-scope-colourlovers==0.1
unity-scope-devhelp==0.1
unity-scope-firefoxbookmarks==0.1
unity-scope-gdrive==0.7
unity-scope-manpages==0.1
unity-scope-openclipart==0.1
unity-scope-texdoc==0.1
unity-scope-tomboy==0.1
unity-scope-virtualbox==0.1
unity-scope-yelp==0.1
unity-scope-zotero==0.1
unity-tweak-tool==0.0.7
urllib3==1.25.7
urwid==1.3.1
urwidtrees==1.0.3.dev0
usb-creator==0.3.0
warcio==1.7.1
Werkzeug==0.10.4
xcffib==0.3.6
xdiagnose==3.8.4.1
xkit==0.0.0
yarl==1.2.6
zim==0.72.0

DiagonalArg commented 4 years ago

I got a calendar exception for another group. Unfortunately, this one is also closed. Here is the output from IgnoredAmbience's version. Not sure if it shows you anything new. I'll also run @lennier1's and add the output from that in just a bit ... [Now added below.]

2019-11-21 09:23:21.958 PST ERROR YahooGroupsAPI Unknown 401 error for https://calendar.yahoo.com/ws/v3/users/@@groups_6ed6200a-4c7e-49a1-bf28-4e663126461d/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy, giving up on this download
/home/dev/.local/lib/python3.5/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
Traceback (most recent call last):
  File "yahoo.py", line 373, in archive_calendar
    yga.download_file(tmpUri)  # We expect a 403 or 401  here
  File "/home/dev/Data/YGA-IA/yahoogroupsapi.py", line 134, in download_file
    r.raise_for_status()
  File "/home/dev/.local/lib/python3.5/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://calendar.yahoo.com/ws/v3/users/@@groups_6ed6200a-4c7e-49a1-bf28-4e663126461d/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "yahoo.py", line 724, in <module>
    archive_calendar(yga)
  File "yahoo.py", line 378, in archive_calendar
    tmpJson = json.loads(e.response.content)['calendarError']
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'

Ok, here is the output from @lennier1

2019-11-21 15:26:37.932 PST INFO archive_calendar Getting wssid. Expecting 401 or 403 response.
2019-11-21 15:26:38.870 PST ERROR YahooGroupsAPI Unknown 401 error for https://calendar.yahoo.com/ws/v3/users/@@groups_6ed6200a-4c7e-49a1-bf28-4e663126461d/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy, giving up on this download
2019-11-21 15:26:38.870 PST ERROR archive_calendar ERROR: Couldn't load wssid exception to get calendarError.
Traceback (most recent call last):
  File "yahoo.py", line 596, in archive_calendar
    yga.download_file(tmpUri)  # We expect a 403 or 401  here
  File "/home/dev/Data/YGA-Len/yahoogroupsapi.py", line 135, in download_file
    r.raise_for_status()
  File "/home/dev/.local/lib/python3.5/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://calendar.yahoo.com/ws/v3/users/@@groups_6ed6200a-4c7e-49a1-bf28-4e663126461d/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "yahoo.py", line 602, in archive_calendar
    tmpJson = json.loads(e.response.content)['calendarError']
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
/home/dev/.local/lib/python3.5/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)
Traceback (most recent call last):
  File "yahoo.py", line 947, in <module>
    archive_calendar(yga)
  File "yahoo.py", line 609, in archive_calendar
    if 'wssid' not in tmpJson:
UnboundLocalError: local variable 'tmpJson' referenced before assignment

Lotus907efi commented 4 years ago

Just out of curiosity, what happens if you add a calendar event to the group so the calendar is no longer empty? If you retry downloading the group then, does that make a difference?

DiagonalArg commented 4 years ago

[Note the update on my last post. I need to check if this group has any calendar events. I'll also check if a non-admin can add a calendar event.]

Oh, that's interesting ... I'm getting something totally different on this group. When I try to go to the calendar in the web interface, I get a notice: "Your access to this calendar is being processed. Please check back later."

lennier1 commented 4 years ago

OK, this commit should (hopefully) prevent the crash. You won't get the calendar of course, though it's unclear if there ever actually is a calendar when this happens.
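
The `UnboundLocalError` in the earlier traceback arises because `tmpJson` is assigned only inside the `try` block, so a failure there leaves the name undefined when it is read afterwards. Below is a sketch of one way to guard against that, by binding the name up front and returning early. `extract_wssid` and its argument are hypothetical, and this is not the code from the commit.

```python
import json
import logging

logger = logging.getLogger("archive_calendar")


def extract_wssid(error_body):
    """Hypothetical helper: pull 'wssid' out of a calendar error body, or None."""
    calendar_error = None  # bound on every path, so later reads cannot fail
    try:
        calendar_error = json.loads(error_body)["calendarError"]
    except (ValueError, KeyError, TypeError):
        # ValueError: body is not JSON; KeyError: no 'calendarError' key;
        # TypeError: bytes passed on Python < 3.6, or the JSON is not an object.
        logger.error("Couldn't load wssid exception to get calendarError.")
    if not isinstance(calendar_error, dict) or "wssid" not in calendar_error:
        return None
    return calendar_error["wssid"]
```

A caller can then treat `None` as "no calendar available for this group" and move on to the next archive step.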

lennier1 commented 4 years ago

https://github.com/lennier1/yahoo-group-archiver/commit/4e4081db442ebb2765f977e076f452193650e0c6

DiagonalArg commented 4 years ago

@lennier1 - Ok, I'm still getting the error, but now I'm also getting it on an open group. When I go to the calendar on any of my groups that have calendar access (some have it greyed out), I'm seeing, "Your access to this calendar is being processed. Please check back later," but I am able to access everything else. The open group that has this issue and that is producing the error is Autism-Mercury. It was not producing this error previously, and previously I was able to see the calendar in the web interface (even if it looked like there was nothing in it).

Try this:

python3 yahoo.py -ct "$TCOOKIE" -cy "$YCOOKIE" -c Autism-Mercury

2019-11-21 19:15:35.523 PST INFO archive_calendar Getting wssid. Expecting 401 or 403 response.
2019-11-21 19:15:36.486 PST ERROR YahooGroupsAPI Unknown 401 error for https://calendar.yahoo.com/ws/v3/users/@@groups_4942bfd0-3daa-433d-b569-ab48243a1ac4/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy, giving up on this download
2019-11-21 19:15:36.487 PST ERROR archive_calendar ERROR: Couldn't load wssid exception to get calendarError.
Traceback (most recent call last):
  File "yahoo.py", line 596, in archive_calendar
    yga.download_file(tmpUri)  # We expect a 403 or 401  here
  File "/home/dev/Data/YGA-Len/yahoogroupsapi.py", line 135, in download_file
    r.raise_for_status()
  File "/home/dev/.local/lib/python3.5/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://calendar.yahoo.com/ws/v3/users/@@groups_4942bfd0-3daa-433d-b569-ab48243a1ac4/calendars/events/?format=json&dtstart=20000101dtend=20000201&wssid=Dummy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "yahoo.py", line 602, in archive_calendar
    tmpJson = json.loads(e.response.content)['calendarError']
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
/home/dev/.local/lib/python3.5/site-packages/requests/__init__.py:83: RequestsDependencyWarning: Old version of cryptography ([1, 2, 3]) may cause slowdown.
  warnings.warn(warning, RequestsDependencyWarning)

lennier1 commented 4 years ago

I can download the Autism-Mercury calendar with -c with no issues, and get events. Maybe just a temporary problem?

IgnoredAmbience commented 4 years ago

I think this bug is now fixed in master thanks to @lennier1's change being merged. Please reopen if it reoccurs.