IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License

Message Download Failures & Simple Procedure to Recover. #114

Closed DiagonalArg closed 4 years ago

DiagonalArg commented 4 years ago

Note that downloads can fail for various reasons. Here are two smaller groups as examples; in both, the log reaches the last message, but at least one of the file counts on disk falls short of the total:

2019-11-18 09:43:10.151 PST INFO archive_message_content Fetching  raw message id: 21183 (20612 of 20612)
2019-11-18 09:43:10.466 PST INFO archive_message_content Fetching html message id: 21183 (20612 of 20612)

$ ((ls ACC4Kids/email/ | tee /dev/fd/4 | egrep -E -e "^[[:digit:]]*\.json" | wc | sed 's/^/html: /' >&3 ) 4>&1 | egrep -E -e "^[[:digit:]]*_raw\.json" | wc | sed 's/^/raw:  /' >&3) 3>&1
html:   20611   20611  215900
raw:    20612   20612  298358

--------

2019-11-19 04:14:31.585 PST INFO archive_message_content Fetching  raw message id: 62576 (61924 of 61924)
2019-11-19 04:14:32.018 PST INFO archive_message_content Fetching html message id: 62576 (61924 of 61924)

$ ((ls adult-metal-chelation/email/ | tee /dev/fd/4 | egrep -E -e "^[[:digit:]]*\.json" | wc | sed 's/^/html: /' >&3 ) 4>&1 | egrep -E -e "^[[:digit:]]*_raw\.json" | wc | sed 's/^/raw:  /' >&3) 3>&1
html:   61916   61916  670054
raw:    61919   61919  917763
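
The counts above show that some messages are missing one of their two files, but not which ones. Below is a minimal Python sketch for listing them. It assumes, as the patterns in the commands above suggest, that each message is saved as `<id>.json` (html) and `<id>_raw.json` (raw) in the group's `email/` directory. The name `find_unpaired` is made up for illustration and is not part of the archiver.

```python
import re

# Assumed naming, taken from the grep patterns above: "<id>.json" holds the
# html version of a message and "<id>_raw.json" holds the raw version.
HTML_RE = re.compile(r"^(\d+)\.json$")
RAW_RE = re.compile(r"^(\d+)_raw\.json$")


def find_unpaired(filenames):
    """Return (ids with raw but no html, ids with html but no raw), sorted."""
    html_ids, raw_ids = set(), set()
    for name in filenames:
        m = HTML_RE.match(name)
        if m:
            html_ids.add(int(m.group(1)))
            continue
        m = RAW_RE.match(name)
        if m:
            raw_ids.add(int(m.group(1)))
    return sorted(raw_ids - html_ids), sorted(html_ids - raw_ids)
```

For example, after `import os`, calling `find_unpaired(os.listdir("ACC4Kids/email"))` would give the IDs that could be passed to --ids. This only catches messages where one file of the pair exists; a message missing both files will not show up.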

Full output is at these links, but note that they do not include standard error. (One of the runs crashed at the end when trying to get the calendar. I'll report that separately.)

https://framadrop.org/r/sHt00OBEu2#RvarM7guD7nS0ewHqTXG+jRHJWjeMadCspsW6LkASlM= https://framadrop.org/r/sHQBU7bVMk#4mkc7im3l7FjGXPCCdKtQFohRs2S7P1o1/Zs7oSF4IA= [Expires in 7 days.]

For the smaller of these two groups, the problem was a network error on my end; the other group I haven't looked at yet.

Right now I have to dig through the output to find the failed message IDs and then retry them with --ids. I strongly suggest that when the script is rerun for a group that has already been downloaded, it should skip messages, files, etc. that are already present. There is no point in clobbering what is already there, and this would give an easy way to retry after failures.
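
The skip-if-present behaviour suggested here could look roughly like the following Python sketch. It is not the archiver's code: `save_if_missing` and the `fetch` callback are hypothetical names standing in for whatever actually downloads an item.

```python
import os


def save_if_missing(path, fetch):
    """Write fetch()'s bytes to path unless a non-empty file is already there.

    Returns True if a download happened, False if the existing file was kept.
    `fetch` is a hypothetical zero-argument callable returning bytes.
    """
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return False  # already archived; do not clobber it
    data = fetch()
    # Write to a temporary name first so an interrupted run never leaves a
    # truncated file that a later run would mistake for a finished download.
    tmp_path = path + ".part"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)
    return True
```

With a check like this, rerunning over the same group only fetches whatever failed or was never reached the first time.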

DiagonalArg commented 4 years ago

Here is how I'm picking up messages that failed to download on a first pass. I do the initial run, and send the terminal output to "$listname".output. When it's done, I run:

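#! /bin/bash
# The group name comes from the first argument, or from $listname if already set.
listname="${1:-$listname}"
TCOOKIE="..."
YCOOKIE="..."

# Collect the 11th field of every ERROR line that mentions a message, de-duplicated.
retrylist=$(grep -E 'ERROR.*message' "$listname".output | awk '{ print $11 }' | sort -un | tr '\n' ' ' | sed 's/ $//')

# $retrylist is deliberately unquoted so that each ID becomes a separate argument.
python3 yahoo.py -ct "$TCOOKIE" -cy "$YCOOKIE" --ids $retrylist -e "$listname"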

Of course if the run stopped due to a network error, for instance, then this script won't help. It needs every message to have been either downloaded or to have produced an ERROR in the terminal output.
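
For anyone who would rather do the extraction step in Python, here is a rough equivalent of the grep/awk pipeline. It mirrors the script's assumption that the message ID is the 11th whitespace-separated field of an ERROR line; the ERROR line format itself is not shown in this thread, so the `field` default should be checked against real output. `failed_ids` is an illustrative name, not something the archiver provides.

```python
import re

# Same pattern as the grep in the script above.
ERROR_RE = re.compile(r"ERROR.*message")


def failed_ids(lines, field=11):
    """Collect the given 1-based whitespace field from ERROR...message lines.

    Duplicates are dropped and first-seen order is kept. Lines with fewer
    than `field` fields are skipped.
    """
    seen = []
    for line in lines:
        if not ERROR_RE.search(line):
            continue
        parts = line.split()
        if len(parts) >= field and parts[field - 1] not in seen:
            seen.append(parts[field - 1])
    return seen
```

The result can be fed to the same command as before, for instance by opening "$listname".output, passing the file object to `failed_ids`, and appending the returned IDs after --ids.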

IgnoredAmbience commented 4 years ago

The code in the latest master version of the tree should mostly only download content that's missing from an existing download. (I say mostly, as I think a few of the archive types are missing this functionality). Please reopen if this is not the case.