csaftoiu / yahoo-groups-backup

A Python script to back up the contents of private Yahoo! groups.
The Unlicense

AssertionError - assert stripped_name.endswith("&quot;") #35

Open lancesnead opened 8 years ago

lancesnead commented 8 years ago

I get an occasional AssertionError when scraping messages from my forum, and it crashes the script. I was able to scrape the first 12,000 of my 14K+ messages before it occurred. My only workaround was to delete the offending message from the Yahoo servers so the scrape could continue. Here is the error log:

         'Received: from [66.218.67.136] by n15.grp.scd.yahoo.com with '
         'NNFMP; 23 Feb 2003 03:31:01 -0000\r\n'
         'Date: Sun, 23 Feb 2003 03:30:59 -0000\r\n'
         'To: texasmountaineers@yahoogroups.com\r\n'
         'Subject: e-rock\r\n'
         'Message-ID: <b39f9j+tg97@eGroups.com>\r\n'
         'User-Agent: eGroups-EW/0.82\r\n'
         'MIME-Version: 1.0\r\n'
         'Content-Type: text/plain; charset=ISO-8859-1\r\n'
         'Content-Length: 258\r\n'
         'X-Mailer: Yahoo Groups Message Poster\r\n'
         'From: "kelly <nurkpb@charter.net>" '
         '<nurkpb@charter.net>\r\n'
         'X-Originating-IP: 67.95.3.66\r\n'
         'X-Yahoo-Group-Post: member; u=105309409\r\n'
         'X-Yahoo-Profile: nurkpb\r\n'
         '\r\n'
         'hey guys\n'
         'i am going to the e-rock climb\n'
         'i would love to leave fri pm and carpool with anyone\n'
         'i live off 287 & Debbie Ln in Mansfield, it's on the way :)\n'
         'anyone interested?\n'
         'please let me know\n'
         'thanks\n'
         'kelly\n'
         'hm 817-453-9557\n'
         'cell 817-271-8596\n'
         'nurkpb@charter.net\n'
         '\n'
         '\n'
         '\n',

 'replyTo': 'SENDER',
 'senderId': '0JhJdOkNdI_n8iu5_u_2RoTIEuZdDtsmmD7dvc9nr6I-bGg2BJwlrQfYtCFjPiNvJW_oxNVzVqmQRfV3Ml8zyM94kZAT3VnhmuZ3MiMhoSes6VqBYEohqg',
 'spamInfo': {'isSpam': False, 'reason': '0'},
 'specialLinks': [],
 'subject': 'e-rock',
 'systemMessage': False,
 'topicId': 2103,
 'userId': 105309409}
Traceback (most recent call last):
  File "C:\Temp\yahoo-groups-backup\yahoo-groups-backup.py", line 129, in <module>
    main()
  File "C:\Temp\yahoo-groups-backup\yahoo-groups-backup.py", line 125, in main
    arguments, cfg_args)
  File "C:\Temp\yahoo-groups-backup\yahoo-groups-backup.py", line 103, in invoke_subcommand
    return module.command(args)
  File "C:\Temp\yahoo-groups-backup\yahoo_groups_backup\subcommands\scrape_messages.py", line 50, in command
    msg = scraper.get_message(cur_message)
  File "C:\Temp\yahoo-groups-backup\yahoo_groups_backup\scraper.py", line 180, in get_message
    return self._massage_message(data)
  File "C:\Temp\yahoo-groups-backup\yahoo_groups_backup\scraper.py", line 109, in _massage_message
    assert stripped_name.endswith("&quot;")
AssertionError
C:\Temp\yahoo-groups-backup>yahoo-groups-backup.py scrape_messages TEXASMOUNTAINEERS
Using '--mongo-port' from config file
Using '--mongo-host' from config file
Using '--password' from config file
Using '--login' from config file
Processing the log-in page...
Inserted message #14325 by Kevin Dahlstrom/None/kdahlstrom@gmail.com
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Skipped 1000 messages we already processed
Message #2103 is missing
Failed to process message:
{'authorName': 'ajfreeman2002 &lt;perrodepaz@earthlink.net&gt;',
 'canDelete': True,
 'contentTrasformed': False,
 'from': '&quot;ajfreeman2002 &lt;perrodepaz@earthlink.net&gt;&quot; '
         '&lt;perrodepaz@earthlink.net&gt;',
 'headers': {'messageIdInHeader': 'PGIzOTExditncGc3QGVHcm91cHMuY29tPg=='},
 'messageBody': '<div id="ygrps-yiv-960954217">Will there be top ropes set up '
                'at Enchanted Rock? or is it all lead <br/>\n'
                'climbing? <br/>\n'
                '<br/>\n'
                'We are interested if it is top roping.<br/>\n'
                'Ardis</div>',
 'msgId': 2102,
 'msgSnippet': 'Will there be top ropes set up at Enchanted Rock? or is it all '
               'lead climbing? We are interested if it is top roping. Ardis',
 'nextInTime': 2105,
 'nextInTopic': 0,
 'numMessagesInTopic': 1,
 'postDate': 1045956479,
 'prevInTime': 2101,
 'prevInTopic': 0,
 'profile': 'ajfreeman2002',
 'rawEmail': 'Return-Path: &lt;perrodepaz@earthlink.net&gt;\r\n'
             'X-Sender: perrodepaz@earthlink.net\r\n'
             'X-Apparently-To: texasmountaineers@yahoogroups.com\r\n'
             'Received: (EGP: mail-8_2_3_4); 22 Feb 2003 23:28:00 -0000\r\n'
             'Received: (qmail 95895 invoked from network); 22 Feb 2003 '
             '23:28:00 -0000\r\n'
             'Received: from unknown (66.218.66.216)\n'
             '  by m5.grp.scd.yahoo.com with QMQP; 22 Feb 2003 23:28:00 '
             '-0000\r\n'
             'Received: from unknown (HELO n21.grp.scd.yahoo.com) '
             '(66.218.66.77)\n'
             '  by mta1.grp.scd.yahoo.com with SMTP; 22 Feb 2003 23:28:00 '
             '-0000\r\n'
             'Received: from [66.218.67.162] by n21.grp.scd.yahoo.com with '
             'NNFMP; 22 Feb 2003 23:28:00 -0000\r\n'
             'Date: Sat, 22 Feb 2003 23:27:59 -0000\r\n'
             'To: texasmountaineers@yahoogroups.com\r\n'
             'Subject: enchanted rock\r\n'
             'Message-ID: &lt;b3911v+gpg7@eGroups.com&gt;\r\n'
             'User-Agent: eGroups-EW/0.82\r\n'
             'MIME-Version: 1.0\r\n'
             'Content-Type: text/plain; charset=ISO-8859-1\r\n'
             'Content-Length: 126\r\n'
             'X-Mailer: Yahoo Groups Message Poster\r\n'
             'From: &quot;ajfreeman2002 &lt;perrodepaz@earthlink.net&gt;&quot; '
             '&lt;perrodepaz@earthlink.net&gt;\r\n'
             'X-Originating-IP: 65.56.122.36\r\n'
             'X-Yahoo-Group-Post: member; u=126064000\r\n'
             'X-Yahoo-Profile: ajfreeman2002\r\n'
             '\r\n'
             'Will there be top ropes set up at Enchanted Rock? or is it all '
             'lead \n'
             'climbing? \n'
             '\n'
             'We are interested if it is top roping.\n'
             'Ardis\n'
             '\n'
             '\n',
 'replyTo': 'SENDER',
 'senderId': 'Z7i25P28BJENDnHwomyWIDjJh2nCRNcAExYizy-R4tFhTCcVvL_912yGWz279n7YJVL9UUtm6R_tSi9PqUWnHdUtsoTV9Qa09WZrMzWG0A7UUS_VL8AnG-i-xbZzbBzxzCIsLk8_ia7RZiAr',

 'spamInfo': {'isSpam': False, 'reason': '0'},
 'specialLinks': [],
 'subject': 'enchanted rock',
 'systemMessage': False,
 'topicId': 2102,
 'userId': 126064000}
Traceback (most recent call last):
  File "C:\Temp\yahoo-groups-backup\yahoo-groups-backup.py", line 129, in <module>
    main()
  File "C:\Temp\yahoo-groups-backup\yahoo-groups-backup.py", line 125, in main
    arguments, cfg_args)
  File "C:\Temp\yahoo-groups-backup\yahoo-groups-backup.py", line 103, in invoke_subcommand
    return module.command(args)
  File "C:\Temp\yahoo-groups-backup\yahoo_groups_backup\subcommands\scrape_messages.py", line 50, in command
    msg = scraper.get_message(cur_message)
  File "C:\Temp\yahoo-groups-backup\yahoo_groups_backup\scraper.py", line 180, in get_message
    return self._massage_message(data)
  File "C:\Temp\yahoo-groups-backup\yahoo_groups_backup\scraper.py", line 109, in _massage_message
    assert stripped_name.endswith("&quot;")
AssertionError
csaftoiu commented 8 years ago

Ah wow, that's due to a very special "From" field, namely:

&quot;kelly &lt;nurkpb@charter.net&gt;&quot; &lt;nurkpb@charter.net&gt;

Or, when unescaped:

"kelly <nurkpb@charter.net>" <nurkpb@charter.net>

The script didn't expect to see a nested email address, i.e. a display name that itself contains an address in angle brackets (a < inside the quotes).

Thanks for the bug report. I know enough to do a bug fix for this now.
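
For illustration, here is a minimal sketch of one way such a "From" value could be split into a display name and an address. It is not the fix made in the repository, and `split_from_field` is a hypothetical helper name. It assumes the field arrives HTML-escaped, as in the log above, and leans on the standard library's `html.unescape` and `email.utils.parseaddr` instead of asserting on the position of `&quot;`.

```python
import html
from email.utils import parseaddr


def split_from_field(escaped_from):
    """Return (display_name, address) for an HTML-escaped From value.

    Hypothetical helper for illustration; not part of yahoo-groups-backup.
    """
    # Undo the HTML escaping first: &quot; -> ", &lt; -> <, &gt; -> >
    raw = html.unescape(escaped_from)
    # parseaddr treats the quoted string as the display name, so a "<"
    # inside the quotes is not mistaken for the start of the address.
    return parseaddr(raw)


name, addr = split_from_field(
    "&quot;kelly &lt;nurkpb@charter.net&gt;&quot; &lt;nurkpb@charter.net&gt;"
)
print(repr(name), repr(addr))
```

With the value from this report, the display name keeps the nested address (`kelly <nurkpb@charter.net>`) and the address part is `nurkpb@charter.net`. Unescaping before parsing means the quote and angle-bracket handling is left to the parser, so no assumption is made about where `&quot;` appears.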