coandco / gtalk_export

Export Google Talk/Hangouts chats to logfiles, using both mbox IMAP folders (for Google Talk) and Takeouts exports (for Google Hangouts)
MIT License

mbox not well-formed? #1

Open silentguy256 opened 9 years ago

silentguy256 commented 9 years ago

I tried your tool and can't get it to work. At first I ran it with Python 3, which complains about wrong print formatting... I switched to 2.7.9 and got the following:

C:\Users\X\Downloads\gtalk_export-master>C:\Python27\python.exe gtalk_export.py -m Chats.mbox -j Hangouts.json -n"X X" -e X@gmail.com
Processing mbox file at Chats.mbox
Traceback (most recent call last):
  File "gtalk_export.py", line 139, in <module>
    parse_mbox(args.mbox_location, args.name, args.email, args.timestamp_format)
  File "gtalk_export.py", line 74, in parse_mbox
    chatxml = xml.dom.minidom.parseString(payload)
  File "C:\Python27\lib\xml\dom\minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 2, column 0

Any ideas? The mbox does look like a sensible mbox file. I created it with the export-folder function of the recommended plugin, using Thunderbird Portable.

silentguy256 commented 9 years ago

Kinda solved it myself... For some reason `payload = re.sub("=\r\n", "", payload)` failed, even though it worked as expected when I used it in test code with similar text... I manually edited the file and everything went well.

toni1727 commented 8 years ago

I have the same error. How did you solve it? I don't quite understand your solution.

Best regards.

coandco commented 8 years ago

Toni, it looks like silentguy manually did the regex replace on his chat XML before attempting to run it through the tool. I'm not sure why the regex failed for him.

toni1727 commented 8 years ago

This is the file, if you want to try:

https://www.dropbox.com/s/1r9cch0qyjebvh5/test.mbox?dl=0

Please tell me when you have downloaded it and I will remove it.

Best regards.

coandco commented 8 years ago

Sorry this went so long without a reply. I've just gotten around to looking into this, and it looks like the issue might be that Thunderbird exports don't always have consistent newlines. The code assumes everything will have the Windows \r\n, but sometimes they have the Unix \n. I've adjusted the regex to make the \r optional, and it seems to fix the mbox file I was testing against. Try pulling the latest version and running it again.
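
For anyone hitting the same error, here is a rough sketch of the idea, not the exact code from gtalk_export.py. The helper name `strip_soft_breaks` and the sample XML element names are made up for illustration. It only shows that making the `\r` optional lets one pattern match both newline styles:

```python
import re
import xml.dom.minidom


def strip_soft_breaks(payload):
    """Remove soft line breaks ("=" followed by a newline) from a payload.

    Hypothetical helper for illustration; not the function used in the tool.
    """
    # \r? makes the carriage return optional, so "=\r\n" and "=\n" both match.
    return re.sub(r"=\r?\n", "", payload)


# Made-up sample payloads: the same chat XML wrapped with Windows and Unix newlines.
windows_payload = "<con><mes>hello wo=\r\nrld</mes></con>"
unix_payload = "<con><mes>hello wo=\nrld</mes></con>"

for payload in (windows_payload, unix_payload):
    cleaned = strip_soft_breaks(payload)
    # Parsing succeeds once the soft breaks are removed.
    doc = xml.dom.minidom.parseString(cleaned)
    print(doc.documentElement.firstChild.firstChild.data)
```

With the original pattern `"=\r\n"`, the Unix-style payload is left unchanged, so the stray `=` and newline stay in the XML that gets handed to the parser.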

coandco commented 8 years ago

Also, I've converted the print statements to the Python 3 syntax, so it should theoretically work a bit better for that now.
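
As a rough illustration of that kind of change (a sketch only; the helper `status_line` is hypothetical and not taken from the tool), a print call written as a function runs on both Python 2.7 and Python 3 when the `__future__` import is present:

```python
from __future__ import print_function  # has no effect on Python 3


def status_line(mbox_location):
    """Build the status message; hypothetical helper for illustration."""
    return "Processing mbox file at %s" % mbox_location


# Python 2 statement form, which is a syntax error on Python 3:
#     print "Processing mbox file at %s" % mbox_location
# Function form, valid on both versions with the import above:
print(status_line("Chats.mbox"))
```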