IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars, 46 forks

json to eml converter #55

Open IgnoredAmbience opened 4 years ago

IgnoredAmbience commented 4 years ago

Reintroduce conversion from json messages/attachments/photos to eml files. The bulk of the code is in the commit history, just needs extracting out to a separate tool.

jmeile commented 4 years ago

I found this piece of code in another Yahoo archiver:

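import email
import time

import requests

... some code ...

    # msg_json, count, yga, unescape_html and HOLDOFF come from the elided code.
    for message in msg_json['messages']:
        id = message['messageId']

        print("* Fetching raw message #%d of %d" % (id, count))
        raw_json = None
        # Try up to five times before giving up on this message.
        for i in range(5):
            try:
                raw_json = yga.messages(id, 'raw')
                break
            except requests.exceptions.ReadTimeout:
                print("ERROR: Read timeout, retrying")
                time.sleep(HOLDOFF)
            except requests.exceptions.HTTPError as err:
                if err.response.status_code == 500:
                    print("ERROR: HTTP error %d reading the message... given up :(" % err.response.status_code)
                    # Stop retrying, as the message says; raw_json stays None.
                    break

        if raw_json is None:
            print("ERROR: given up on this message, moving on")
            continue

        # Undo the HTML escaping, then drop every character that is not Latin-1.
        mime = unescape_html(raw_json['rawEmail']).encode('latin_1', 'ignore')

        # mime is bytes, so parse it with message_from_bytes.
        eml = email.message_from_bytes(mime)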

I got it from here: https://github.com/philpem/yahoo-group-archiver/blob/master/yahoo.py

I have a working version (I don't know if it is the same repository) that will actually get the eml messages.

IgnoredAmbience commented 4 years ago

Yes, this code was originally in this repository, however it was removed as it was causing crashes during the main archive loop. It is this code that I was referring to that should be extracted out to a new tool.
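A minimal sketch of what such a separate tool might look like, written in Python 3. It assumes each archived message is a JSON file whose `rawEmail` field holds the HTML-escaped message source, as in the snippet above. The directory layout, the `*.json` pattern and the function names are hypothetical, and this is not the code from the commit history.

```python
import html
import json
from pathlib import Path


def json_to_eml_bytes(raw_json: dict) -> bytes:
    """Return the message source as bytes from one parsed raw-message dict."""
    # Assumption: 'rawEmail' holds the HTML-escaped message source.
    text = html.unescape(raw_json["rawEmail"])
    # UTF-8 keeps every character; unencodable lone surrogates become '?'.
    return text.encode("utf-8", "replace")


def convert_directory(src: Path, dest: Path) -> int:
    """Write one .eml file into dest for every usable .json file in src."""
    dest.mkdir(parents=True, exist_ok=True)
    converted = 0
    for path in sorted(src.glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
            eml = json_to_eml_bytes(data)
        except (KeyError, TypeError, ValueError) as err:
            # A bad file is reported and skipped so one failure
            # does not abort the whole run.
            print(f"skipping {path.name}: {err!r}")
            continue
        (dest / (path.stem + ".eml")).write_bytes(eml)
        converted += 1
    return converted
```

Running the conversion as its own pass over files already on disk means a malformed message only costs one skipped file, rather than crashing the archive loop.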

nsapa commented 4 years ago

The commit that removed this functionality was cefc51bda0bdea2bf64216c8223eb0714e42f018, but it was already seriously broken by then. It should still be working, with its main issue (modifying broken encoding), at commit 22d9317c5d706147269bfd1a0ecbaead5a536824.
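The thread does not spell out the encoding problem in detail. One way the snippet quoted earlier can alter message text is its `encode('latin_1', 'ignore')` call, which silently drops every character outside Latin-1. The sketch below only illustrates that Python behaviour; the function names and sample string are made up, and it is an assumption that this relates to the issue meant here.

```python
def lossy_latin1(text: str) -> bytes:
    """Encode as in the quoted snippet: non-Latin-1 characters are dropped."""
    return text.encode("latin_1", "ignore")


def lossless_utf8(text: str) -> bytes:
    """Encode as UTF-8, which can represent every character in the text."""
    return text.encode("utf-8")


# Hypothetical subject line: 'é' is in Latin-1, the two CJK characters are not.
sample = "Subject: caf\u00e9 \u65e5\u672c"

# The Latin-1 round trip loses the CJK characters; the UTF-8 one does not.
latin1_round_trip = lossy_latin1(sample).decode("latin_1")
utf8_round_trip = lossless_utf8(sample).decode("utf-8")
```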

ugcheleuce commented 4 years ago

I'm sure whatever you guys write will be infinitely superior to what I can put together, but in the meantime, I use a little AutoIt script to convert the JSON files to EML files which Thunderbird accepts as "real".

Sadi58 commented 4 years ago

@ugcheleuce Would you like to share that AutoIt script to convert the JSON files to EML files, or, better still, put together a multiplatform (Java or Python?) one?

ugcheleuce commented 4 years ago

@Sadi58 I can use Python and Java, but I can't program in them. I wasn't sure if linking to an AutoIt script was good manners, but since you ask, it's here: http://www.leuce.com/autoit/IA_JSON_2_EML.zip (newer versions always at the same download location). It's a bit slow, unfortunately (it takes about 10 minutes to create a list of 150 000 files to process, and then takes about 1 minute per 2500 files). It has only ever been tested on my own computer, too (Windows 10 Home 64).

Sadi58 commented 4 years ago

@ugcheleuce Thank you. Your script does not compete with the one here, but just complements it. :-) There's also a reference to a competing script above: https://github.com/philpem/yahoo-group-archiver It does more or less the same thing, and downloads message files in eml format instead of json. It's good to have such different options and alternatives. ;-)

n4mwd commented 4 years ago

I wrote a JSON to EML bulk conversion tool as a Windows app that accepts the JSON files generated here and writes them into an "/EML" directory. It currently only writes raw EML files straight from the JSON file, which means it does not reattach photos etc. I have not had time to finish it due to too many hospital visits and other stuff. However, if all you want is a raw eml file with no attachments, then it will work for that as is. The goal is to eventually convert all the files into a single XML file using BBCode, but that part isn't working right now. I can post it to github if anyone is interested.
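For the missing reattachment step, a rough sketch using Python's standard `email` package is shown here. It assumes the attachment bytes and filename have already been loaded from wherever the archiver saved them, which this thread does not describe. `attach_file` is a hypothetical helper, not part of the Windows app mentioned above.

```python
import mimetypes
from email import message_from_bytes, policy


def attach_file(eml_bytes: bytes, data: bytes, filename: str) -> bytes:
    """Return a copy of the message with one file added as an attachment."""
    # policy.default yields an EmailMessage, which provides add_attachment().
    msg = message_from_bytes(eml_bytes, policy=policy.default)

    # Guess the MIME type from the filename; fall back to a generic binary type.
    guessed, _ = mimetypes.guess_type(filename)
    maintype, subtype = (guessed or "application/octet-stream").split("/", 1)

    # For a single-part message this converts it to multipart/mixed and
    # keeps the original body as the first part.
    msg.add_attachment(data, maintype=maintype, subtype=subtype, filename=filename)
    return msg.as_bytes()
```

Calling it once per saved photo or file, on the raw EML bytes, would give mail clients a message with the attachments back in place.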

IgnoredAmbience commented 4 years ago

I believe @PaulWebster is working on this functionality at present.

n4mwd commented 4 years ago

I don't know anything about his work. Mine only writes raw eml files at the moment. Other functionality is pending. https://github.com/n4mwd/YahooEml