anirvan / yahoo-group-archive-tools

Converts a Yahoo group archive created by yahoo-group-archiver into standalone email, mbox folders, and PDF files
MIT License

No attachments in .eml or .mbox files #2

Closed jnew-gh closed 4 years ago

jnew-gh commented 4 years ago

I used this tool on the output of IgnoredAmbience's yahoo-group-archiver and while all of the .eml and .mbox files are generated, there are no attachments. To be more specific, the emails show that there are attachments but none of the attachments exist.

Line 279 of yahoo-group-archive-tools.pl:

$attachments_dir_path =~ s/_raw\.json$/_attachments/;

For me this resolves to <source_directory>/email/xx_attachments, but there are no xx_attachments directories in the .../email source directory.
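To make the path derivation above concrete, here is a minimal Python sketch of the same substitution, with a check for whether the resulting directory exists. It is an illustration only, not the tool's Perl code; the function names and the example path are hypothetical.

```python
import re
from pathlib import Path


def attachments_dir_for(raw_json_path: str) -> str:
    """Mirror the Perl substitution s/_raw\\.json$/_attachments/ (hypothetical helper)."""
    return re.sub(r"_raw\.json$", "_attachments", raw_json_path)


def attachments_dir_exists(raw_json_path: str) -> bool:
    """Report whether the derived attachments directory is actually on disk."""
    return Path(attachments_dir_for(raw_json_path)).is_dir()


if __name__ == "__main__":
    # Made-up path, used only to show the transformation.
    example = "/source_directory/email/12_raw.json"
    print(attachments_dir_for(example), attachments_dir_exists(example))
```

If the archiver wrote the attachment directories somewhere other than .../email, the existence check returns False for every message, which would match the "could not be found, skipping" output.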

The output of the tool shows a lot of the following:

[<datetime>] [<groupname>] message xxxx: attachment named '<filename>' could not be found, skipping

Did the output file structure of IgnoredAmbience's yahoo-group-archiver change? Between the time I started using it (October 25, 2019) and today, all of the .../xxxx_attachments/ directories and xxxx.json files have moved from the .../email directory to the .../topics directory (which didn't exist before). The contents of the .../attachments directory are also very different. Both datasets were downloaded from the same group.

Note: If I run this tool on the October 25th version of the data download, most of the attachments are found.

anirvan commented 4 years ago

@jnew-gh, thanks for flagging this.

I believe the location of the attachments changes depending on which yahoo-group-archiver option you use: -at (puts attachments under /attachments/), -e (puts attachments under /email/), or -t (puts attachments under /topics/?). It's possible that the default has changed over time. I've only used the first two options, but the third is obviously also a thing.

This doesn't seem like it would be hard to fix, as long as we can figure out the structure of the /topics/[number]_attachments/ directories.

For example, let's say there exists /topics/5_attachments/. Is every attachment in that directory associated with email 5? Or are they associated with any of the emails in the topic that begins with email 5?

I'd appreciate any help figuring this out. Thank you!

anirvan commented 4 years ago

OK, I looked at yahoo-group-archiver and as far as I can tell, it's saving the attachments under a directory keyed on the message ID, rather than the topic ID. Let me see if I can work with this.

Source: https://github.com/IgnoredAmbience/yahoo-group-archiver/blob/27ac4b1ae930e8f3804de9552d599bcc73f48eaf/yahoo.py#L381

jnew-gh commented 4 years ago

I'm fairly sure I didn't use any of the individual switches (-e, -at, -f, etc.) but just used the default options when I ran both sets of data, like this:

./yahoo.py -ct '<T_cookie>' -cy '<Y_cookie>' '<groupid>'

anirvan commented 4 years ago

@jnew-gh, I think I fixed it as of the latest commit. I'd appreciate it if you could test this. Thank you!

jnew-gh commented 4 years ago

Still no attachments but there is progress.

If I throw in a couple of debug logfile output statements just after line 334: my $filename = $attachment_on_disk->filename;

I get, as an actual example:

[2019-52-14 09:52:55] DEBUG $attachments_dir_to_scan= /opt/yahoo-group-archive-tools-master/YahooGroup/topics/4287_attachments
[2019-52-14 09:52:55] DEBUG $filename= 1658543138-IMG_6465.JPG
[2019-52-14 09:52:55] [YahooGroup] message 4287: attachment named 'IMG_6465.JPG' could not be found, skipping

So, it now correctly finds the xxxx_attachments directories in the .../topics directory. But now it can't find IMG_6465.JPG because the actual filename is 1658543138-IMG_6465.JPG.

Further, file 1658543138-IMG_6465.JPG also exists in .../attachments/651627843/.
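The mismatch above (the message refers to IMG_6465.JPG, while the file on disk is 1658543138-IMG_6465.JPG) suggests a lookup that also accepts a numeric prefix. Below is a minimal Python sketch of that idea. It assumes the prefix is always digits followed by a hyphen, which is inferred from this single example; find_attachment is a hypothetical helper and is not how the tool resolves attachments.

```python
import re
from pathlib import Path
from typing import Optional


def find_attachment(attachments_dir: Path, wanted_name: str) -> Optional[Path]:
    """Find wanted_name in attachments_dir, allowing a '<digits>-' prefix on disk."""
    if not attachments_dir.is_dir():
        return None

    # Prefer an exact filename match when one exists.
    exact = attachments_dir / wanted_name
    if exact.is_file():
        return exact

    # Assumption: the archiver may prepend a numeric ID and a hyphen.
    prefixed = re.compile(r"^\d+-" + re.escape(wanted_name) + r"$")
    for candidate in sorted(attachments_dir.iterdir()):
        if candidate.is_file() and prefixed.match(candidate.name):
            return candidate
    return None
```

If two files in one directory share the same base name but carry different prefixes, this sketch returns the first in sorted order; matching on the ID recorded in the message metadata would be more reliable.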

If I grep 1658543138 I get:

./topics/message_metadata_4.json
./topics/3097.json
./email/message_metadata_4.json
./attachments/651627843/attachmentinfo.json

If I grep 651627843 I get:

./topics/message_metadata_4.json
./topics/3097.json
./archive.log
./email/message_metadata_4.json
./attachments/651627843/attachmentinfo.json
./attachments/allattachmentinfo.json

If it helps, I've attached all the above files (as .txt files) to this post. (Note: ./topics/message_metadata_4.json and ./email/message_metadata_4.json are the same, and I've included only the relevant sections of some of the large files.)

./topics/message_metadata_4.snipped.json.txt
./topics/3097.json.txt
./attachments/651627843/attachmentinfo.json.txt
./archive.log.txt
./attachments/allattachmentinfo.json.txt

As a suggestion, putting in the entire pathname in front of the skipped attachment filename in the log output would be immensely helpful.

anirvan commented 4 years ago

Can you try it now? I redid the way attachment handling works, and it works for all my own test cases. You can get more debugging messages if you call it with --noisy

jnew-gh commented 4 years ago

Partial success! Your update from 2 days ago still produced output with no attachments, but your most recent update produced 23 emails with attachments (out of 207 possible). I get the "possible" total from the 207 xxxx_attachments folders in the .../topics folder. In the logfile, there are a lot of attachment successes ("attached file from...") but still a lot of failures ("attachment named 'xxxx' could not be found, skipping") on emails that have valid attachments.
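For anyone wanting to reproduce this kind of comparison, here is a rough Python sketch that counts the *_attachments directories under a topics folder and the messages in an .mbox file that carry at least one attachment part. It uses only the standard library; the function names are hypothetical, and the paths you pass in depend on your own archive layout.

```python
import mailbox
from pathlib import Path


def count_attachment_dirs(topics_dir: Path) -> int:
    """Count directories named like '<number>_attachments' under topics_dir."""
    return sum(1 for p in topics_dir.glob("*_attachments") if p.is_dir())


def count_messages_with_attachments(mbox_path: Path) -> int:
    """Count mbox messages with at least one part marked as an attachment."""
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        return sum(
            1
            for msg in box
            if any(
                part.get_content_disposition() == "attachment"
                for part in msg.walk()
            )
        )
    finally:
        box.close()
```

The two numbers are only roughly comparable: a directory count says how many messages had attachments downloaded, while the mbox count depends on the parts being written with an attachment disposition.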

I can't make a final judgement because the import of the .mbox file into kmail (v5.10.3) choked on the 4793rd email (out of 7615). When I try to open the .mbox file in a text editor (kate), I get the error "The file .mbox was opened with UTF-8 encoding but contained invalid characters."

Examining the .mbox file, it seems as if all the messages are contained in it; but I'm guessing there are some invalid characters preventing the import from completing?
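One way to test the invalid-character guess is to scan the raw bytes for sequences that are not valid UTF-8. The following is a minimal Python sketch; find_invalid_utf8 is a hypothetical helper, and a reported offset only shows where strict UTF-8 decoding fails, not that the .mbox file is malformed, since a message may legitimately use another 8-bit encoding.

```python
from typing import List


def find_invalid_utf8(data: bytes, limit: int = 10) -> List[int]:
    """Return up to `limit` byte offsets where strict UTF-8 decoding fails."""
    offsets: List[int] = []
    pos = 0
    while pos < len(data) and len(offsets) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            offsets.append(bad)
            pos = bad + 1  # resume just past the offending byte
    return offsets
```

To use it on a real file, read the file in binary mode, for example `find_invalid_utf8(open(path, "rb").read())`, and then inspect the bytes around each offset to see which message they belong to.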

anirvan commented 4 years ago

Success! (I think)

I fixed two mistakes:

I fixed both of these issues in the latest rev of the code, and I'm hopeful that this will finally solve your problem.

Can you check?

anirvan commented 4 years ago

@jnew-gh, I split out your issue with the UTF-8 encoding into issue #6. Can you take a look at that?

gl31415 commented 4 years ago

Well done – that seems to work! All downloaded attachments now appear in the relevant eml files. (I initially thought that it'd failed again as I was still getting some warning messages, only to realise that the archive tool had failed to download some attachments.)

As far as I can tell, Yahoo has already deleted my groups, despite their assurance that they wouldn't delete data until the end of January 2020. (Or at least, the archiver tool can no longer see any attachments for my groups.) Tragic.

Thanks again.