IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars 46 forks

Topic downloader #109

Closed lennier1 closed 4 years ago

lennier1 commented 4 years ago

I've implemented a topic downloader, which is a significantly faster way to download all of a group's messages, particularly if there are messages with many replies. This needs larger-scale testing. It might fail on bad HTML. (Maybe fall back to regular one-message-at-a-time downloading then?)
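A rough sketch of the fallback idea, not the code in this PR: download by topic first, then fetch individually any message the topic pass did not deliver. The names `download_all`, `fetch_topic`, `fetch_message` and `TopicParseError` are hypothetical, and it assumes message IDs run from 1 up to a known last ID.

```python
from typing import Callable, Dict, Iterable, List, Optional


class TopicParseError(Exception):
    """Hypothetical error for a topic page that cannot be parsed (e.g. bad HTML)."""


def download_all(
    last_message_id: int,
    topic_ids: Iterable[int],
    fetch_topic: Callable[[int], List[dict]],
    fetch_message: Callable[[int], Optional[dict]],
) -> Dict[int, dict]:
    """Download by topic, then fill any gaps one message at a time."""
    messages: Dict[int, dict] = {}

    # Fast path: one request returns every message in a topic.
    for topic_id in topic_ids:
        try:
            for message in fetch_topic(topic_id):
                messages[message["id"]] = message
        except TopicParseError:
            # Leave this topic's messages to the slow path below.
            continue

    # Slow path: fetch whatever the topic pass did not deliver.
    for message_id in range(1, last_message_id + 1):
        if message_id not in messages:
            message = fetch_message(message_id)
            if message is not None:  # None stands for a deleted/missing message
                messages[message_id] = message

    return messages
```

With this shape a topic that fails to parse costs only the per-message requests for its own messages, rather than aborting the whole group.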

@IgnoredAmbience @d235j @nsapa @philpem

lennier1 commented 4 years ago

A thought on processing via topic ID: since this goes backwards in time, my guess is that it will pick up the spam first (for groups that have been heavily spammed). That might not be an issue, but if the discussions about stopping or pausing processing of a group when bulk spam is reached result in a decision to pause, stop or skip a group, then it would not work in this case. The idea behind pausing a group fetch when lots of spam is reached was to allow moving on to the next group, to maximise the amount of "clean" mail fetched. One could then go back (if state is stored per group) and resume a group when all else is done, if there is still time left before Y! kills the history.

Could the loop be changed to go forwards in time from the oldest topic? I think the next/prev links might be based on the most recent message in the topic, so one cannot simply start from topic 0/1 and follow the "next" link. But simply starting from 0/1 and skipping IDs that are not found might be OK, unless there is an easy way to walk through the links backwards to the start without too much overhead.

That makes sense. I'm including this in my new pull request.
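For illustration only, the "start from the lowest ID and skip what is not found" approach could look roughly like this. `walk_topics_forward` and `fetch_topic` are hypothetical names, and the sketch assumes an upper bound on topic IDs is known and that a missing topic comes back as `None`.

```python
from typing import Callable, Iterator, Optional


def walk_topics_forward(
    fetch_topic: Callable[[int], Optional[dict]],
    last_topic_id: int,
    first_topic_id: int = 1,
) -> Iterator[dict]:
    """Yield topics oldest-first by probing IDs in ascending order."""
    for topic_id in range(first_topic_id, last_topic_id + 1):
        topic = fetch_topic(topic_id)
        if topic is None:
            # No topic with this ID: skip it and keep going forwards.
            continue
        yield topic
```

The cost is one wasted request per ID that is not a topic, which may or may not be acceptable depending on how sparse the IDs are.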

lennier1 commented 4 years ago

Latest updates include starting with earlier messages and better error checking/correction.

lennier1 commented 4 years ago

Added a -tr option if you want to get raw messages as well. It's slower than -t, but faster than -e, and should get all the data.

IgnoredAmbience commented 4 years ago

This PR also seems to have pulled in the archiveteam warrior frontend. I might have to pick through the commits to pull out just the topic downloader and any other fixes included.

IgnoredAmbience commented 4 years ago

This PR has much more on it than the topic downloader now. Will resync with the archive team fork separately.