IgnoredAmbience / yahoo-group-archiver

Scrapes and archives a Yahoo Group's email archives, photo galleries and file contents using the non-public API
MIT License
93 stars, 45 forks

Add fetching of Topics functionality #47

Closed tarilabs closed 4 years ago

tarilabs commented 5 years ago

I have used this functionality to retrieve "Topics" data from my Y! groups.

There is no documentation for the API, but I reckon the messageId is also used to create a "Topics" entry, which lists all the replies available in the conversation as seen up to the given messageId. This is my understanding.

Hence this functionality retrieves the list of messageId(s) again and iterates over it once more, this time to retrieve all possible entries from the "Topics" REST endpoint.
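The loop described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `fetch_topic_snapshots` and the `fetch_topic` callable are hypothetical names, and the actual request to the undocumented "Topics" endpoint is left to whatever client the caller passes in.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def fetch_topic_snapshots(
    message_ids: Iterable[int],
    fetch_topic: Callable[[int], Any],
) -> Tuple[Dict[int, Any], List[int]]:
    """Fetch one "Topics" snapshot per messageId.

    `fetch_topic` is a hypothetical callable that performs the real request
    for a single messageId; it is injected so that this sketch makes no
    assumption about the undocumented API itself.
    """
    snapshots: Dict[int, Any] = {}
    failed: List[int] = []
    for message_id in message_ids:
        try:
            snapshots[message_id] = fetch_topic(message_id)
        except Exception:
            # Remember the failure and carry on, so one bad messageId
            # does not abort the whole archive run.
            failed.append(message_id)
    return snapshots, failed
```

Returning the failed ids separately lets a caller retry them later instead of losing them silently.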

I first tried to reverse-engineer the "summary" page on the GET topics/ REST endpoint itself, but it was more complex and I was not able to get good results, while the current implementation gave, at least in my case, a better snapshot. I reckon there might be more "duplicates", as this fetches the "status" of the Topics as seen when each messageId arrived; anyhow, I preferred to obtain multiple snapshots of what is semantically the same Topic at multiple points in time, rather than continuing the attempt to reverse-engineer the main GET topics/ REST endpoint ;)
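For anyone who later wants a single entry per Topic instead of every snapshot, the duplicates could be collapsed along these lines. This is a sketch under assumptions: the field names `topicId` and `messages` are hypothetical and not confirmed against the real API response, and "latest" is taken to mean the snapshot holding the most messages.

```python
from typing import Any, Dict, Iterable


def latest_snapshot_per_topic(
    snapshots: Iterable[Dict[str, Any]],
) -> Dict[Any, Dict[str, Any]]:
    """Keep, for each topic, the snapshot that contains the most messages.

    Assumes each snapshot is a dict with a "topicId" key and a "messages"
    list; both names are placeholders for whatever the endpoint returns.
    """
    latest: Dict[Any, Dict[str, Any]] = {}
    for snapshot in snapshots:
        topic_id = snapshot["topicId"]
        current = latest.get(topic_id)
        # A later snapshot of the same topic has seen more replies.
        if current is None or len(snapshot["messages"]) > len(current["messages"]):
            latest[topic_id] = snapshot
    return latest
```

The full set of snapshots can still be kept on disk; this only builds a deduplicated view on top of it.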

Please be patient with me during code review, as this is my first PR on a Python-based project :)

I hope this helps anyhow!

IgnoredAmbience commented 5 years ago

Thanks for the contribution. Out of interest, what in particular was problematic about the summary /topics page? Other similar summary endpoints in the API seem to work reasonably well. I think I'd prefer it if the index of topics was initially retrieved from here rather than reversing it from messages.

tarilabs commented 5 years ago

The pagination and total-number-of-records indicators were not really clear to me. That is when I realized that, for each messageId, it constructs a "topic" with all the replies as seen at that point in time. I hope I was able to convey what I mean :)

Sure, if you believe there is a better way to download the topics metadata just as seen "as of today", that would be awesome too. I acted in a rush as I preferred to preserve my Y! Groups data as soon as possible; I am not sure what will happen to those endpoints on Monday :) I also think it might be helpful to have a download of the snapshot of each topic at multiple points in time as replies came through.

As I had then downloaded my data, I thought it worthwhile to raise it here ;)

btw thank you for starting this project, it has been very useful!! :)

IgnoredAmbience commented 4 years ago

Superseded by code now in master.