Closed (pirate closed this issue 3 years ago)
+1
this is being rewritten
I checked the ArchiveTeam wiki, and while there is some info about this project (Yahoo Answers Grab), there is nothing whatsoever about archiving Chiebukuro (Yahoo Answers Japan). I believe it is also a very important part of internet history, specifically Japanese internet culture, as well as an invaluable source of information about Japan, given that its userbase maintains a pretty good standard of answers. I want to know whether this project (yahooanswers-grab) can be used for Chiebukuro as well, or whether we can do something about it, given that the ArchiveTeam entry for Chiebukuro states that there is no ongoing project for it.
Are regional Yahoo Answers threads, like those under "tw.answers.yahoo.com", covered by this project? It seems to have the same structure as answers.yahoo.com. A thread can still be retrieved even if I take the "tw" away (but it then shows the English interface). I'm worried about whether this project will still index threads under the "tw" region. Thanks for the work!
@SODAIS69 Yes, we will be getting the different regions. In the previous project these were 'ar', 'au', 'br', 'ca', 'fr', 'de', 'in', 'id', 'it', 'malaysia', 'mx', 'nz', 'ph', 'qc', 'sg', 'tw', 'es', 'th', 'uk', 'vn', 'espanol'
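To make the region list concrete, here is a minimal sketch in Python that expands it into hostnames. It assumes every prefix maps to `<prefix>.answers.yahoo.com` the same way "tw" does; that pattern is only confirmed in this thread for "tw", and the helper name `regional_hosts` is hypothetical, not part of the grab code.

```python
# Region prefixes quoted in the comment above (from the previous project).
REGIONS = [
    "ar", "au", "br", "ca", "fr", "de", "in", "id", "it", "malaysia",
    "mx", "nz", "ph", "qc", "sg", "tw", "es", "th", "uk", "vn", "espanol",
]

BASE_HOST = "answers.yahoo.com"


def regional_hosts(regions=REGIONS, base=BASE_HOST):
    """Return the base host followed by one '<region>.<base>' host per prefix.

    Assumption: every region uses the same subdomain pattern as
    'tw.answers.yahoo.com'.
    """
    return [base] + [f"{region}.{base}" for region in regions]


if __name__ == "__main__":
    for host in regional_hosts():
        print(host)
```

Keeping the prefixes in one list means a region that turns out to use a different pattern can be dropped or special-cased in one place.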
@3LeggedCat is chiebukuro.yahoo.co.jp going away? That website does seem to have a very different structure, so it'll likely not be included in this project if it's not going away.
I'd be happy to consider it for a future project.
If ArchiveTeam considers Chiebukuro for its next project, that would be awesome! For now there is no notice of it going away, but I remember how rushed things were back in 2019 when the takedown of Japan Geocities was announced: we got 10 years to prepare for that one and still couldn't act in time to archive it fully. I really don't want a repeat of that.
Regarding Chiebukuro: if the site is known to be dying shortly, please add it to our Deathwatch page: https://wiki.archiveteam.org/index.php/Deathwatch
As pirate says at the start, the best option is to run the warrior-dockerfile and leave it set to ArchiveTeam's choice. I will move it to this project once the code is ready :)
We're nearly ready to go, so I'm closing this issue. Any further discussion is best on #noanswers
With the recent news about Yahoo Answers shutting down soon, it seems like it would make sense to revive this project and get people running it on their warriors.
https://www.reddit.com/r/Archiveteam/comments/mkos7o/yahoo_answers_to_shut_down_may_4_2021/
https://wiki.archiveteam.org/index.php/Yahoo!_Answers
What are the blockers to doing that? Did people run into issues last time this was used that need to be resolved?
In the meantime, for people arriving here wanting to help, here's how you can run an ArchiveWarrior in 30 seconds that will automatically help once this scraper is ready to go.
Use the dashboard at http://127.0.0.1:8001 to set up your worker ID and an optional user/pass to lock it down. Set the active project to "auto" so that it will start working on Yahoo automatically.
(These instructions are elsewhere too; I'm just mirroring them here to encourage people arriving from forum links / Google to contribute to the archiving effort by showing how easy it is.)
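If you want to confirm the warrior is actually up before opening the browser, a quick check along these lines may help. This is only a sketch using Python's standard library: it assumes the dashboard is served over plain HTTP at the address mentioned above, and `dashboard_is_up` is a hypothetical helper, not something the warrior ships with.

```python
import urllib.error
import urllib.request

# Dashboard address given in the instructions above.
DASHBOARD_URL = "http://127.0.0.1:8001"


def dashboard_is_up(url=DASHBOARD_URL, timeout=3.0):
    """Return True if something answers HTTP at `url`, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status
        # (for instance if a user/pass has been set on the dashboard).
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: nothing is listening.
        return False


if __name__ == "__main__":
    state = "reachable" if dashboard_is_up() else "not reachable"
    print(f"Warrior dashboard at {DASHBOARD_URL} is {state}")
```

A False result only means nothing answered on that port; it does not say why, so check that the container is running and that port 8001 is published.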