ArchiveTeam / yahooanswers-grab

Saving all questions and answers from Yahoo! Answers.
The Unlicense

Can we revive this warrior project? #3

Closed: pirate closed this issue 3 years ago

pirate commented 3 years ago

With the recent news about Yahoo Answers shutting down soon, it seems like it would make sense to revive this project and get people running it on their warriors.

https://www.reddit.com/r/Archiveteam/comments/mkos7o/yahoo_answers_to_shut_down_may_4_2021/

https://wiki.archiveteam.org/index.php/Yahoo!_Answers

What are the blockers to doing that? Did people run into issues last time this was used that need to be resolved?


In the meantime, for people arriving here wanting to help, here's how you can start an ArchiveTeam Warrior in about 30 seconds. It will automatically start helping once this scraper is ready to go.

docker run -p 8001:8001 archiveteam/warrior-dockerfile

Use the dashboard http://127.0.0.1:8001 to set up your worker id and an optional user/pass to lock it down. Set the active project to "auto" so that it will start working on yahoo automatically.

(These instructions are elsewhere too, I'm just mirroring them here to encourage people arriving here from forum links / google to contribute to the archiving effort by showing how easy it is.)

JakoDel commented 3 years ago

+1

ajayyy commented 3 years ago

See https://webirc.hackint.org/#irc://irc.hackint.org/#noanswers

Arkiver2 commented 3 years ago

this is being rewritten

3LeggedCat commented 3 years ago

I checked the ArchiveTeam wiki, and while there is some info about this project (Yahoo Answers Grab), there is nothing whatsoever about archiving Chiebukuro (Yahoo Answers Japan). I believe it is also a very important part of internet history, specifically Japanese internet culture, in addition to being an invaluable source of information about Japan, given that the userbase maintains a pretty good standard of answers on the site. I want to know if this project (yahooanswers-grab) can be used for Chiebukuro as well, or if we can do something about it, given that the ArchiveTeam entry for Chiebukuro states that there is no ongoing project for it.

SODAISpod commented 3 years ago

Are the regional Yahoo Answers sites, like "tw.answers.yahoo.com", covered by this project? It seems to have the same structure as answers.yahoo.com. Threads can still be retrieved even if I take "tw" away (but they show up with the English interface). I'm worried about whether this project will still index threads under the "tw" region. Thanks for the work!

Arkiver2 commented 3 years ago

@SODAIS69 yes, we will be getting the different regions. from the previous project these were 'ar', 'au', 'br', 'ca', 'fr', 'de', 'in', 'id', 'it', 'malaysia', 'mx', 'nz', 'ph', 'qc', 'sg', 'tw', 'es', 'th', 'uk', 'vn', 'espanol'
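
As a rough illustration of what those region codes could translate to, here is a minimal sketch that expands them into hostnames. It assumes every region follows the `<region>.answers.yahoo.com` pattern seen with tw.answers.yahoo.com above; only the "tw" host is actually mentioned in this thread, so the other hostnames are unverified, and `region_hosts` is a hypothetical helper, not something from the grab scripts.

```python
# Region codes as listed from the previous project.
REGIONS = [
    'ar', 'au', 'br', 'ca', 'fr', 'de', 'in', 'id', 'it', 'malaysia',
    'mx', 'nz', 'ph', 'qc', 'sg', 'tw', 'es', 'th', 'uk', 'vn', 'espanol',
]

BASE_HOST = "answers.yahoo.com"


def region_hosts(regions=REGIONS, base=BASE_HOST):
    """Return the base host followed by one assumed host per region.

    Assumption: each region lives at "<region>.<base>", as with
    tw.answers.yahoo.com. This is not confirmed for every code.
    """
    return [base] + [f"{region}.{base}" for region in regions]


if __name__ == "__main__":
    for host in region_hosts():
        print(host)
```

Keeping the unprefixed host first in the list reflects that answers.yahoo.com itself is the main target, with the regional hosts as additions.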

Arkiver2 commented 3 years ago

@3LeggedCat is chiebukuro.yahoo.co.jp going away? That website does seem to have a very different structure, so it'll likely not be included in this project if it's not going away.

I'd be happy to consider it for a future project.

3LeggedCat commented 3 years ago

If ArchiveTeam considers Chiebukuro for a future project, that would be awesome! For now there is no notice of it going away, but I remember how rushed things were back in 2019 when they announced the takedown of Japan Geocities. We had 10 years to prepare for that one and still couldn't act in time to archive it fully. I really don't want a repeat of that.

TomGlass commented 3 years ago

Regarding Chiebukuro: if the site is known to be dying shortly, please add it to our Deathwatch page: https://wiki.archiveteam.org/index.php/Deathwatch

As pirate says at the start, the best option is to run the warrior-dockerfile and leave it on ArchiveTeam's choice. I will move it to this project once the code is ready :)

TomGlass commented 3 years ago

We're nearly ready to go so closing this issue. Any further discussion is best on #noanswers