Trouble scraping Hiroyuki's Q&A Sticky Thread

bibanon / BASC-Archiver

Python-based Imageboard (4chan) complete thread archiver.

https://pypi.python.org/pypi/BASC-Archiver/


Trouble scraping Hiroyuki's Q&A Sticky Thread #19

antonizoon closed this issue 9 years ago

antonizoon commented 9 years ago

For some reason I am unable to scrape this thread. It seems to be an extremely special case: the thread is closed to general posting, and only the admins themselves can post (as with stickies).

I think the issue can be solved if we ask the user if they want to scrape a closed/archived thread. Or maybe some other way? Maybe detect if it is a sticky?

http://boards.4chan.org/qa/thread/183913/answers-thread-qa-session-with-hiroyuki-nishimura
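One way to detect the sticky case would be to look at the flags on the thread's opening post. This is a minimal sketch, assuming the read-only 4chan JSON API marks the opening post with `sticky`, `closed`, and `archived` fields (set to 1 when true, absent otherwise). `thread_flags` is a hypothetical helper, not a function from basc_py4chan, and the sample dict is hand-written, not real API output:

```python
def thread_flags(thread_json):
    """Return the sticky/closed/archived flags of a thread's opening post.

    `thread_json` is assumed to be the decoded thread JSON, with the
    opening post as the first entry of its "posts" list.
    """
    op = thread_json["posts"][0]
    # Assumption: a flag that is absent from the JSON means "not set".
    return {name: bool(op.get(name, 0)) for name in ("sticky", "closed", "archived")}


# Hand-written sample shaped like a closed sticky (not real API output).
sample = {"posts": [{"no": 183913, "sticky": 1, "closed": 1}]}
flags = thread_flags(sample)
```

A thread that is both sticky and closed could then be treated as "may still update" rather than "finished".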

DanielOaks commented 9 years ago

It seems to work alright here. Mind installing the latest basc_py4chan/basc_archiver from git, and then trying again?

Is there a specific error you're getting? If updating basc_py4chan/basc_archiver fixes it I'll package them into new pypi releases and push them.

antonizoon commented 9 years ago

OK, so it actually worked; nothing was wrong. It grabbed the API data and the HTML and, now that Hiroyuki has learned to post images, some images as well.

The only issue is that it doesn't keep tracking the thread. This is an extremely special case where a closed thread actually does still receive new posts.

Maybe it would be good to have a --force-update option, to track a thread even after it has been closed/archived? The archiver will still wait if there are no new changes.
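To make the proposal concrete, here is a rough sketch of how such an option might change the polling decision. It is not BASC-Archiver's actual code: `keep_tracking`, `poll_thread`, and the `fetch` callable are hypothetical names, and the flag is modeled as a plain `force_update` argument:

```python
import time


def keep_tracking(closed, archived, force_update=False):
    """Decide whether a thread should be polled again."""
    return force_update or not (closed or archived)


def poll_thread(fetch, force_update=False, interval=0.0, max_polls=10):
    """Poll `fetch()` until the thread should no longer be tracked.

    `fetch` is a hypothetical callable returning (post_count, closed, archived).
    Returns the post counts recorded each time new posts appeared.
    """
    seen = []
    last_count = -1
    for _ in range(max_polls):
        count, closed, archived = fetch()
        if count > last_count:  # only record when there are new changes
            seen.append(count)
            last_count = count
        if not keep_tracking(closed, archived, force_update):
            break
        time.sleep(interval)  # the archiver still waits between polls
    return seen
```

Without `force_update`, the loop stops at the first poll that reports a closed thread; with it, polling continues and later posts are picked up.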

DanielOaks commented 9 years ago

Ah, that makes sense. We can add that option without any trouble; it'd be good to have.

antonizoon commented 9 years ago

Thanks. Awesome that you added child thread support! I will also try to work on creating pyFuuka, py420chan and py8chan/vichan, which will be prerequisites for thread archival on those sites.

I remember you had the GUI branch, which worked quite well as a proof of concept. GUIs are important to Windows users, so maybe we should get it active?

In particular, I need pyFuuka to attempt to archive the Fuuka archivers in a slow, months-long, polite process (using the FoolFuuka API). Even the largest is only in the range of terabytes, and we have services like Amazon Glacier and Google Cloud Storage Nearline, which store data in tape storage for a cent per GB. For example, I grabbed the 250GB of pruned images from 4plebs, who was nice enough to let me back them up quickly before he ran out of space.
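The "slow, polite" part could be as simple as enforcing a minimum gap between requests. Below is a minimal sketch of that idea. `PoliteThrottle` is a hypothetical helper, not part of any existing library, and it makes no assumptions about the FoolFuuka API itself; the caller would invoke `wait()` before each request:

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the throttle can be tested
        self._sleep = sleep
        self._last = None    # time of the previous request, if any

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawl spread over months would pick a large `min_interval` and persist its progress between runs, which this sketch leaves out.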

DanielOaks commented 9 years ago

Yeah, that GUI branch is still working alright iirc, though I haven't had a good look at it for a while.

The next thing I want to tackle is making the threaded branch the master branch -- it's looking fairly stable these days, so that should be good. After that, I'll take another look at the GUI branch and make a release with a beta GUI client.

On an unrelated note, would those 250GB of images be a good fit for uploading to Archive.org?

antonizoon commented 9 years ago

Thanks. Yeah, the threaded design will be important for scraping multiple threads in the GUI, and especially for any kind of web crawling, such as on the Fuuka archives.

I remember someone uploaded Heinessen's 250GB of full /mlp/ images as a tar to the Internet Archive, so it certainly can happen (though I'm going to need to figure out how to use the Internet Archive's S3 API). I actually acquired an LTO3 automated tape library on the cheap recently, so on-site archival (400/800GB for $17, organized by a robot laser sorting arm) will probably happen first.
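For the S3 API part, the sketch below only builds the URL and headers for a single-file upload and sends nothing. It assumes, from my reading of the Internet Archive's S3-like API documentation, that uploads are HTTP PUTs to `s3.us.archive.org/<identifier>/<filename>` with a `LOW accesskey:secret` authorization header; the endpoint and header names should be checked against the current docs. `build_ia_upload_request`, the item identifier, and the keys are all hypothetical placeholders:

```python
from urllib.parse import quote

# Assumed endpoint of the Internet Archive's S3-like API (verify before use).
IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_ia_upload_request(identifier, filename, access_key, secret_key):
    """Return (url, headers) for an HTTP PUT of one file into an item."""
    url = "{}/{}/{}".format(IA_S3_ENDPOINT, quote(identifier), quote(filename))
    headers = {
        # Assumed auth scheme: "LOW <access key>:<secret key>".
        "authorization": "LOW {}:{}".format(access_key, secret_key),
        # Assumed header asking the service to create the item if missing.
        "x-amz-auto-make-bucket": "1",
        # Assumed metadata header; "data" is a guess at a suitable mediatype.
        "x-archive-meta-mediatype": "data",
    }
    return url, headers


# Placeholder values only; nothing is uploaded here.
url, headers = build_ia_upload_request(
    "example-item", "images part1.tar", "ACCESS", "SECRET"
)
```

The actual transfer would stream the tar as the PUT body with whatever HTTP client is at hand; for a 250GB file, splitting it into smaller parts first is probably wise.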