antonizoon closed this issue 9 years ago
It seems to work alright here. Mind installing the latest basc_py4chan/basc_archiver from git, and then trying again?
Is there a specific error you're getting? If updating basc_py4chan/basc_archiver fixes it, I'll package them into new PyPI releases and push them.
Ok, so it actually worked, nothing was wrong. It grabbed the API and HTML, and now that Hiroyuki has learned to post images, some of those as well.
The only issue is that it doesn't track the thread. This is an extremely special case where a closed thread actually does get updated posts.
Maybe it would be good to have a --force-update option, to track a thread even after it has been closed/archived? The archiver will still wait if there are no new changes.
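To make the proposal concrete, here is a minimal sketch of how such a flag could gate the tracking decision. Only the --force-update name comes from the suggestion above; ThreadState, build_parser and should_keep_tracking are hypothetical names made up for illustration, and none of this is basc_archiver's actual code.

```python
import argparse
from dataclasses import dataclass


@dataclass
class ThreadState:
    """Hypothetical snapshot of a thread's status (not a basc_py4chan class)."""
    closed: bool = False
    archived: bool = False
    is_404: bool = False


def build_parser() -> argparse.ArgumentParser:
    """Build a toy CLI with the proposed --force-update flag."""
    parser = argparse.ArgumentParser(prog="archiver-sketch")
    parser.add_argument("thread_url", help="URL of the thread to archive")
    parser.add_argument(
        "--force-update",
        action="store_true",
        help="keep tracking a thread even after it is closed/archived",
    )
    return parser


def should_keep_tracking(state: ThreadState, force_update: bool) -> bool:
    """Decide whether the archiver should poll the thread again."""
    if state.is_404:
        # The thread is gone entirely, so there is nothing left to poll.
        return False
    if state.closed or state.archived:
        # Normally tracking stops here; the flag overrides that.
        return force_update
    return True
```

With the flag set, a closed or archived thread stays in the polling loop and the archiver simply waits between checks as usual; without it, the behaviour is unchanged.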
Ah, that makes sense. We can add that option without any trouble; it'd be good to have.
Thanks. Awesome that you added child thread support! I will also try to work on creating pyFuuka, py420chan and py8chan/vichan, which will be prerequisites for thread archival on those sites.
I remember you had the GUI branch, which worked quite well as a proof of concept. GUIs are important to Windows users, so maybe we should get it active?
In particular, I need pyFuuka to attempt to archive the Fuuka archivers in a slow, months-long, polite process (using the FoolFuuka API). Even the largest is only in the range of terabytes, and we have services like Amazon Glacier and Google Cloud Storage Nearline, which store gigabytes in tape storage for a cent per GB. E.g. I grabbed the 250GB of pruned images from 4plebs, whose admin was nice enough to let me back them up fast before he ran out of space.
Yeah, that GUI branch is still working alright iirc, though I haven't had a good look at it for a while.
The next thing I want to tackle is making the threaded branch the master branch -- it's looking fairly stable these days, so that should be good.
After that, I'll take another look at the GUI branch and make a release with a beta GUI client.
On an unrelated note, would those 250GB of images be a good fit for uploading to Archive.org?
Thanks. Yeah, the threaded design will be important for scraping multiple threads in the GUI, and especially for any kind of web crawling, such as on the Fuuka Archives.
I remember someone uploaded Heinessen's /mlp/ 250GB full images as a tar to the Internet Archive, so it certainly can happen (though I'm going to need to figure out how to use the Internet Archive's S3 API). I actually acquired an LTO-3 automated tape library on the cheap recently, so on-site archival (400/800GB for $17, organized by robot laser sorting arm) will probably happen first.
For some reason I am unable to scrape this thread. It seems to be an extremely special case where the thread is closed to general posting, and only the admins themselves can post (such as stickies).
I think the issue can be solved if we ask the user if they want to scrape a closed/archived thread. Or maybe some other way? Maybe detect if it is a sticky?
http://boards.4chan.org/qa/thread/183913/answers-thread-qa-session-with-hiroyuki-nishimura
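As a rough sketch of the detection idea: in the 4chan JSON API the first post of a thread carries `sticky`, `closed` and `archived` fields set to 1 when they apply, and leaves them out otherwise. The helper names and the sample payload below are made up for illustration (the sample is not data from the thread linked above), and this is not how basc_archiver currently handles it.

```python
import json

# Illustrative payload shaped like a thread's JSON; values are invented.
SAMPLE_THREAD_JSON = json.loads(
    '{"posts": [{"no": 1, "sticky": 1, "closed": 1}, {"no": 2}]}'
)


def thread_flags(thread_json: dict) -> dict:
    """Read the sticky/closed/archived flags from the OP of a thread."""
    op = thread_json["posts"][0]
    # The fields are absent when not set, so default to 0.
    return {name: bool(op.get(name, 0)) for name in ("sticky", "closed", "archived")}


def needs_confirmation(flags: dict) -> bool:
    """True when the thread is closed or archived and the user should be asked."""
    return flags["closed"] or flags["archived"]


flags = thread_flags(SAMPLE_THREAD_JSON)
if needs_confirmation(flags):
    kind = "sticky" if flags["sticky"] else "regular"
    print(f"Thread is closed/archived ({kind}); ask before tracking it.")
```

Checking `sticky` separately would let the archiver treat a closed sticky (where staff can still post) differently from an ordinary closed thread.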