bbepis / Hayden

Ultra-low resource 4chan/altchan thread and board archiver
MIT License

Slow archiving after starting again #8

Closed. Liz-chan closed this issue 1 year ago

Liz-chan commented 2 years ago

Hello, I have recently been trying to archive 2chen's /tv/ board, and had to restart the program since it got stuck while archiving a thread, and now it's doing something that looks like this:

[9/9/2022 3:31:01 AM] [Image] [10/22043]
[9/9/2022 3:31:48 AM] [Image] [20/22033]
[9/9/2022 3:32:20 AM] [Image] [30/22023]

Which is taking a really long time. Is there a way to skip this and just continue archiving the board? At the rate it's going now, it's not going to be finished for at least a day.

Liz-chan commented 2 years ago

It seems that it's going a lot faster now, so not sure why it was so slow before. I'll still leave the issue up just in case you'd like to give an explanation for what the cache is for.

Are images not downloaded immediately or is it something different?

bbepis commented 1 year ago

Sorry, didn't see this in the flurry of GitHub notifications I've received. What's happened is that Hayden saves any URLs it encounters to a temporary cache for the exact scenario you're talking about, where it's either closed manually or terminates unexpectedly.

The thought process is that due to how Hayden works internally (where it saves thread information directly to the database, but not image information), if Hayden closes without recording the pending URLs somewhere then on next launch it's going to see the same thread as archived, even if not all images have been downloaded.
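To illustrate the idea (this is not Hayden's actual code, just a minimal Python sketch with SQLite standing in for the queue store), the pending URLs get written to disk before the thread counts as archived, and each one is removed only once its download finishes. The table name `pending` and the helper names are made up for the example.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the pending-download queue. Hypothetical schema."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        "url TEXT PRIMARY KEY, thread_id INTEGER NOT NULL)"
    )
    return db


def enqueue(db, thread_id, urls):
    """Record image URLs so they survive a crash or manual shutdown."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR IGNORE INTO pending (url, thread_id) VALUES (?, ?)",
            [(url, thread_id) for url in urls],
        )


def mark_done(db, url):
    """Remove a URL from the queue once its file has been saved."""
    with db:
        db.execute("DELETE FROM pending WHERE url = ?", (url,))


def pending(db):
    """Return the URLs that still need downloading, e.g. on next launch."""
    return [row[0] for row in db.execute("SELECT url FROM pending ORDER BY url")]
```

On startup, anything still returned by `pending()` is work left over from the previous run, which is why a restart with a large queue spends a while on `[Image]` lines before it gets back to new threads.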

If you wish to drop all queued images even if it means Hayden won't detect them again, delete the hayden\imagequeue.db file

I've drastically changed how things work under the hood to the point where the cache isn't needed anymore, however I haven't had the time (or even energy, as I've caught covid) to clean up those changes and push them. Includes a database schema change so it's going to require some migration scripts too

Liz-chan commented 1 year ago

I think I might have to delete it, since I have over 300k images in the queue right now. Every time I start Hayden and try to quit it, it gets stuck and doesn't quit properly, so it can't update the queue.

bbepis commented 1 year ago

To be honest it sounds like it's probably getting hammered trying to update/delete 300k rows in the imagequeue database. One of several reasons why I don't use LiteDB anymore
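For a rough idea of what a cheaper queue update can look like, here is a small Python sketch that removes many finished rows inside a single transaction instead of committing row by row. It uses SQLite purely as a stand-in; the table layout and function names are assumptions for the example and say nothing about how LiteDB or Hayden handle this internally.

```python
import sqlite3


def make_queue(n):
    """Build an in-memory queue with n placeholder rows (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pending (id INTEGER PRIMARY KEY, url TEXT)")
    with db:
        db.executemany(
            "INSERT INTO pending (id, url) VALUES (?, ?)",
            ((i, f"https://example.invalid/{i}.jpg") for i in range(n)),
        )
    return db


def remove_finished(db, finished_ids):
    """Delete every finished row in one transaction (one commit in total)."""
    with db:
        db.executemany(
            "DELETE FROM pending WHERE id = ?",
            ((i,) for i in finished_ids),
        )


def count(db):
    """Number of rows still waiting in the queue."""
    return db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]
```

Batching like this keeps the number of commits constant no matter how many rows are touched, which tends to matter once the queue reaches hundreds of thousands of entries.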

Liz-chan commented 1 year ago

There's another backup tool I use for Discord that saves chat logs into HTML files, and you can optionally save the media in the chat as well, if you want it to all be available locally. Maybe something like that would be good for Hayden? Having a universal format for threads in an HTML file despite the content, and the media linking to the files that you have locally, so that you don't have to rely on a database. It makes it a lot more portable if each thread is contained in its own HTML file along with the media files. Another thing I noticed is that Hayden doesn't dedupe files, at least for 2chen, so if you were to go the HTML files route, maybe it'd be good to have them all in one folder/path like how the sites do it, since it does do native deduping with the filename being the SHA1 hash.

bbepis commented 1 year ago

> There's another backup tool I use for Discord that saves chat logs into HTML files, and you can optionally save the media in the chat as well, if you want it to all be available locally. Maybe something like that would be good for Hayden? Having a universal format for threads in an HTML file despite the content, and the media linking to the files that you have locally, so that you don't have to rely on a database. It makes it a lot more portable if each thread is contained in its own HTML file along with the media files.

Hayden supports a Filesystem backend which saves threads (as a .json and image files) to individual folders. I picked .json specifically because it's machine-readable, consistent and I can transform it to whatever I wish later.

I do understand the convenience in having a .html version available too (I modified my own personal DiscordChatExporter instance to spit out both json and html in the same run), so I might support it in the future. However, with the .json available, if you were so inclined and had enough motivation you could probably feed the json into a script that just spits out repeating blocks of HTML for each post in the json blob.
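Something along these lines would probably do as a starting point. It's a Python sketch only, and the field names (`posts`, `id`, `author`, `comment`, `file`) are guesses for the sake of the example, not the actual layout of the .json the Filesystem backend writes, so they'd need adjusting to match the real files.

```python
import html
import json


def thread_to_html(thread_json):
    """Render one thread's JSON as a simple standalone HTML page.

    Assumes a hypothetical layout: {"posts": [{"id", "author", "comment", "file"}]}.
    """
    thread = json.loads(thread_json)
    blocks = []
    for post in thread["posts"]:
        # Escape everything that came from the board so it can't inject markup.
        author = html.escape(str(post.get("author", "Anonymous")))
        body = html.escape(str(post.get("comment", "")))
        block = f'<div class="post" id="p{int(post["id"])}"><b>{author}</b><p>{body}</p>'
        if post.get("file"):
            # Link to the media file stored next to the HTML file.
            src = html.escape(post["file"], quote=True)
            block += f'<a href="{src}"><img src="{src}" loading="lazy"></a>'
        block += "</div>"
        blocks.append(block)
    return "<!DOCTYPE html>\n<html><body>\n" + "\n".join(blocks) + "\n</body></html>"
```

Writing the result next to the thread's image files keeps each thread folder self-contained, since the `src` paths are relative.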

> Another thing I noticed is that Hayden doesn't dedupe files, at least for 2chen, so if you were to go the HTML files route, maybe it'd be good to have them all in one folder/path like how the sites do it, since it does do native deduping with the filename being the SHA1 hash.

If I'm guessing correctly, it's something that I've covered a bit on the README: https://github.com/bbepis/Hayden/blob/3f78bceaddae0eadbbd498fbe312d1dc435c5aee/README.md#1

Hayden internally uses SHA256 so the only hashes it can properly trust are from imageboards that also supply SHA256 hashes. (Ideally I'd like to be somewhat paranoid about collisions; a lot of foolfuuka/asagi imageboards don't have 100% of images downloaded because people upload images with MD5 collisions, and as a result asagi won't redownload a hash it's already seen)

To get around it, Hayden downloads each file even if it already knows the MD5. Once the file is in memory, it hashes it to see whether it already knows the SHA256. If it does, the file is discarded; otherwise it's saved. As a result it looks like it doesn't dedupe, because it has to download every image.
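In rough terms the check looks like the Python sketch below: hash the bytes after downloading, and keep the file only if that SHA256 hasn't been seen before. The function name, the in-memory `known_hashes` set and the hash-as-filename layout are all assumptions for illustration, not how Hayden actually stores things.

```python
import hashlib
from pathlib import Path


def store_if_new(data: bytes, known_hashes: set, media_dir: Path, ext: str) -> bool:
    """Save already-downloaded bytes only if their SHA256 is new.

    Returns True if the file was written, False if it was a duplicate.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest in known_hashes:
        # Same content already stored: discard the download.
        return False
    media_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical layout: the file is named after its own hash.
    (media_dir / f"{digest}{ext}").write_bytes(data)
    known_hashes.add(digest)
    return True
```

The download itself still happens every time, so the bandwidth cost stays the same; only the disk usage is deduplicated.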

You can disable this behaviour with a config setting in my changes which I only have locally, which makes me look like a tool for getting this far in this reply without checking.

Condensing all of the images together into one folder for the flat-file backend sounds interesting, but it harms portability. Might make an option for it in the future


If you had any other questions or need help, feel free to send me a message on Bepis#5391 if you'd prefer to get some faster responses over IM

bbepis commented 1 year ago

Should be fixed, now using SQLite for state storage by default instead of LiteDB.

State store file (imagequeue.db) might need to be deleted upon upgrade