FerrahWolfeh / imageboard-downloader-rs

CLI utility to bulk-download content from popular imageboard sites
MIT License

Program crashes after finding 201,000 posts #2

Closed KylarZeppeli closed 2 years ago

KylarZeppeli commented 2 years ago

The error message:

Found 201000 posts
Error: Connection Error

Caused by:
    0: error decoding response body: expected value at line 1 column 1
    1: expected value at line 1 column 1

I'm curious about the purpose of pre-finding the posts, because it makes the command take a few minutes to start downloading for tags with a large number of posts, and if the count is over 200,000 it apparently just crashes.

FerrahWolfeh commented 2 years ago

Hmm, I'm very curious why it's getting a decode error at such an arbitrary number of posts. Can you post the full command here so I can try to uncover this?

About the "pre-finding" of posts: after I made a giant refactor of the code (the purpose was to separate the actual CLI app from the library that does all the heavy lifting), some functionalities are still not fully adapted to the new code, namely the ability to download posts while still searching pages.

KylarZeppeli commented 2 years ago

What I posted was pretty much the whole thing, but here's the entire output of two commands I just ran:

imageboard_downloader -i "rule34" "Yaoi"
Found 201000 posts
Error: Connection Error

Caused by:
    0: error decoding response body: expected value at line 1 column 1
    1: expected value at line 1 column 1

imageboard_downloader -i "rule34" "Futanari"
Found 201000 posts
Error: Connection Error

Caused by:
    0: error decoding response body: expected value at line 1 column 1
    1: expected value at line 1 column 1

FerrahWolfeh commented 2 years ago

I tested your command, and what seems to happen is that Rule34's API rate-limits you while the program prescans the posts.
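For context on the error signature: "expected value at line 1 column 1" is what a JSON decoder reports when the response body isn't JSON at all, e.g. an HTML rate-limit page. Here's a minimal Python sketch of that failure mode (the crate itself is Rust; `classify_response` and the sample body are made up for illustration):

```python
import json

def classify_response(body: str):
    """Return the parsed JSON, or None when the body is not JSON at all."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as err:
        # An HTML error page fails on its very first character, which is
        # exactly the "line 1 column 1" position in the error above.
        print(f"decode failed at line {err.lineno} column {err.colno}")
        return None

# A rate-limited server often answers with HTML instead of JSON:
html_body = "<html><body>429 Too Many Requests</body></html>"
result = classify_response(html_body)  # prints: decode failed at line 1 column 1
```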

So, in order to make this work without the website limiting the calls, I'll need some time to rebuild part of the queue management. Maybe I can get it done by 0.28 😅

Meanwhile, I'm setting a hard limit of 100 pages per run on the 0.27 track, just so it doesn't mini-DDoS the imageboard.

If your use case demands it, I recommend temporarily using any version before 0.27.0, since those don't have this issue.
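The capped prescan idea looks roughly like this (a hedged Python sketch of the concept, not the crate's Rust code; `fetch_page`, the delay, and the page-size constant are all assumptions):

```python
import time

PAGE_LIMIT = 100  # hard cap like the one mentioned above for the 0.27 track

def prescan(fetch_page, delay_s=0.0):
    """Scan pages up to PAGE_LIMIT, stopping early on an empty page.

    `fetch_page(n)` is a hypothetical callable returning a list of posts
    for page n (an empty list meaning there are no more posts).
    """
    posts = []
    for page in range(1, PAGE_LIMIT + 1):
        batch = fetch_page(page)
        if not batch:
            break  # no more posts on the site, stop early
        posts.extend(batch)
        if delay_s:
            time.sleep(delay_s)  # be polite to the API between pages
    return posts
```

The cap guarantees a bounded number of API calls per run even for tags with hundreds of thousands of posts.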

FerrahWolfeh commented 2 years ago

I did something different. Starting from version 0.27.3, aside from the limit, you can now specify which page to start from. So you can download the first 50 pages, then run again starting from page 51, and so on.

Just a piece of advice: try to download more specific tag combinations. Downloading this many posts at once will probably get your IP banned from the website.

KylarZeppeli commented 2 years ago

I didn't mention it earlier but I actually didn't want to download anywhere near 200,000 posts. I always limit the downloads to around 1,000-10,000 with the download limiter every run. The command I ran to show the error was intentionally minimal without any extra flags. It's just that I couldn't download ANY AMOUNT because it had to scan all the posts with the tag on the website first, and it crashed before the scan finished.

FerrahWolfeh commented 2 years ago

No worries, I'm still in the process of adapting the lib to these situations. Soon I'll finish the part that fixes this problem completely.

FerrahWolfeh commented 2 years ago

Give it a shot on latest main starting from tag 0.27.4

It should now prescan only up to the set limit.

In my tests it seemed to work fine

There's still a slight possibility that it downloads fewer posts than the limit because of a configured blacklist. This is already fixed, but the fix will only land in the 0.28 track.

KylarZeppeli commented 2 years ago

I just found out that on the latest commit of the queue_refactor branch, when using simultaneous downloads (say, 10), almost all of the downloads randomly freeze for a few seconds and don't resume until the ones that aren't frozen finish. What's that about?

FerrahWolfeh commented 2 years ago

Yeah, I noticed that too. It might be something related to tokio or I/O buffering(?). I'll try spawning a thread per download and see what happens.
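The "one worker per download" idea, sketched with plain threads so one slow transfer can't stall the rest (an illustrative Python sketch, not the crate's tokio code; every name below is invented):

```python
import queue
import threading

def run_downloads(urls, worker_count, download_fn):
    """Drain `urls` with `worker_count` threads, each calling `download_fn(url)`.

    Each worker pulls its next job independently, so a stalled download
    only occupies one thread instead of blocking the whole batch.
    """
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # no jobs left, worker exits
            data = download_fn(url)
            with lock:
                results.append((url, data))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```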

FerrahWolfeh commented 2 years ago

Give it a try on latest queue_refactor.

I sort of fixed the stuttering problem (at the expense of some more RAM usage).

The problem is somehow still there, just less noticeable and less frequent.

FerrahWolfeh commented 2 years ago

Also, there's a possibility that the website is limiting the number of concurrent connections...

E621 downloads fine with up to ~30 simultaneous downloads.

On Rule34, downloads start to freeze at around 7-10 simultaneous downloads.

KylarZeppeli commented 2 years ago

Are you saying we can't get around that concurrent connection limit? What's up with the stuttering?

FerrahWolfeh commented 2 years ago

I've run some versions of the download algorithm through a profiler to see if something was up.

What I saw was that even though the download was "stuck", the program was still running the thread normally; it just wasn't receiving any data from Rule34. That's pretty similar to how hard Gelbooru limits the download speed when I'm downloading with -d 10.

This sort of thing didn't happen even once when downloading from E621 or danbooru when using -d 20. The download speed went down significantly, sure, but none of the download threads stopped receiving data.

Since the code responsible for downloading is the same for all imageboards, the only hypothesis left is that they rate-limit download requests from the same IP address.
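If the per-IP hypothesis holds, the practical workaround is to cap concurrency client-side below whatever threshold the site enforces. A hedged Python sketch of such a gate (`ConnectionGate` is an invented name; a Rust/tokio implementation would use an async semaphore instead):

```python
import threading
import time

class ConnectionGate:
    """Cap in-flight requests to one host so the server's per-IP
    connection limit is never exceeded from the client side."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest observed concurrency, for verification

    def __enter__(self):
        self._sem.acquire()  # blocks while max_concurrent slots are taken
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._active -= 1
        self._sem.release()

# Demo: 20 "downloads" squeezed through a gate of 5.
gate = ConnectionGate(5)

def fake_request():
    with gate:
        time.sleep(0.01)  # stand-in for network I/O

threads = [threading.Thread(target=fake_request) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The semaphore guarantees at most 5 requests are ever in flight, regardless of how many download tasks exist.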

KylarZeppeli commented 2 years ago

Everything appears to be working perfectly now, is there anything else you have left to do for the queue_refactor branch?

FerrahWolfeh commented 2 years ago

Just finishing the placement of some debug messages and moving all text printing out of the lib; then 0.28 should finally be released.

FerrahWolfeh commented 2 years ago

I'm closing this issue now since everything discussed here has already been fixed