elvis972602 / Kemono-scraper

A simple downloader to download media from kemono.party
MIT License
184 stars 10 forks

Proxy errors #9

Closed 1223334444abc closed 1 year ago

1223334444abc commented 1 year ago

There seem to be some errors in the proxy handling. When I specify a proxy server, for example --proxy http://127.0.0.1:1080, I still get unreachable-host errors. With --proxy set, I can see a connection being established to kemono.party, but the program still reports various connection errors.

But when I routed all network traffic through a virtual network adapter, the errors no longer occurred and all downloads proceeded normally. The virtual network adapter and the proxy server use the same upstream server. I suspect that some network connections are not covered by the proxy setting.

Here are some error messages I have encountered:

Error getting favorites: Get https://kemono.party/api/favorites?type=user: dial tcp 199.59.148.209:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

HTTP: EOF (I forgot the exact message)

1223334444abc commented 1 year ago

I have tried HTTP, HTTPS, and Socks5 proxies, but none of them have been able to solve the problem.

1223334444abc commented 1 year ago

Most of the time the errors occur while fetching the favorites list, but sometimes image downloads start and then show no transfer speed at all.

elvis972602 commented 1 year ago

It seems that some requests are not covered by the proxy. I will try to fix it.
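One common way for requests to slip past a proxy in Go is for part of the code to use a different client or transport than the one configured with the proxy. The sketch below is not taken from Kemono-scraper; `newProxiedClient` and `proxyFor` are hypothetical helpers that show a single client whose transport applies the proxy to every request, plus a way to check which proxy a given request would use.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newProxiedClient returns a client whose transport routes every request
// through the given proxy. Hypothetical helper, not the project's code.
func newProxiedClient(proxy string) (*http.Client, error) {
	proxyURL, err := url.Parse(proxy)
	if err != nil {
		return nil, fmt.Errorf("parse proxy URL: %w", err)
	}
	transport := &http.Transport{
		// http.ProxyURL returns the same proxy for every request on this transport.
		Proxy: http.ProxyURL(proxyURL),
	}
	return &http.Client{Transport: transport}, nil
}

// proxyFor reports which proxy the client's transport selects for target,
// or "" when the client has no explicitly configured *http.Transport proxy.
func proxyFor(client *http.Client, target string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	transport, ok := client.Transport.(*http.Transport)
	if !ok || transport.Proxy == nil {
		return "", nil
	}
	u, err := transport.Proxy(req)
	if err != nil || u == nil {
		return "", err
	}
	return u.String(), nil
}

func main() {
	client, err := newProxiedClient("http://127.0.0.1:1080")
	if err != nil {
		panic(err)
	}
	via, err := proxyFor(client, "https://kemono.party/api/favorites?type=user")
	if err != nil {
		panic(err)
	}
	fmt.Println("favorites request would go via:", via)
}
```

Sharing this one client between the API calls and the file downloads keeps both behind the same proxy.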

1223334444abc commented 1 year ago

unexpected EOF download post error: failed to write file: unexpected EOF download post: xxxxxxx

Here is a new problem. When a file download hits the error above, it is not retried automatically and is simply skipped (leaving an incomplete .tmp file behind).

And there is another small issue:

09m27.54s Download 70.3% 800 B/s 634.08 KB 16.png

(Most files download at 200-300 KB/s.) Because of network problems, some downloads can stall for a long time. Could there be a mechanism to handle this? For example, a download timeout based on average download speed and file size, or an automatic retry after the speed has stayed below 1 KB/s for some number of seconds.
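The low-speed cutoff suggested above could be sketched as a reader wrapper that measures throughput over a fixed window and fails when the average drops under a minimum. This is only an illustration under assumptions: `stallReader`, `errStalled`, the 5-second window and the `simulate` helper are all hypothetical, and the check only fires when a read returns, so a connection that blocks completely would still need a deadline or context cancellation.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"time"
)

// errStalled signals that throughput stayed under the configured minimum.
var errStalled = errors.New("download stalled: throughput below minimum")

// stallReader wraps a reader and fails once a full measurement window
// ends with an average rate below minBytesPerSec. Hypothetical type.
type stallReader struct {
	r              io.Reader
	minBytesPerSec float64
	window         time.Duration
	now            func() time.Time // injectable clock, time.Now by default
	windowStart    time.Time
	windowBytes    int64
}

func newStallReader(r io.Reader, minBytesPerSec float64, window time.Duration) *stallReader {
	return &stallReader{r: r, minBytesPerSec: minBytesPerSec, window: window, now: time.Now}
}

func (s *stallReader) Read(p []byte) (int, error) {
	if s.windowStart.IsZero() {
		s.windowStart = s.now()
	}
	n, err := s.r.Read(p)
	s.windowBytes += int64(n)
	if elapsed := s.now().Sub(s.windowStart); elapsed >= s.window {
		rate := float64(s.windowBytes) / elapsed.Seconds()
		if rate < s.minBytesPerSec && err == nil {
			return n, errStalled // caller can delete the partial file and retry
		}
		s.windowStart, s.windowBytes = s.now(), 0 // start a new window
	}
	return n, err
}

// simulate reads total bytes in chunk-sized reads against a fake clock that
// advances one second per read, so chunk is effectively the bytes per second.
func simulate(total, chunk int, minBytesPerSec float64) error {
	clock := time.Unix(0, 0)
	sr := newStallReader(strings.NewReader(strings.Repeat("x", total)), minBytesPerSec, 5*time.Second)
	sr.now = func() time.Time { return clock }
	buf := make([]byte, chunk)
	for {
		_, err := sr.Read(buf)
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		clock = clock.Add(time.Second)
	}
}

func main() {
	fmt.Println("800 B/s with a 1 KB/s minimum:", simulate(10000, 800, 1024))
	fmt.Println("2000 B/s with a 1 KB/s minimum:", simulate(20000, 2000, 1024))
}
```

In a real download the wrapper would sit around the response body, and a stall error would be handled like any other retryable failure.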

1223334444abc commented 1 year ago

Then there are some functional suggestions: (Taking this page as an example:)

  1. Save the text of the "Content" section to a txt file. Some pages contain key information there, such as a Google Drive download link for the complete version.

  2. Save the files in the "Downloads" section with their original file names. At present, they all seem to be replaced with serial numbers.

  3. Please add the name of the source website, such as Fanbox or Fantia, as a prefix or suffix. For example: [Fanbox]xxx

elvis972602 commented 1 year ago

Thank you for your advice! I will try to add some of these features. Also, does the unexpected EOF happen on specific posts, or is it random?

1223334444abc commented 1 year ago

EOF errors occur randomly. Usually it doesn't appear when I download it again. The international network connection is quite unstable, and I need to access it all through a proxy server.

elvis972602 commented 1 year ago

I understand. I will try to add automatic re-download and cleanup of the temporary files.
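A minimal sketch of that retry-and-cleanup behaviour, assuming each download is first written to a `.tmp` path and renamed on success. `downloadWithRetry`, its parameters and the `demo` helper are hypothetical and do not reflect the project's actual code.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"
)

// downloadWithRetry calls fetch up to maxAttempts times. fetch must write the
// payload to tmpPath. A failed attempt's partial file is removed; a successful
// attempt is renamed to dest. Hypothetical helper.
func downloadWithRetry(dest string, maxAttempts int, backoff time.Duration, fetch func(tmpPath string) error) error {
	tmp := dest + ".tmp"
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = fetch(tmp)
		if lastErr == nil {
			return os.Rename(tmp, dest)
		}
		os.Remove(tmp) // drop the incomplete file before the next attempt
		if attempt < maxAttempts {
			time.Sleep(backoff * time.Duration(attempt)) // linear backoff
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

// demo simulates two "unexpected EOF" failures followed by a success and
// reports the attempt count and whether a .tmp file was left behind.
func demo() (attempts int, tmpLeft bool, err error) {
	dir, err := os.MkdirTemp("", "retry-demo")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)

	dest := filepath.Join(dir, "16.png")
	err = downloadWithRetry(dest, 5, 0, func(tmp string) error {
		attempts++
		if attempts < 3 {
			os.WriteFile(tmp, []byte("partial"), 0o644) // simulated partial write
			return io.ErrUnexpectedEOF
		}
		return os.WriteFile(tmp, []byte("complete file"), 0o644)
	})
	_, statErr := os.Stat(dest + ".tmp")
	return attempts, statErr == nil, err
}

func main() {
	attempts, tmpLeft, err := demo()
	fmt.Println("attempts:", attempts, "tmp left behind:", tmpLeft, "error:", err)
}
```

Because the failure is handled per file, the remaining files of the same post can continue downloading after one file exhausts its retries.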

elvis972602 commented 1 year ago

The file name will be replaced with the file's hash value to confirm more quickly that the file has been downloaded and is complete, as it appears that the file name on the site may be changed.
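As a rough illustration of why a hash-based name makes the completeness check cheap: if the stored name is the content digest, recomputing the digest of the local file is enough to tell whether it is intact. The sketch assumes SHA-256 and a `<hash><extension>` naming scheme; both are assumptions, and `fileSHA256`, `isComplete` and `demo` are hypothetical helpers.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// fileSHA256 returns the hex-encoded SHA-256 digest of the file at path.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// isComplete reports whether the file's base name (without extension)
// matches the digest of its content. Assumes hash-based file names.
func isComplete(path string) (bool, error) {
	sum, err := fileSHA256(path)
	if err != nil {
		return false, err
	}
	name := strings.TrimSuffix(filepath.Base(path), filepath.Ext(path))
	return strings.EqualFold(name, sum), nil
}

// demo checks an intact file and then a truncated copy under the same name.
func demo() (intact, truncatedPasses bool, err error) {
	dir, err := os.MkdirTemp("", "hash-demo")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	content := []byte("example payload")
	sum := sha256.Sum256(content)
	path := filepath.Join(dir, hex.EncodeToString(sum[:])+".png")

	if err = os.WriteFile(path, content, 0o644); err != nil {
		return
	}
	if intact, err = isComplete(path); err != nil {
		return
	}
	// Overwrite with a truncated payload to imitate an interrupted download.
	if err = os.WriteFile(path, content[:5], 0o644); err != nil {
		return
	}
	truncatedPasses, err = isComplete(path)
	return
}

func main() {
	intact, truncatedPasses, err := demo()
	fmt.Println("intact file passes:", intact, "truncated file passes:", truncatedPasses, "error:", err)
}
```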

1223334444abc commented 1 year ago

For the page linked above, I would like the videos in the Downloads section to be saved as "xxx.mov", with the images numbered sequentially in the folder. This is more convenient for organizing and managing a database.

[Fanbox]xxxx [20211111] [111111] aaaaaaaaaaaa

xxx.mov and then 0.png 1.png 2.png .......... and Content.txt

I also just noticed that after an EOF error, the remaining images in the same post that come after the failed file are not downloaded.

elvis972602 commented 1 year ago

I see what you mean. This is a good suggestion, and this naming convention also seems more reasonable.

1223334444abc commented 1 year ago

> It seems that some requests are not covered by the proxy. I will try to fix it.

Thank you for your work. After testing, the new version works with --proxy.

......
5.08s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 1.29 MB/s 6.53 MB 7.png
5.25s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 1.24 MB/s 6.50 MB 6.png
302.82ms Failed ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0% 0 B/s 0 B 13.png
download failed
6.50s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 1.00 MB/s 6.53 MB 10.png
2.10s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 3.13 MB/s 6.56 MB 11.png
......

During the test run, this file (13.png) reported an error and was skipped instead of being retried. I would like every file to be retried on a download error until it succeeds. Perhaps a retry should be forced on any error? (For a large database, finding and filling in the gaps afterwards is even more painful.)

Another small question: downloading more than three files simultaneously in proxy mode results in a 429 error. Does this mean that 'max download parallel' needs to be set below 3?

......
1.41s Failed ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0% 0 B/s 564 B 6.mp4
http 429 request too many times, retry after 1.0 seconds...
2.22s Failed ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0% 0 B/s 564 B 6.mp4
http 429 request too many times, retry after 1.0 seconds...
55.17s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 3.53 MB/s 194.76 MB 5.mp4
1.50s Failed ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0% 0 B/s 564 B 6.mp4
http 429 request too many times, retry after 1.0 seconds...
57.77s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 4.63 MB/s 267.27 MB 1.mp4
01m48.84s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 2.62 MB/s 284.65 MB 2.mp4
01m3.03s Success ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0% 3.74 MB/s 235.98 MB 6.mp4
......

elvis972602 commented 1 year ago

Yes, if you keep encountering HTTP 429, reducing max download parallel may be a good option.
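For readers curious what a parallel-download limit does, the sketch below caps the number of concurrently running jobs with a buffered channel used as a semaphore. It is a generic Go pattern, not the project's implementation; `runLimited` is a hypothetical helper that also reports the peak concurrency it observed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runLimited runs jobs 0..n-1 with at most limit running at the same time
// and returns the highest number of jobs seen running concurrently.
func runLimited(n, limit int, job func(i int)) int {
	sem := make(chan struct{}, limit) // one slot per allowed concurrent job
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0

	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit jobs are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot once the job has finished

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			job(i)

			mu.Lock()
			running--
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runLimited(10, 3, func(i int) {
		time.Sleep(20 * time.Millisecond) // stand-in for one file download
	})
	fmt.Println("peak concurrent downloads:", peak)
}
```

Lowering the limit reduces how many requests reach the server at once, which is why it can help with HTTP 429 responses.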

1223334444abc commented 1 year ago

I had a rough look at the "59a979f" branch (I don't know programming, I just skimmed through it). Perhaps .pdf (multi-page manga) and [.psd .psb .sai .pntr .clip] (drawing source files) also need to be considered. It would be even better if files other than images were supported.

I checked my fanbox and fantia databases, and these are probably the file formats that are missing.

elvis972602 commented 1 year ago

I was also wondering which category to put these graphics files in. Maybe it would be better to put them in a separate category?

1223334444abc commented 1 year ago

PDF files should preferably be in a separate category, while source files [.psd .psb .sai .pntr .clip] should be in the same category.

It may be a bit redundant, but please also note that the file name obtained for the psd file mentioned above should be "xxx.psd", not "Download xxx.psd".

1223334444abc commented 1 year ago

I just remembered a problem I had earlier when passing parameters on the command line: swapping the order of the parameters made it impossible to obtain 'creator'. Because I have since switched to a .yaml file, I no longer remember the exact error message, but the problem does exist.

elvis972602 commented 1 year ago

I think the current categories are sufficient. You can use the default --template to set the naming convention for these files and --image-template for the images. Example:

template: "[<ks:service>] <ks:creator>/<ks:post>/<ks:filename><ks:extension>"
image-template: "[<ks:service>] <ks:creator>/<ks:post>/<ks:index><ks:extension>"
video-template: "[<ks:service>] <ks:creator>/<ks:post>/video/<ks:filename><ks:extension>"

The result will be something like: 0.jpg, 1.jpg, 2.jpg, 3.jpg, xxxx.pdf, video/xxxx.m4v.
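To make the placeholder mechanism concrete, here is a small sketch of how `<ks:...>` tokens could be substituted into a path. It only illustrates the idea; `expand` and the sample values (creator "alice", post "12345") are hypothetical, and the tool's real template engine may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// expand replaces every <ks:name> placeholder in template with the matching
// entry from values. Placeholders without a value are left untouched.
func expand(template string, values map[string]string) string {
	pairs := make([]string, 0, len(values)*2)
	for name, value := range values {
		pairs = append(pairs, "<ks:"+name+">", value)
	}
	return strings.NewReplacer(pairs...).Replace(template)
}

func main() {
	imageTemplate := "[<ks:service>] <ks:creator>/<ks:post>/<ks:index><ks:extension>"
	path := expand(imageTemplate, map[string]string{
		"service":   "fanbox",
		"creator":   "alice", // hypothetical creator name
		"post":      "12345", // hypothetical post name
		"index":     "0",
		"extension": ".png",
	})
	fmt.Println(path)
}
```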

1223334444abc commented 1 year ago

I think this is feasible. For files in 'Downloads', normal file names can generally be obtained. It is indeed possible to merge them for processing.

If it is convenient, please provide an exe to try. I am waiting for a new release.

elvis972602 commented 1 year ago

Sorry, I just forgot. You can now download it from the releases.

1223334444abc commented 1 year ago

Congratulations, the program has been running continuously for an hour, and all download errors have been retried. Each file is downloaded well according to the rules. Surprisingly, today I was able to use 'max download parallel: 10' without any problem. I will continue to run it and observe the situation.

elvis972602 commented 1 year ago

Thank you so much for your feedback and advice! I will close this issue for now. If there are any other problems, you are very welcome to open a new issue.