AlphaSlayer1964 / kemono-dl

A simple kemono.party downloader using Python.

Various feature requests #58

Closed reyaz006 closed 2 years ago

reyaz006 commented 2 years ago

As a long-time yiff.bat user, there are certain considerations and features I'd like to see implemented.

  1. I'm seeing quite a lot of potential problems. What if I hit the 256-character path limit imposed by my file system? Do we filter all illegal characters so the folder is created rather than silently skipped? What if the post title updates? What if the number of files inside the post changes, so old numbers [1][2] need to be replaced with [01][02] or [001][002]? I'd suggest setting a maximum file path length and using a constant numbering scheme like [0000].
  2. The main output structure. Keeping things in an organized directory tree is great, but as the collection grows it becomes a chore to navigate. There really needs to be an option for a flat structure. I'd say "service.id" for folder names, or optionally "service.id.artistname" if the user wants it. Inside, there should be no additional folders, to keep things clean and easy.
  3. HTML data. In the end it's better to have one file with everything for each artist. An HTML file can hold all the content, with proper metadata like dates and post titles, plus optional extras like thumbnails, comments, etc. You also probably don't want to rely on detecting links to specific domains in order to save them to a separate .txt file - you can't imagine how many domains there are, and such detection is easy to break. I think all that's really needed is to handle anchors, images, and images inside anchors.
  4. Quick blacklisting ability. I can't measure how much HDD space I was able to save thanks to it. A lot of posts, including entries from different services, contain dupes, or files you just don't want to keep. And it's different for each artist - I don't want to skip all files of a specific format, I want to review the results, decide myself whether to get rid of something, and make sure those files will not be re-downloaded on the next download session with updates enabled. I just create a .txt file with a specific name (e.g. blacklist.txt or mediocre.txt) and put the filenames in it, one per line. The script should see it, read it, and skip downloading those specific files. This is another reason why it's better to have all of an artist's files in one folder.
  5. What about setting the file creation date equal to the post date? Keep useful metadata where possible.
AlphaSlayer1964 commented 2 years ago

What if I hit the 256-character path limit imposed by my file system? Do we filter all illegal characters so the folder is created rather than silently skipped?

Yes, all illegal file/folder name characters are removed, and the Windows file/folder name length limit is respected.
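
For illustration, a minimal sketch of that kind of sanitization and length capping; the illegal-character set and the 255-character component limit below are the usual Windows constraints, not kemono-dl's actual code:

```python
import re

# Characters Windows disallows in file/folder names, plus control characters.
ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
MAX_NAME = 255  # per-component length limit on most file systems

def safe_name(name: str) -> str:
    cleaned = ILLEGAL.sub("", name)
    cleaned = cleaned.rstrip(" .")  # Windows also rejects trailing dots/spaces
    return cleaned[:MAX_NAME]

print(safe_name('Post: "title"? <draft>'))  # -> Post title draft
```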

What if the post title updates? What if the number of files inside the post changes

Yeah, --update was an afterthought I never got back to, so those cases are not handled. If the title changed, a new post folder would be created. If the title didn't change but the files did, then yes, you would keep the old files and possibly get duplicates if they didn't have the same indexing.

using a constant numbering scheme like [0000].

The indexing string uses zfill, so if you have 10 attachments the indexing will be [01], with 100 attachments [001], with 1000 attachments [0001], etc.
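
As a sketch of how that padding behaves, and what the suggested constant scheme would look like (illustrative code, not the downloader's actual implementation):

```python
# The pad width follows the attachment count, which is why the width
# changes between runs once the count crosses a power of ten.
attachments = [f"file{i}.png" for i in range(1, 11)]  # 10 attachments

width = len(str(len(attachments)))          # 10 -> 2 digits, 100 -> 3, ...
for i, name in enumerate(attachments, start=1):
    index = str(i).zfill(width)             # "01", "02", ..., "10"
    print(f"[{index}] {name}")

# The constant scheme suggested above would simply fix the width:
# index = str(i).zfill(4)                   # always "[0001]"-style
```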

  2. The main output structure. Keeping things in an organized directory tree is great, but as the collection grows it becomes a chore to navigate. There really needs to be an option for a flat structure. I'd say "service.id" for folder names, or optionally "service.id.artistname" if the user wants it. Inside, there should be no additional folders, to keep things clean and easy.

I agree it's not the best for viewing the downloads, but I want to make sure everything is downloaded and organized. I have been slowly working on adding a dynamic output template like yt-dlp's, but going back and adding it retroactively is a pain. Also, this downloader is a side side hobby, so I don't have a lot of time to invest in it.

  3. HTML data. In the end it's better to have one file with everything for each artist. An HTML file can hold all the content, with proper metadata like dates and post titles, plus optional extras like thumbnails, comments, etc.

Maybe, but I can see updating that single HTML file on a second run being a pain in the ass to properly maintain.

You also probably don't want to rely on detecting links to specific domains in order to save them to a separate .txt file - you can't imagine how many domains there are, and such detection is easy to break. I think all that's really needed is to handle anchors, images, and images inside anchors.

I assume you are referring to --extract-links? That just takes every href link in the content HTML and puts them in a file. I did remove saving inline images that are not hosted on kemono.party for this reason, though.
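
For illustration, a hedged sketch of handling anchors, images, and images inside anchors with BeautifulSoup; the variable names and HTML sample are made up, not kemono-dl's actual extraction code:

```python
from bs4 import BeautifulSoup

content_html = (
    '<p><a href="https://example.com/a"><img src="/thumb.png"></a>'
    '<img src="/inline.jpg"><a href="https://example.com/b">link</a></p>'
)

soup = BeautifulSoup(content_html, "html.parser")

# Plain anchors and plain images.
links  = [a["href"] for a in soup.find_all("a", href=True)]
images = [img["src"] for img in soup.find_all("img", src=True)]
# Images wrapped in an anchor: the anchor's href is usually the full-size file.
full_size = [img.parent["href"] for img in soup.find_all("img")
             if img.parent.name == "a" and img.parent.has_attr("href")]

print(links)      # ['https://example.com/a', 'https://example.com/b']
print(images)     # ['/thumb.png', '/inline.jpg']
print(full_size)  # ['https://example.com/a']
```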

  4. Quick blacklisting ability. I can't measure how much HDD space I was able to save thanks to it. A lot of posts, including entries from different services, contain dupes, or files you just don't want to keep. And it's different for each artist - I don't want to skip all files of a specific format, I want to review the results, decide myself whether to get rid of something, and make sure those files will not be re-downloaded on the next download session with updates enabled. I just create a .txt file with a specific name (e.g. blacklist.txt or mediocre.txt) and put the filenames in it, one per line. The script should see it, read it, and skip downloading those specific files.

So unless the duplicates are in different posts, there should be no duplicate attachments. If there are, that is something you want to report to kemono.party. If you mean no duplicates across all posts, then I guess that would be a useful feature.
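
For illustration, a minimal sketch of the requested blacklist behavior; the blacklist.txt name comes from the suggestion above, and none of this is an existing kemono-dl option:

```python
import os

def load_blacklist(artist_dir: str, name: str = "blacklist.txt") -> set:
    """Read one filename per line; a missing file means an empty blacklist."""
    path = os.path.join(artist_dir, name)
    if not os.path.isfile(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

blacklist = load_blacklist("downloads/patreon.12345")  # illustrative path
for filename in ["dupe.png", "keep.png"]:
    if filename in blacklist:
        continue  # never re-downloaded, even on update runs
    print("would download", filename)
```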

  5. What about setting the file creation date equal to the post date? Keep useful metadata where possible.

Currently I'm just downloading what's on the party servers, so if they don't already have the metadata, it's not added.
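
For illustration, a sketch of stamping a downloaded file with the post date via os.utime; the timestamp format and paths are assumptions, and note that os.utime sets access/modification time, while true creation time is platform-specific (on Windows it would need something like pywin32):

```python
import os
from datetime import datetime, timezone

def apply_post_date(path: str, published: str) -> None:
    # The ISO timestamp format here is an assumption about the API's dates.
    ts = datetime.fromisoformat(published).replace(tzinfo=timezone.utc).timestamp()
    os.utime(path, (ts, ts))  # sets (atime, mtime)

# Illustrative call; the path and date are made up.
apply_post_date("downloads/patreon.12345/file.png", "2021-05-01T12:30:00")
```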

Like I said, this downloader is a side side thing. I only made it to download some stuff quickly from kemono.party. Funny thing is, I don't really use it myself now, except when there is an issue posted. If you wish to make a pull request for any of these features I'd be glad to look it over and add them. Otherwise it might be a while until I add everything.

reyaz006 commented 2 years ago

The indexing string uses zfill, so if you have 10 attachments the indexing will be [01], with 100 attachments [001], with 1000 attachments [0001], etc.

If you don't handle the change between these widths when needed during updates, then make the width constant - that's what my suggestion is about. Make it optional, maybe.

I have been slowly working on adding a dynamic output template like yt-dlp's

I don't get what yt-dlp even has to do with all this. It's a YouTube downloader, no?

Maybe, but I can see updating that single HTML file on a second run being a pain in the ass to properly maintain.

What's there to maintain? 1 artist = 1 combined HTML file. Update it on each run, because it's small and quick. Or, if the API data hasn't changed, don't update it.
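
For illustration, a minimal sketch of that one-file-per-artist idea, rebuilding the page from scratch on each run; the post field names are assumptions about the API's JSON shape:

```python
import html

def build_archive(posts: list) -> str:
    """Regenerate the whole archive page from the post list each run."""
    parts = ["<!DOCTYPE html><html><body>"]
    for post in sorted(posts, key=lambda p: p["published"]):
        parts.append(f"<h2>{html.escape(post['title'])}</h2>")
        parts.append(f"<p><em>{html.escape(post['published'])}</em></p>")
        parts.append(post["content"])  # post body is already HTML from the API
        parts.append("<hr>")
    parts.append("</body></html>")
    return "\n".join(parts)

posts = [{"title": "Post A", "published": "2021-05-01", "content": "<p>hi</p>"}]
with open("artist.html", "w", encoding="utf-8") as f:
    f.write(build_archive(posts))
```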

So unless the duplicates are in different posts, there should be no duplicate attachments. If there are, that is something you want to report to kemono.party. If you mean no duplicates across all posts, then I guess that would be a useful feature.

I don't think I understand this part properly. No, reporting is not an option, because it's not only a pain and a waste of time, but also subjective in most cases. I assume your tool does not remove files when they get wiped from the site's API anyway. There may be 100% dupes, downscaled dupes, re-encoded dupes, or just several-GB piles of trash, etc. You can't predict which parts of each folder the user wants to get rid of, so let the user set this up in an obvious way.

Currently I'm just downloading what's on the party servers, so if they don't already have the metadata, it's not added.

I mean each post has the original post date and sometimes an edit date. These can be transferred into the files' properties, so the user can search for files locally by file creation date if needed.

Like I said, this downloader is a side side thing.

Ah well, I understand. Maybe at least this helps someone get a better sense of the list of expected features.

AlphaSlayer1964 commented 2 years ago

If you don't handle the change between these widths when needed during updates, then make the width constant - that's what my suggestion is about. Make it optional, maybe.

I don't get what yt-dlp even has to do with all this. It's a YouTube downloader, no?

I'm just referring to how yt-dlp lets you set an output template for custom file and folder names based on the video's (or, in this case, the post's) information. When I get around to adding this I will have to change how the indexing currently works anyway.
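
For illustration, a sketch of what such an output template could look like; the template syntax and field names are hypothetical, not an existing kemono-dl flag:

```python
from pathlib import Path

# Hypothetical template; {title} would still need the usual sanitization.
TEMPLATE = "{service}.{user_id}/[{index}] {title}{ext}"

post = {"service": "patreon", "user_id": "12345",
        "index": "0001", "title": "My Post", "ext": ".png"}

print(Path(TEMPLATE.format(**post)))  # patreon.12345/[0001] My Post.png
```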

What's there to maintain? 1 artist = 1 combined HTML file. Update it on each run, because it's small and quick. Or, if the API data hasn't changed, don't update it.

I never write HTML, just scrape it, so I was just covering for any potential issues that might come of it.

I don't think I understand this part properly. No, reporting is not an option, because it's not only a pain and a waste of time, but also subjective in most cases. I assume your tool does not remove files when they get wiped from the site's API anyway. There may be 100% dupes, downscaled dupes, re-encoded dupes, or just several-GB piles of trash, etc. You can't predict which parts of each folder the user wants to get rid of, so let the user set this up in an obvious way.

I mean each post has the original post date and sometimes an edit date. These can be transferred into the files' properties, so the user can search for files locally by file creation date if needed.

I was referring to reporting posts with duplicate files, and when I say duplicate I mean exact duplicates, not downscaled or re-encoded ones; exact duplicates should also already be filtered out if they are in the same post. I also don't think the blacklist and metadata are bad ideas, but when I made the downloader I was looking to get an exact, organized copy, so excluding similar files was not a concern for me, and neither was adding metadata. I'm not saying I won't add anything you recommended, just don't expect anything too quickly is all. I will close the issue for now, but know that I have the ideas you proposed in mind.
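
For illustration, a minimal sketch of exact-duplicate detection across posts by content hash; this is illustrative only, and by design it catches byte-identical files but not downscaled or re-encoded ones:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Hash a file in 1 MiB chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen: set = set()

def is_new(path: str) -> bool:
    # Skip any file whose exact bytes were already downloaded for this artist.
    digest = sha256_of(path)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```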