AlexCSDev / PatreonDownloader

Powerful tool for downloading content posted by creators on patreon.com. Supports content hosted on patreon itself as well as external sites (additional plugins might be required).
MIT License
928 stars, 95 forks

Additional metadata and filters #170

Open skulkexpert opened 1 year ago

skulkexpert commented 1 year ago

This is just a suggestion. The program works quite well and seems to download all posts, including those missed by gallery-dl. It even downloads from external links, which is much appreciated.

Metadata

However, it could use more configuration options, such as the ability to download more metadata. gallery-dl is a good example: it can create JSON files with information about each post. At the very least, it would be useful to have the post's timestamp written into the HTML files.
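To illustrate the kind of per-post JSON file meant here, a minimal sketch follows. It is not how PatreonDownloader or gallery-dl store metadata; the function name `write_post_metadata` and the `id`, `title`, and `published_at` fields are hypothetical and chosen only for the example.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_post_metadata(post: Dict[str, Any], directory: Path) -> Path:
    """Write one post's metadata to "<id>.json" inside the given directory.

    The "id" key is an assumption of this sketch, not a documented field.
    """
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{post['id']}.json"
    # ensure_ascii=False keeps non-ASCII titles readable in the file.
    text = json.dumps(post, ensure_ascii=False, indent=2, sort_keys=True)
    target.write_text(text, encoding="utf-8")
    return target
```

Calling `write_post_metadata({"id": "123", "title": "Example", "published_at": "2023-01-01T00:00:00Z"}, Path("downloads"))` would create `downloads/123.json`, which keeps the timestamp next to the downloaded files.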

Filters and Duplicates

I think this may have been mentioned before, but it would be useful to have filters for files to skip. I download from a Patreon creator who posts a lot of large images, and Patreon seems to set a duplicate of one of these images as the "post" image (the image downloaded as "_post_" in PatreonDownloader).

These duplicates take up a lot of space. Since PatreonDownloader doesn't use an archive to keep track of downloaded images, whenever I update my downloads from this creator I have to manually stop it in time and delete all of the unnecessary post files. gallery-dl has a useful solution for this: it skips this particular duplicate.

AlexCSDev commented 1 year ago

At the moment you can dump something similar to metadata with the --json command-line option. This dumps all Patreon responses, which you can then parse further.

When it comes to duplicates, the issue is that I'm not sure whether _post_ files are always identical to one of the post attachments. It is possible to add an option to ignore the primary file if there are any attachments, but that can lead to missing files. Users are more than welcome to report their findings on this, because in order to be 100% sure that the _post_ file is identical I need data from a lot of creators.
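For anyone who wants to collect that kind of data, here is a rough sketch of a check that could be run on an existing download folder. It assumes that all files for a creator sit in one directory and that the primary file has "_post_" in its name; the function names are hypothetical. It only detects byte-identical copies, so a re-encoded or resized image would not be reported as a duplicate.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_post_files(directory: Path, marker: str = "_post_") -> Dict[str, List[str]]:
    """Map each "_post_" file name to the byte-identical other files.

    An empty list means no identical file was found in the directory.
    """
    files = [p for p in sorted(directory.iterdir()) if p.is_file()]
    # Index every non-"_post_" file by its content hash.
    others_by_digest: Dict[str, List[str]] = {}
    for path in files:
        if marker not in path.name:
            others_by_digest.setdefault(file_digest(path), []).append(path.name)
    # Look up each "_post_" file in that index.
    return {
        path.name: others_by_digest.get(file_digest(path), [])
        for path in files
        if marker in path.name
    }
```

Any "_post_" file that maps to an empty list has no identical attachment, so it would be lost if the primary file were skipped for that post.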

skulkexpert commented 1 year ago

@AlexCSDev Right, I missed the --json option, my bad. This is very useful, but I still think the HTML files should include the post's timestamp alongside the description.

gallery-dl seems to have some method of excluding duplicates, so it may be worth checking out their source code. However, your idea of adding an option to ignore the primary file would fix my issue, so that would be really nice to have.

Having some kind of optional archive of downloaded files that can be skipped would also make updating much easier. It doesn't have to be SQLite; a txt file with the filenames would be enough. That way, once the downloader encounters a certain number of already-downloaded files, it could abort. For example, an option called --abort could take the number of skipped files after which to abort. This would also let the user delete any unnecessary files without PatreonDownloader re-downloading them. Both gallery-dl and pixivUtil work this way, so it may be worth taking a look.
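A rough sketch of what I mean, using a plain text archive with one filename per line. None of this is existing PatreonDownloader code: the function names and the `abort_after` parameter are made up for illustration. It also assumes that files are processed newest first and that the abort counter tracks consecutive skips, which is only one possible reading of the idea.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Set


def load_archive(archive_path: Path) -> Set[str]:
    """Read the archive file (one filename per line) into a set."""
    if not archive_path.exists():
        return set()
    lines = archive_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def download_new(
    filenames: Iterable[str],
    archive_path: Path,
    download: Callable[[str], None],
    abort_after: Optional[int] = None,
) -> List[str]:
    """Download files not yet in the archive and record them.

    Stops early once `abort_after` archived files are met in a row.
    """
    seen = load_archive(archive_path)
    downloaded: List[str] = []
    skipped_in_a_row = 0
    with archive_path.open("a", encoding="utf-8") as archive:
        for name in filenames:
            if name in seen:
                skipped_in_a_row += 1
                if abort_after is not None and skipped_in_a_row >= abort_after:
                    break  # everything older is assumed to be archived
                continue
            skipped_in_a_row = 0
            download(name)
            # Record right away so an interrupted run keeps its progress.
            archive.write(name + "\n")
            archive.flush()
            seen.add(name)
            downloaded.append(name)
    return downloaded
```

Because the archive is separate from the files on disk, deleting an unwanted file does not cause it to be downloaded again on the next run.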