Bionus / imgbrd-grabber

Very customizable imageboard/booru downloader with powerful filenaming features.
https://www.bionus.org/imgbrd-grabber/
Apache License 2.0

Storing tags in IPTC Keywords #2081

Closed thany closed 3 years ago

thany commented 4 years ago

Is your feature request related to a problem? Please describe

Sort of. When I want to preserve tags, pretty much my only option is to keep them in the filename. But paths longer than 260 characters can be a right chore to deal with, in most software. It's doable, but far from ideal.

Describe the solution you'd like

I'd like Grabber to be able to store tags in the IPTC Keywords field in image files. This doesn't work for video, but at least it does for images. The IPTC block is non-standard in PNG files, but supported by most software. In JPEG, it's been standardized for ages and all software that reads keywords from images, will read this field.

This could be achieved by utilizing exiftool using the following command for each file:

exiftool -sep "," -P -overwrite_original_in_place -IPTC:Keywords="$tags" "$file"

Explanation:

That way, Grabber doesn't have to implement very much in the main program. Just do an external call to exiftool for each saved file.
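A minimal sketch of such an external call, written as a Python wrapper. The helper names `build_exiftool_args` and `write_keywords` are hypothetical, and it assumes `exiftool` is installed and on the PATH; only the flags from the command above are used.

```python
import subprocess


def build_exiftool_args(tags, path, sep=","):
    """Build the argument list for the exiftool command above (hypothetical helper)."""
    joined = sep.join(tags)
    return [
        "exiftool",
        "-sep", sep,                     # split the assigned value on `sep` into list items
        "-P",                            # preserve the file modification date/time
        "-overwrite_original_in_place",  # overwrite the file without keeping a backup copy
        f"-IPTC:Keywords={joined}",
        path,
    ]


def write_keywords(tags, path):
    """Run exiftool on one file; assumes exiftool is installed and on PATH."""
    return subprocess.run(build_exiftool_args(tags, path), check=True)


print(build_exiftool_args(["red and blue", "girl"], "image.jpg"))
```

Passing the arguments as a list rather than one shell string means tags containing spaces or quotes need no extra escaping.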

Describe alternatives you've considered

Saving tags in filenames, and then using a script in WSL to loop over each file, do the above exiftool operation, and rename the file to something shorter (in my case, the md5sum). This has to be done from WSL rather than the command prompt, because Linux can deal with superlong filenames a lot more easily than most Windows tools (Windows itself is fine with them - it's the necessary tools that aren't).

Additional context

The reason I'm suggesting exiftool over some kind of package for Grabber to use internally, is because exiftool is proven technology. It has withstood the test of time. It's been tested to death. I don't know about any packages that I have the same level of confidence for.

FichteFoll commented 4 years ago

As you hinted, storing tags in image metadata isn't exactly feasible because it is not supported by all image formats. See #874 and #1793 for alternatives.

I seem to remember a mention of Grabber being able to run custom commands for downloaded images, so you could call exiftool directly or write a wrapper script that can do this. If that's not possible, it sounds like a reasonable feature request.

Bionus commented 4 years ago

Yes, the issue with storing metadata in only part of the downloaded files is that it would be even more confusing for most end users, even though this feature is highly requested.

You mention that both JPG and PNG are supported though, which I was not aware of. If that's true, that would represent a huge majority of downloaded files and "good enough" for me to add it as a native feature.

I seem to remember a mention of Grabber being able to run custom commands for downloaded images, so you could call exiftool directly or write a wrapper script that can do this. If that's not possible, it sounds like a reasonable feature request.

Yes. If you're going the road of running a binary for each downloaded image, Grabber can already do that in the "Options > Commands" part. Normally, this should work:

exiftool -sep "," -P -overwrite_original_in_place -IPTC:Keywords="%all%" "%path%"

Which looks almost identical to what you suggested, and does not require the intermediate "save all tags in the filename" as the tags are still present in the context when running the commands.

thany commented 4 years ago

@Bionus

You mention that both JPG and PNG are supported though, which I was not aware of. If that's true, that would represent a huge majority of downloaded files and "good enough" for me to add it as a native feature.

Full disclaimer: I said that, and it's true, but for PNG it is not standardized. It is for JPEG though. Of course, not standardized doesn't mean it won't work. I can only test in the software I use, and I can say with confidence that my software reads IPTC Keywords from PNG absolutely fine, but yours may have trouble with it. YMMV 😀

Normally, this should work:

Then one thing remains to be solved: how can I make Grabber separate tags by a comma or some other "non-tag character"? Because it looks like it's separating them by space. Exiftool can certainly deal with that, but it's a problem for tags with a space in them as well. So if %all% reads something like red and blue flower picking girl, that's 6 tags, even if red and blue was originally meant as 1 tag.

Bionus commented 4 years ago

Then one thing remains to be solved: how can I make Grabber separate tags by a comma or some other "non-tag character"? Because it looks like it's separating them by space. Exiftool can certainly deal with that, but it's a problem for tags with a space in them as well. So if %all% reads something like red and blue flower picking girl, that's 6 tags, even if red and blue was originally meant as 1 tag.

Did you check this? https://bionus.github.io/imgbrd-grabber/docs/filename.html#lists

For %all%, options for both "Lists" and "Tag lists" apply. But in your case you only need %all:separator=,% I guess? You can also add the underscores option if you want underscores instead of spaces inside the tags.
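To illustrate why the separator matters, here is a small Python sketch. The tag values are made up, and `join`/`split` only stand in for what Grabber does when expanding %all% and what exiftool does with -sep; this is not either tool's actual code.

```python
# Hypothetical tags, two of which contain spaces.
tags = ["red and blue", "flower picking", "girl"]

# Default behaviour discussed above: tags joined with spaces.
space_joined = " ".join(tags)
# Splitting on spaces cannot recover the original tags.
print(space_joined.split(" "))
# ['red', 'and', 'blue', 'flower', 'picking', 'girl']

# With a comma separator, tag boundaries survive the round trip.
comma_joined = ",".join(tags)
print(comma_joined.split(","))
# ['red and blue', 'flower picking', 'girl']
```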

thany commented 4 years ago

Thank you! I didn't know about that. I suppose the default should be changed to... something. Because spaces in tags and spaces as separators is not ideal in most cases, but that's what the default is.

FichteFoll commented 4 years ago

Because spaces in tags and spaces as separators is not ideal in most cases

Definitely agreed. I personally prefer using underscores inside tags and joining with spaces, since that's what most image hosting sites also use for search queries, but YMMV.
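The underscore convention can be sketched the same way in Python. The tag values are made up for illustration, and the string operations only approximate what the tool options would do.

```python
# Hypothetical tags, two of which contain spaces.
tags = ["red and blue", "flower picking", "girl"]

# Replace spaces inside each tag with underscores, then join with spaces.
line = " ".join(tag.replace(" ", "_") for tag in tags)
print(line)  # red_and_blue flower_picking girl

# Splitting on spaces now yields exactly one item per tag.
restored = [item.replace("_", " ") for item in line.split(" ")]
print(restored == tags)  # True
```

One caveat of this convention: a tag that legitimately contains an underscore would come back with a space after the round trip.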

thany commented 4 years ago

So I went a bit further and pulled as much information from the available keys as I can to store in picture metadata. The command has become this monster:

exiftool -sep ";" -P -overwrite_original_in_place
  -IPTC:Keywords="%all:separator=;,ignorenamespace=artist,includenamespace,excludenamespace=general,unsafe%"
  -IPTC:Keywords="rating:%rating%"
  -IPTC:By-line="%artist%"
  -XMP:Creator="%artist%"
  -XMP:Rating="%score%"
  -XMP:PreservedFileName="%filename%"
  -XMP:Source="%url_page:unsafe%"
  -XMP:CreatorWorkURL="%website%"
  -FileModifyDate="%date:format=yyyy-MM-dd hh:mm:ss,unsafe%"
  -FileCreateDate="%date:format=yyyy-MM-dd hh:mm:ss,unsafe%"
  -EXIF:DateTimeOriginal="%date:format=yyyy-MM-dd hh:mm:ss,unsafe%"
  -EXIF:CreateDate="%date:format=yyyy-MM-dd hh:mm:ss,unsafe%"
  -XMP:DateCreated="%date:format=yyyy-MM-dd hh:mm:ss,unsafe%"
  -IPTC:TimeCreated="%date:format=hh:mm:ss,unsafe%"
  -IPTC:DigitalCreationDate="%date:format=yyyy-MM-dd%"
  -IPTC:DigitalCreationTime="%date:format=hh:mm:ss,unsafe%"
  "%path%"

(split over multiple lines for readability)

So in human-speak 😀:

  1. Writes tags with prepended namespace, except for general tags. And it leaves out the artist tag.
  2. Also writes the rating (questionable/etc) as a tag, because for this there's no standardized tag in common use (there is an IPTC tag that provides similar information, but software seldom even sees this tag).
  3. Artist name is written to IPTC By-line and XMP Creator tags. Yes, there are quite a few tags that are superfluous, but you can never really know which is picked up by software.
  4. The score is written to the Rating tag. I hope this is a number between 0 and 5 (inclusive) because that's what most software expects. This is not described in the documentation, unfortunately.
  5. The filename on the server is preserved in a tag that exists just for that. This value contains spaces (at least for yande.re it does) which doesn't seem right, but it doesn't really matter too much either way.
  6. The Source tag is supposed to be a reference to the source of a "document", which can be a URL, but can technically be anything, including a ridiculous description like "bookshelf 441-A, 4 down, 17 from the left" for offline/physical content.
  7. The CreatorWorkURL tag is technically supposed to be a URL pointing to the author of a work, but it seemed the most appropriate to put the name of the booru a picture came from. It's kind of useless, but I thought I'd include it because it can't hurt.
  8. Finally, all kinds of dates and times. It updates the filedates, EXIF dates, IPTC dates and XMP dates. You can never know in advance which of these is picked up by software, and since only a singular date is available from Grabber, we should set all these date tags to the same value.
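The "one date, many tags" idea from the last point can be sketched as follows. The tag names are copied from the command above; the helper `date_args` and the sample date are hypothetical, and the output is only the list of `-TAG=value` arguments, not a full exiftool call.

```python
from datetime import datetime

# Tag names copied from the command above; grouping them is an illustration only.
DATETIME_TAGS = [
    "FileModifyDate",
    "FileCreateDate",
    "EXIF:DateTimeOriginal",
    "EXIF:CreateDate",
    "XMP:DateCreated",
]
TIME_TAGS = ["IPTC:TimeCreated", "IPTC:DigitalCreationTime"]
DATE_TAGS = ["IPTC:DigitalCreationDate"]


def date_args(when):
    """Turn one datetime into -TAG=value arguments for every date tag (hypothetical helper)."""
    full = when.strftime("%Y-%m-%d %H:%M:%S")
    day = when.strftime("%Y-%m-%d")
    clock = when.strftime("%H:%M:%S")
    args = [f"-{tag}={full}" for tag in DATETIME_TAGS]
    args += [f"-{tag}={clock}" for tag in TIME_TAGS]
    args += [f"-{tag}={day}" for tag in DATE_TAGS]
    return args


# Made-up sample date.
for arg in date_args(datetime(2021, 3, 4, 15, 6, 7)):
    print(arg)
```

Keeping the tag names in lists makes it harder for one of the many date fields to drift out of sync with the others.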

Why am I sharing this? Because it's pretty cool 😎, and perhaps it's a starting point for you (Grabber authors) to start thinking about maybe including this natively in some way.

Bionus commented 3 years ago

Just passing by to say that Exiftool support has just been added natively in Grabber! 🥳 I haven't had time to do in-depth testing and stuff, but my few tests were working well, so feel free to try it out.

Documentation can be found here: https://github.com/Bionus/imgbrd-grabber/blob/develop/docs/_docs/metadata.md#exiftool

Here's what it looks like: [screenshot]

perhaps it's a starting point for you (Grabber authors) to start thinking about maybe including this natively in some way.

To be honest, your post was the main inspiration to add this after finishing up the Windows Property System one (issue #2258). So I'd like to thank you for that 👍

And I apologize for not answering your comment earlier; I didn't notice there were a few points which required feedback.

I hope this is a number between 0 and 5 (inclusive) because that's what most software expects. This is not described in the documentation, unfortunately.

It's not. Depending on the source, it can go from -inf to +inf and represents the number of votes for an image. On some sources it can only go up (so from 0 to +inf). Definitely more than 5.
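Since the score is unbounded, it would need to be mapped into the 0 to 5 range before it suits a Rating field. One purely illustrative option is to clamp it; the function name and the choice of clamping rather than scaling are assumptions, not something Grabber does.

```python
def score_to_rating(score, low=0, high=5):
    """Clamp an unbounded vote score into [low, high] (hypothetical helper)."""
    return max(low, min(high, int(score)))


print(score_to_rating(-12))  # 0
print(score_to_rating(3))    # 3
print(score_to_rating(250))  # 5
```

Clamping loses most of the information in a vote count, so a per-source scaling might be preferable if the typical score range of a site is known.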

The Source tag is supposed to be a reference to the source of a "document", which can be a URL, but can technically be anything, including a ridiculous description like "bookshelf 441-A, 4 down, 17 from the left" for offline/physical content.

For this one I'd suggest %source% rather than %url_page%, since many websites are not the actual original source of the content, but sometimes provide it using this token. Maybe some conditional in case no source is provided could work, like <%source%?%source%:%url_page%>
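The fallback that such a conditional is meant to express, sketched in Python. The function name and the example URLs are hypothetical; this only mirrors the intent, not Grabber's conditional syntax.

```python
def pick_source(source, page_url):
    """Return the original source if one is provided, else the page URL (hypothetical helper)."""
    return source if source else page_url


# Made-up example values.
print(pick_source("https://example.com/artist/work/1", "https://booru.example/post/42"))
print(pick_source("", "https://booru.example/post/42"))
```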

The CreatorWorkURL tag is technically supposed to be a URL pointing to the author of a work, but it seemed the most appropriate to put the name of the booru a picture came from. It's kind of useless, but I thought I'd include it because it can't hurt.

Yes, I don't think there's any other token that could fit this currently.

thany commented 3 years ago

Brilliant! 🤩

I like that it's flexible so it's possible to add basically anything to anything.

However, it has to be said that it's also quite easy to mess up the formatting of a tag value. Dates are an obvious example of this, but also keywords - if the comma separator is forgotten, software will pick up the then space-separated string of keywords as a single keyword.

It might also be cool to overrule/extend these on a per-booru basis, in the Sources settings. Not sure if this is feasible though, and also not sure how well this is going to be used.

Finally, a way to revert to defaults might be a helpful tool when something has gotten messed up. Not knowing what a value "should be" can be frustrating if all your fiddling only makes things worse, and you just want to put it back.

Bionus commented 3 years ago

I like that it's flexible so it's possible to add basically anything to anything.

Yes that was the goal, the user just decides whatever he wants to fill with whatever value he wants.

It might also be cool to overrule/extend these on a per-booru basis, in the Sources settings. Not sure if this is feasible though, and also not sure how well this is going to be used.

Not sure how helpful that would be, and the UI currently doesn't really support this kind of thing. So maybe in the future, but for now a global setting should be enough.

Finally, a way to revert to defaults might be a helpful tool when something has gotten messed up. Not knowing what a value "should be" can be frustrating if all your fiddling only makes things worse, and you just want to put it back.

That's why there's no default value right now. My screenshot already contains customized values. This is considered an advanced feature, so people are free to put whatever they want at their own risk I guess. Same thing for the Windows Property System actually.

Adding default values or some kind of presets could make sense, but support would be a pain, with everyone having their own use case, and formats being complex, and so many extensions existing.

Bionus commented 3 years ago

BTW I see in the original issue that you use overwrite_original_in_place, while in this I used overwrite_original. Was there any particular reason for this choice? I read the docs and I don't feel it would change much? 🤔

thany commented 3 years ago

In the scenario of an automated downloader, there's no discernible difference. The point of overwrite_original_in_place is to preserve attributes, security settings and whatnot for a file. But the file was newly created moments before Exiftool is executed on it. Might as well perform a "simpler" overwrite instead.

I personally use this option for other files that have been sitting around for longer, to preserve any attributes I may or may not have added.