drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0
2.55k stars 479 forks source link

Exploration: Make Locale configurable for formatting of tags (not to be merged) #538

Open rdvdijk opened 3 years ago

rdvdijk commented 3 years ago

As initially discussed in https://github.com/drewnoakes/metadata-extractor-images/issues/35, the current formatting of certain tag descriptions are locale-dependent. The only way to influence the formatting is to set the locale globally, using Locale#setLocale. This can have undesirable effects when integrating metadata-extractor in a larger application.

This pull request explores the idea of making the Locale configurable. A specific Locale is passed to JpegMetadataReader#readMetadata, and all the way down to the Descriptors that need it to format specific values. Two descriptors have been altered to show two specific examples:

See the two added tests in JpegMetadataReaderTest (testConfigurableLocaleEnglish and testConfigurableLocaleDutch).

There is still much left to be done:

I am curious what you think of this, any feedback is welcome.

Nadahar commented 3 years ago

I think this should be handled using the long fabled MetadataContext. Getting this implemented would also open other possibilities, like an updated take on #225.

Even though I've said it before in other places, I'd like to say once again that setting the default Locale isn't a realistic option in most cases. The default Locale is derived from the OS settings and will usually be what you need when presenting numbers, dates, currency and more to the user, and it is also used for case-conversion and setting the UI language. It can't be modified for any application with a user interface that needs this to be set correctly for both formatting and localization.

I edited the comment because I mixed this up with the similar issue of relying on default character encoding, which also causes issues.

Generally, if presenting the information directly to the user, using the default Locale can be what you want. But, as soon as you intend to parse the output further, or that the default Locale is involved in parsing of metadata, it will come down to "luck". I don't know if it is, but remember that the default Locale is used in many places, like every call to String.format(), Sting.toLowerCase() and String.toUpperCase(), not to mention NumberFormatter, DecimalFormatter etc.

rdvdijk commented 3 years ago

Thanks for your thoughts, and giving some more context, much appreciated.

I had actually considered creating a Settings class that would contain the Locale, instead of just passing the Locale directly. I decided against that to keep this pull request "to the point". However, such a Settings class is of course very similar to the MetadataContext that was discussed in #225. As was discussed in that pull request, I also made the Locale optional (@Nullable), not to break the current API.

I agree with you that the Locale should be applied in many, many more places throughout the code base. We would have to track down every String operation and Formatter, and pass the Locale to each of those.

What could be a way forward? I am thinking it would be relatively easy to introduce a small MetadataContext interface that supplies the Locale, but nothing more for now. E.g.:

interface MetadataContext {
    Locale getLocale();
}

Future concerns like filtering (#225) could be added to the interface in a separate issue.

However, actually applying the Locale in all places would turn into a huge pull request. Maybe create separate issues to apply the Locale for the separate file types?

Nadahar commented 3 years ago

@rdvdijk I think your plan sounds like a good one. I don't know how much it matters how "big" the PR is when measured in code changes or lines, as I see it, making "one logical change"/feature is as small as it can be. However, it doesn't really matter that much what I think, @drewnoakes is the one that will have to make the decisions.

I don't know if it's preferable to make the "context class", the overloaded API calls and a partial application of Locale or to make it all in one go. The important thing is that it all gets there in the end. The risk is of course, that if you make a PR now with only partial implementation of Locale, "the rest" will never follow.

As I remember it, @drewnoakes wanted to minimize the number of places where such an object would have to be passed on. This is where I didn't see clearly what would be the best solution, as storing it as any kind of "global" introduces potential concurrency traps very fast. I guess that if it could be attached to a "reader" that is used for only a single extraction for example, it would be safe though. I think agreeing on exactly how the "context object" should be made accessible all the places it's needed is essential before a lot of work is invested.

drewnoakes commented 3 years ago

This is very interesting. Thanks for picking this up.

I agree with @Nadahar that, given you're going to the effort of threading this through, it'd be worth adding a broader type into which other kinds of state/settings/etc can be added.

I'm not worried about a large PR, so long as the change is clear and consistent and doesn't change the world in some fundamental way. Your idea of breaking it down by file type, for example, makes sense too.

rdvdijk commented 3 years ago

I'll introduce the MetadataContext interface in this branch, and create a separate branch to apply it in the JPEG code. That might give us a rough indication on how much work it will be to apply through the entire code base.

@Nadahar wrote above:

As I remember it, @drewnoakes wanted to minimize the number of places where such an object would have to be passed on.

Related to that, I have a concrete implementation question: Shall I add MetadataContext to the TagDescriptor class? Or would the Locale be sufficient there?

Option 1: MetadataContext

public class TagDescriptor<T extends Directory>
{
    @NotNull
    protected final T _directory;
    @NotNull
    protected final MetadataContext _context;

    public TagDescriptor(@NotNull T directory, @NotNull MetadataContext context)
    {
        _directory = directory;
        _context = context;
    }

Option 2: Locale

public class TagDescriptor<T extends Directory>
{
    @NotNull
    protected final T _directory;
    @NotNull
    protected final Locale _locale;

    public TagDescriptor(@NotNull T directory, @NotNull Locale locale)
    {
        _directory = directory;
        _locale = locale;
    }

I am leaning towards Option 2 in this case, but I'm not sure if there is a need for other state/settings/etc here.

drewnoakes commented 3 years ago

I'll introduce the MetadataContext interface

Please make it a class. It's easier to version a class over time than an interface. If we add to an interface, it breaks all existing implementations. I don't think this would be an extension point for external code either.

and create a separate branch to apply it in the JPEG code

I'm happy for you to add it in the same branch. It'll be easier to see how it works that way, and change it before it is merged if it needs to.

Shall I add MetadataContext to the TagDescriptor class? Or would the Locale be sufficient there?

I'd suggest adding the whole context. There may be other bits on it we want in future.

rdvdijk commented 3 years ago

Thanks for the swift reply. I'll work on this and report back when I think the branch is at a point to discuss some more details.

rdvdijk commented 3 years ago

Did some work on this. My approach was to start at JpegMetadataReader, and work my way down to all the Readers that were called from there. I passed down the MetadataContext to those Readers, and in turn to their Directory and Descriptor classes. I've tried to find all of the DecimalFormat, String#format and String#toLowerCase/toUppercase calls, and configure those using the locale in the new MetadataContext class.

It's not all done yet, but it's a start. Here is the progress as far as I can tell:

@drewnoakes Can you see if this is going in the right direction?

Things of note:

rdvdijk commented 3 years ago

@Nadahar Also curious what you think of the changes so far.

Nadahar commented 3 years ago

This is quite a lot to look through. Once things I've noticed already is the chosen approach to preserve the existing API. This is only my opinion, and others might not share it, but I would have done it slightly differently.

Instead of requiring the MetadataContext instance in method calls, I would make it @Nullable everywhere, and make the code that use it be prepared that it might be null instead. I would then make sure that as long as it is null no behavior would change at all, the existing code should run as it is when that's the case. I would make overloads of methods (similar to what you've done) where all the existing methods would simply call the corresponding new method with null for MetadataContext.

I don't know if it makes much practical difference, but I prefer not to create objects when they're not needed, and I think using null as "preserve current behavior" would make it easy to make sure that nothing actually changes for those that use it without using MetadataContext.

Here's an example to illustrate what I mean, here is the "pre context version":

public SomeType foo(SomeOtherType bar) {
  ...
}

This is the corresponding "post context version":

public SomeType foo(SomeOtherType bar) {
  return foo(bar, null);
}

public SomeType foo(SomeOtherType bar, @Nullable MetadataContext context) {
  ...
}

Don't make any changes based on my opinion alone though.

rdvdijk commented 3 years ago

@Nadahar Thanks for the notes, much appreciated.

Instead of requiring the MetadataContext instance in method calls, I would make it @Nullable everywhere, and make the code that use it be prepared that it might be null instead.

I personally disagree with that. I really dislike passing along null everywhere, and would rather avoid that at all costs. (The inventor of "null", Tony Hoare, even called it his "billion-dollar mistake" :wink: )

Allowing null would mean that we'd have to add a null-check in many places throughout the code, and make decisions on what to do when we do not have a context. Having a "sensible default context" made more sense to me.

but I prefer not to create objects when they're not needed

My approach was to create one MetadataContext at the most, just as there is one Metadata instance being passed around. The overhead of such an object instance would be minimal.

deckerst commented 2 years ago

@rdvdijk I hope this PR will eventually reach maturity and get merged. Is there any blocker?

rdvdijk commented 1 year ago

@rdvdijk I hope this PR will eventually reach maturity and get merged. Is there any blocker?

I do not think there are any blockers, but discussion here kind of just stalled two years ago.

What do you think, @Nadahar and @drewnoakes ? Should I spend some time to bring this PR back to life?

rdvdijk commented 10 months ago

Ping @Nadahar and @drewnoakes .

Today we ran into something similar: a difference in system timezone caused unexpected timestamps. This is something that could also be fixed by the changes in this branch, or a subsequent pull request that adds configurable TimeZone support.

I would like to make progress with this PR, but I really need some more feedback if this PR is heading in the right direction. I'd be happy to take another approach if that is preferred.