drewnoakes / metadata-extractor

Extracts Exif, IPTC, XMP, ICC and other metadata from image, video and audio files
Apache License 2.0
2.56k stars 480 forks source link

Unsafe constant definitions. #511

Closed garretwilson closed 3 years ago

garretwilson commented 3 years ago

Your library looks comprehensive; I'm looking forward to start using it.

I'm just exploring the API a bit as I investigate connecting it to my code. I see some places where you're using some really old practices (I used to use them, too) that are unwieldy to use with modern code and more importantly, unsafe.

In general arrays are something we want to keep away from in Java for lots of reasons. They don't fully support generics, and they aren't Iterable (even though the compiler lets us use them in an enhanced for loop). But worst of all, they are unsafe; anyone can modify them at any time.

When I see this in com.drew.imaging.FileType:

String[] getAllExtensions()

I really, really wish it were this:

Set<String> getAllExtensions()

That's more semantically correct, anyway, because there should be no duplicates. But here is the code:

    private final String[] _extensions;

…

    @Nullable
    public String[] getAllExtensions()
    {
        return _extensions;
    }

This returns the bare array that should be been encapsulated and hidden from the API consumer. This means anybody can do something sinister like this:

FileType.Jpeg.getAllExtensions()[2]="bmp"

And now everyone using the "constant" will think that "bmp" is an expected filename extension for JPEG files!

If you absolutely must use arrays, you must create a defensive copy of the array each time you return it. Otherwise you've essentially made your "constant" into a global. mutable variable shared by the entire JVM.

I also just now saw that the getAllExtensions() method is annotated as @Nullable. At first I thought it was a type, but indeed for the file type Cfbf, for example, you are indeed initializing the extensions with null and returning that from getAllExtensions(). Why on earth do that? null has completely different semantics. If the file type has no extensions, you simply want to return an array or collection that is empty, because an empty collection is what indicates "none".

This method should instead return an empty collection (or array, if you really must use them) instead of null if there are no extensions. Why should the caller have to do an extra null check just to find out if there are any extensions? null here means something like "I don't even want to tell you if there are any extensions". So the caller has to do a null check to see if there are "no extensions", but then after finding that the value is non-null the caller still has to check the link of the array to see if there are "no extensions". There is no purpose in having two ways to represent the same thing, at the risk of NullPointerExceptions for callers who didn't realize this unexpected behavior of the API.

The fact that getExtensions() returns a @Nullable is surprising, because the code goes out of its way to throw null into the mix! The FileType constructor uses varargs for the extensions. So Cfbf could have simply left off the extensions altogether, and it would all have have worked fine, and getExtensions() would have returned an empty array as we would expect.

    Cfbf("CFBF", "Compound File Binary Format", null),

Instead it takes extra effort to throw a null into the mix:

    Cfbf("CFBF", "Compound File Binary Format", null, (String[])null),

(The fact that this is a compound file type doesn't change the fact that "it has no extensions", i.e. an empty array. There is no reason for a null array of extensions.)

Anyway that's one of the first class I looked at. I still haven't called any code. I look forward to finish hooking it up and trying it out.

kwhopper commented 3 years ago

Would you be willing to create a pull request with your thoughts implemented so others can evaluate it? Thanks

Nadahar commented 3 years ago

I think many of the things you point to is more about "style" than correct/incorrect. I prefer to use Collections over arrays myself in most cases, but when I do low-level I/O stuff I tend to prefer arrays for efficiency.

The fact that arrays are always mutable is annoying, since you often want them to be immutable. Java should have some way to make them "read only". I don't know if one can really claim that one is more "unsafe" than the other though, a library isn't something meant for end users. Why would you modify the content of the array?

Collections doesn't solve it either unless you wrap them in unmodifiable. That's still potentially problematic thread-safety wise if the original is kept "un-wrapped". So, you really need to build the content, wrap it and then store the wrapped version. That way you can just hand the original out instead of making a copy every time access is needed. It does add some code though, and one could always ask how much effort it is worth creating such guardrails.

Even if you do all this and use wrapped Collections, somebody can still modify the content using reflection. So, I think the whole idea of making code "idiot proof" is questionable. I think you need to protect against things that are very likely to go wrong, and give developers the information they need to do it right, but if they still insist on trying to find a way to break things... that should really be up to them.

null here means something like "I don't even want to tell you if there are any extensions"

When it comes to returning null I completely disagree. I think null is a perfectly valid thing to return for "nothing". I always assume that everything can return null (except methods returning primitives) unless it's specifically annotated or otherwise indicates that it doesn't. I think creating objects just for returning an empty array or list is wasteful and wrong. I use Collections.emptyXXX() to avoid creating empty objects for in such cases when I don't think returning null is acceptable or desirable, but I think null is perfectly fine in many situations.

drewnoakes commented 3 years ago

First, thank you for your interest in the library.

Many of the patterns in this library date back to around 2001 when I wrote the first version.

Nowadays this library is used by a lot of people and we have to weigh up the benefits of changes against the compatability burden such changes put on users.

The case of exposing mutable inner state is certainly an issue in theory. You are the first person to call it out, and it has not as yet been an actual problem for you.

Sonatype shows 391,541 downloads last month. Given such usage there has to be a very compelling reason to break the API.

As for making defensive copies I do not believe the performance overhead is justified.

As for treatment of null vs. empty collections, again this is a matter of preference. Making a change here could break users in unexpected ways. There would need to be a clear benefit, which I can't see here.

garretwilson commented 3 years ago

Hi, everyone. First I'm very impressed that I received replies so quickly! I'll answer some of the questions various people raised above:

Would you be willing to create a pull request with your thoughts implemented so others can evaluate it? Thanks

Sure, I would love to! First I need to see how pervasive these sort of issues are, though. I would gladly fix one or two little problems here or there, but I really don't have time to go through and do a revamp if these sort of things appear all over the place. I haven't had time to even call the library yet from my own code to test it. 😄

I don't know if one can really claim that one is more "unsafe" than the other though, a library isn't something meant for end users. Why would you modify the content of the array?

Whether a library is meant for end users is not relevant to good program design in general. Information hiding and preventing mutable leakage is been a bedrock of solid development practice for a decade or two now. The benefit of immutability is a well-established modern best-practice. If you disagree you're going up against the majority of the expert opinion in the software development community. I won't argue the point; many authors have done that already. I would highly recommend Effective Java, Third Edition, for example.

When it comes to returning null I completely disagree. I think null is a perfectly valid thing to return for "nothing".

Here as well, returning an empty collection instead of null is a design principle so well established for so long that you would be hard pressed to find any experts agreeing with you. You can disagree, but you're disagreeing with people like Robert Martin, Josh Block, and Martin Fowler, as well as pretty much everyone on Stack Overflow.

Many of the patterns in this library date back to around 2001 when I wrote the first version.

I can certainly identify with that! You should see my code from around the same time—I used arrays all over the place, and have only recently been able to fix some of that old code.

Nowadays this library is used by a lot of people and we have to weigh up the benefits of changes against the compatability burden such changes put on users.

Oh, I wasn't proposing breaking the API. I was only proposing making it safer!

The case of exposing mutable inner state is certainly an issue in theory. You are the first person to call it out, and it has not as yet been an actual problem for you.

Most of these things are only "issues in theory", until one bites you. The more "issues in theory" you have, the higher risk you have. I saw a problem and I was pointing it out so that it could be fixed.

As for making defensive copies I do not believe the performance overhead is justified.

Performance overhead? This isn't low-level image manipulation. This is a public, shared constant. Allocating a few bytes and copying them in Java to read some constants is the least of your worries. I challenge you to find a single program in which you could even measure the difference!

This is just standard solid design practice following most if not all the experts in the field (mentioned above). That anyone would disagree was unexpected. So to revisit the question above:

Would you be willing to create a pull request with your thoughts implemented so others can evaluate it? Thanks

Most of my design approach follows the sort of best practices you would find in Effective Java, Third Edition. If there isn't a shared value for such design practices, I don't know if it would be worth the effort if there is a disagreement in principle before even starting.

drewnoakes commented 3 years ago

Some of our users run the library over enormous sets of images and are sensitive to GC pressure.

Personally I'm not very dogmatic about programming and systems design. I was moreso in the past, perhaps. I see the value in ideas, and draw upon them when needed on a case-by-base basis.

For what it's worth I used to work with Martin Fowler and have discussed this exact topic with him.

Most of these things are only "issues in theory", until one bites you.

Sure, but we have lots of users and no one has ever reported an actual problem related to this. I'm not disputing it'd be nice to fix some of these things up. The question is whether it makes sense for this library.

Note too that we aspire to support Java 1.6, so are limited in the kinds of collections we can utilise.

This project aims to welcome all improvements, modulo the compat and perf constraints laid out. If you have ideas, please share them. I would prefer to talk specifically rather than generally.

garretwilson commented 3 years ago

Sure, but we have lots of users and no one has ever reported an actual problem related to this.

Coincidentally I'm currently reading a book on SSL/TLS security, and this statement makes me think of things like Heartbleed and other OpenSSL bugs. (Sorry, I couldn't resist. But … the OpenSSL developers could have said the same sort of thing.)

This project aims to welcome all improvements, modulo the compat and perf constraints laid out. If you have ideas, please share them. I would prefer to talk specifically rather than generally.

Fair enough. Tell you what, let me start using the library and see how much of this type of stuff there is. As I mentioned above, if there are just a few things here and there, then I'll be glad to file a pull request. Cheers.

Nadahar commented 3 years ago

@garretwilson I think you should find another audience for your philosophical discussion. I don't think anyone here have any problem understanding how it could lead to problems if the developers using it insist on doing stupid things. This isn't about the general principle, it's looking at the big picture. How likely is this to cause real problems (the history of the project and countless others give some clues)? What is the cost of implementing such guardrails, both to the project and to efficiency. It all comes down to cost/benefit. There's really no limit to how far one could go in making such protections, but somebody has to chose to invest their time in doing it, and they must think it's a worth that time. Given how many other things that's usually just waiting for some attention in most projects, I think you'll find that it ends up pretty far down on the list of priorities.

That said, I don't think your particular example is very good. As said before, defensive copies is too expensive and Collections (which would break the API) are also mutable by default. As such, I'm wondering what exactly is the solution you're preaching? As I see it, your example would be a good learning opportunity to the parties involved, both when it comes to sorting or otherwise modifying objects you don't "own" (make a copy before modifying it instead), when it comes to thinking that making reports for a "live" system is somehow harmless, and most of all to those that wrote the Python downstream parser/processor not to make brittle code based on assumptions. It also shows why tests can only do so much - they can never replace good old thinking.

garretwilson commented 3 years ago

First let me apologize for prolonging the discussion longer than it was productive. I simply saw a flaw in an exciting new library and reported it. I honestly expected the response to be, "Good catch; we'll fix it in an upcoming release." I was surprised that there would be any controversy.

I think you should find another audience for your philosophical discussion

No problem. I have deleted the longer explanation above.

I'm wondering what exactly is the solution you're preaching?

I don't think it's very mysterious what I'm proposing:

@NotNull private final Set<String> extensions; //immutable

…

/**
 * @return all available extensions as an array.
 * @deprecated This method will be removed in an upcoming major version release;
 * please change your code to use {@link #getExtensions()}.
 */
@Deprecated
@NotNull //changed from @Nullable, but that shouldn't break any code
public String[] getAllExtensions()
{
  //if you just love nulls, you can check `isEmpty()` and short-circut by returning null
  return extensions.toArray(new String[extensions.size()]);
}

/** @return The set of all available extensions.  */
@NotNull
public Set<String> getExtensions() {
  return extensions;
}

defensive copies is too expensive

If you were to provide me with some objective, measurable, reproducible demonstration of whatever you mean by "too expensive", my response would be, "hey, this is an interesting problem; let me see what solution I can come up with to address it". Otherwise I can't create a solution to some amorphous hypothetical problem. I'm aware of garbage collection issues in general, but frankly I can't even fathom a scenario in which a defensive copy of an array when enumerating three or so file extensions would cause any measurable impact on the program as a whole.

Was there a particular application you're thinking of that this would affect? Does the application enumerate the file extensions every single time it updates a pixel in a 4K movie or something? I just don't get it. But I'm more than happy to solve this "problem" if you can demonstrate it to me.

To some extent that's irrelevant anyway, because as I said my opinion is that you should be migrating towards collections over the long term (nobody said it has to be tomorrow) as the code above illustrates.

At this point I've made clear what I'm proposing. It's also clear that there are fundamental differences in our design philosophies, so I'll leave it at that and let you make whatever decision you like. It's not my library after all. I'm just offering suggestions.

In any case congratulations on what seems to be a long-running, maintained library that is unique and popular. I wish you all the best on it, and I hope it works for my application, too.

Nadahar commented 3 years ago

@garretwilson I think you've gotten it a bit wrong. I'm merely a user of the library that has submitted a few contributions in the past. I don't "speak for the project" in any way.

I didn't mean that you should delete anything either, I only tried to say that I don't think it's useful to try to convince anybody of your philosophy here. This should be about concrete issues with the library.

@drewnoakes Is the author and project owner, so he's the one that makes the decisions. I think he has already expressed his stance, not breaking the API and avoiding unnecessary object creation trumps the minor inconvenience caused by the current solution. But again, this is just my interpretation, and I could be wrong.

When it comes to the "don't prematurely optimize" philosophy which you seem to be promoting, I happen to partly disagree with it. I say partly because I agree that it shouldn't be taken to the extremes - otherwise we would all be writing assembly. But, I believe in being conscious and not needlessly wasteful when coding. I think that teaching developers to do the opposite results in very slow and wasteful code, which I think we see in a lot of "modern" software. I get that it is convenient, and it might even make sense from a capitalistic point of view (it's cheaper to buy more hardware than hiring good developers). That's not the same as it being the "right choice" in every circumstance.

kwhopper commented 3 years ago

@garretwilson - There was nothing wrong at all with pointing out the issue. I think you expend a lot of theoretical capital instead of pointing out an in-context solution. I'm not criticizing the effort, only that it's harder to persuade in long prose than with actual code samples.

Java is not my thing by any means, but first it looks like "Set" is an interface so you have to use one of the underlying classes that implement it concretely. Second, it seems like you would still be able to change the contents of that underlying Set just like you can for the array (but don't quote me). Third, making that variable final doesn't necessarily change the second point. I think there has be some kind of read-only class in the mix for this to work. All I could find was Collection.unmodifiableList() as a candidate, especially since it's just a wrapper around the source list and would avoid new allocations.

Let me repeat - I'm not a Java guy so that could all be wrong.

However, this issue does not exist in the .NET implementation so there was obviously different thinking in the port. It uses a combination of readonly and IEnumerable to keep it locked away:

https://github.com/drewnoakes/metadata-extractor-dotnet/blob/a36ac5f472a17fa85fcf3dad3fd65bae1066744f/MetadataExtractor/Util/FileType.cs#L260

garretwilson commented 3 years ago

Java is not my thing by any means …

Ah, I see. I started the discussion assuming that everyone here was well versed in Java. (No problem—there are plenty of languages I'm not an expert in, either.)

… it looks like "Set" is an interface so you have to use one of the underlying classes that implement it concretely.

Yes! One of the foundations of modern software design (not just in Java) is "program to interfaces".

… it seems like you would still be able to change the contents of that underlying Set just like you can for the array … [and] making that variable final doesn't necessarily change the second point ….

Yes and yes. That's why you need to use either a read-only implementation of a set or a read-only wrapper. Later versions of Java have convenient Set.of() static factory method, but to keep everything backwards-compatible with older Java versions, you could use Collections.unmodifiableSet(). Yes, it is a wrapper, but the secret is that you don't expose the wrapped array and you store a final reference to the immutable wrapper, making it all safe and immutable!

I explain this extensively in my software development course, in the lesson on collections. (Jump down to the section on "Immutable Collections".) If you have any questions I'll be happy to clarify them, although I'll note that that each lesson depends somewhat on the material in the previous lessons.

kwhopper commented 3 years ago

There appears to be nothing special about using Set vs. wrapping unmodifiableList() around the existing String Array (or making a List from the start). In either case, it's a breaking change to the API. I think you should move this discussion to a Pull Request and see where it leads. There's not much point in dwelling on the theoretical if the API will change, and a PR at least would have functioning code to review.

garretwilson commented 3 years ago

There appears to be nothing special about using Set vs. wrapping unmodifiableList() around the existing String Array

But there's no point in wrapping the set around the string array if you expose the string array! It's no longer immutable. That's the whole whole point of this ticket.

(or making a List from the start).

And that's what I did in the code above. In order to hide the array, you would have to create a defensive copy anytime someone asked for it (in order to keep the existing API, even temporarily). Just creating a set from the start you can return the immutable set and not have to create anything! Sure, to maintain the existing API the array version will have to create an array and copy from the immutable set. But the idea is to migrate toward the collection-version, deprecating the array version.

Note that many callers will probably turn the array to some sort of Collection anyway, so this approach is even better because it it's already in the best form for them to use. (Arrays are very unwieldy to work with.)

(By the way, you keep mixing "sets" and "lists"; that confuses me." Semantically a set would be more appropriate, but then you would need to store your "preferred" filename extension separately. There are lots of ways to do it. But sets and lists are not the same.)

There's not much point in dwelling on the theoretical if the API will change, and a PR at least would have functioning code to review.

  1. My proposal would not be a breaking change to the API.
  2. I already gave you sample code in a comment to this ticket!

When I filed this ticket I just assumed that everyone was well-versed in Java, and I was frustrated that everyone was so against basic modern software development principles. Now the more everyone discusses this, it's becoming clear that some simply aren't understanding what I'm proposing. So I'll try my best to file a pull request. I can't guarantee when it will be, though.

If none of this is making sense to someone, perhaps that person could at least review my lessons on interfaces, generics, and collections just so we can be discussing from the same vantage point.

Nadahar commented 3 years ago

An array isn't a List so it can't be wrapped. There's no way to make an array immutable in Java, sadly. The "solution" is to make a copy every time the getter is called, which means creating a lot of unnecessary array copies.

Nadahar commented 3 years ago

@garretwilson I think I've understood what you've meant the whole time, and your example code removed any doubt. It however doesn't stick within the constraints of this post.

That is why I was wondering what exactly you were advocating. I couldn't see a way to improve it without either breaking the API or making copies, and it seems you couldn't either.

kwhopper commented 3 years ago

There appears to be nothing special about using Set vs. wrapping unmodifiableList() around the existing String Array

But there's no point in wrapping the set around the string array if you expose the string array! It's no longer immutable. That's the whole whole point of this ticket.

I described this imprecisely; coding through prose is bad for communication.

The least intrusive change is load the String values into a private List from the start instead of an Array. The public variable would be an unmodifiableList() instance wrapped around the private variable. Set doesn't really play into this.

It's still an API breakage.

garretwilson commented 3 years ago

It's still an API breakage.

~@kwhopper my proposal does not break the API. It simple doesn't.

I couldn't see a way to improve it without either breaking the API or making copies, and it seems you couldn't either.

@Nadahar I am going to repeat two things that I said before. The first is this:

If you were to provide me with some objective, measurable, reproducible demonstration of whatever you mean by "too expensive", my response would be, "hey, this is an interesting problem; let me see what solution I can come up with to address it". Otherwise I can't create a solution to some amorphous hypothetical problem. I'm aware of garbage collection issues in general, but frankly I can't even fathom a scenario in which a defensive copy of an array when enumerating three or so file extensions would cause any measurable impact on the program as a whole.

Was there a particular application you're thinking of that this would affect? Does the application enumerate the file extensions every single time it updates a pixel in a 4K movie or something? I just don't get it. But I'm more than happy to solve this "problem" if you can demonstrate it to me.

And the other important thing I said was this:

First let me apologize for prolonging the discussion longer than it was productive.

I apologize for not knowing when enough is enough. I really, really do. This has just wasted a lot of everybody's time. I will not file a pull request. I will close the ticket. I have work to do.