kwhopper closed this 6 years ago
Thanks Kevin, this looks great.
Before merging it I wanted to ask one question. Given that the performance issue surfaces while we are building up the XMP tree, and applies to any node that contains many children, I wonder if it's possible to explore a solution that adds all children in a single call, rather than adding them one at a time. That is, using something like `AddRange(IEnumerable<XmpNode>)` instead of `Add(XmpNode)`, and whether this new method could perform the validation more efficiently.
The motivation for such a solution would be to avoid the memory overhead of an additional field on every `XmpNode`, and an additional dictionary for every `XmpNode` with children.
By now you are more familiar with this code than me, so I will defer to your judgement.
Adding the local lookup per node isn't a structural change to how the tree is created, whereas something like an AddRange function does change how the original code worked. The ParseRdf class does a lot of checking (and even changes those rdf:li node names) after the fact on a per-node basis - and some of that checking is done by the new node itself. We would have to see if that exact sequence could be done over a range.
One other thought I had was to keep a similar dictionary all the way up at the root, with string paths (or something like them) as keys pointing at child nodes. In that scenario, there would only ever be one dictionary.
I'm leery of most options that fundamentally change the original Java code, mainly because I'm no XMP expert. Let me think about it (shortly) if that's ok.
Drew - I looked a bit, and messing with the overall flow feels a bit dangerous. However, another compromise idea is to only new up the Dictionary once there are 'enough' nodes in the children List that LINQ search performance suffers. This would likely drop the number of Dictionary instances in most cases to nearly zero. I've seen a number of performance posts but couldn't find a consensus on where the line is. It sounds like it might be as few as four or five, but there is overhead to create the Dictionary in the first place. I'm thinking around 10 is a decent compromise.
If you want, I can work that up as either a new branch commit or modify this one. The "Find" method inputs will go back to List
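To make the compromise concrete, here's a minimal sketch of the lazy-lookup idea in the style of the original Java XMPCore. All names (`Node`, `findChild`, the threshold of 10) are illustrative assumptions, not the actual XmpCore API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: keep a plain list for small child counts, and build a
// name->node map lazily once the list grows past a threshold. Most nodes
// stay below the threshold and never pay the dictionary's memory cost.
class Node {
    private static final int LOOKUP_THRESHOLD = 10; // rough cutover point

    final String name;
    private final List<Node> children = new ArrayList<>();
    private Map<String, Node> lookup; // null until threshold reached

    Node(String name) { this.name = name; }

    void addChild(Node child) {
        children.add(child);
        if (lookup != null) {
            lookup.put(child.name, child);
        } else if (children.size() >= LOOKUP_THRESHOLD) {
            // Build the map once, covering all children added so far.
            lookup = new HashMap<>();
            for (Node c : children)
                lookup.put(c.name, c);
        }
    }

    Node findChild(String childName) {
        if (lookup != null)
            return lookup.get(childName);
        for (Node c : children)          // linear scan is fine for small lists
            if (c.name.equals(childName))
                return c;
        return null;
    }

    int childCount() { return children.size(); }
}
```

The trade-off is that small nodes keep the old linear scan (cheap at that size), while large nodes get O(1) lookups without every node carrying a map.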
That sounds like a great solution, and an excellent trade-off. The non-generic `System.Collections.Specialized.HybridDictionary` does something like this, though I'm not sure it maintains order. Its source code suggests it cuts over at 9 elements:

```csharp
private const int CutoverPoint = 9;
```
If you're happy to explore this in code, I don't mind how you do it re branches/commits. Feel free to just push more commits on this branch if that's easiest.
It's null-check heavy, but seems to get the job done. Feel free to comment or modify. Thanks
Very nice!
Pushed 5.1.3.1 to NuGet. Will test it out in MetadataExtractor now and verify the performance issue has been solved.
Great! The Dictionary lookup is cool, but it might be deceptive for this particular image. Ignoring the "rdf:li" elements in the lookup is what makes it fast. However, any XMP documents in the future that have a good number of non-array children should also perform much better.
drewnoakes/metadata-extractor-images#18
Please review; copied from a working version but not tested after the copy. If I read the intention of other XmpCore code correctly, there is no reason to keep "rdf:li" array nodes in a search list. Ignoring them in the internal Dictionary makes a search very fast. At least for child nodes, this should keep a search very fast even if there are many non-array child nodes.
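The rdf:li filtering idea can be sketched roughly like this. This is a simplified illustration with hypothetical names (`ChildIndex`, `register`, `find`), not the actual XmpCore implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: RDF array items all share the name "rdf:li", so indexing
// them by name is useless -- callers address array items by position instead.
// Keeping them out of the map means a name lookup stays fast even when a node
// has thousands of array children.
class ChildIndex {
    private static final String ARRAY_ITEM_NAME = "rdf:li";
    private final Map<String, Integer> byName = new HashMap<>();

    /** Index a child by name, unless it is an rdf:li array item. */
    void register(String childName, int position) {
        if (!ARRAY_ITEM_NAME.equals(childName))
            byName.put(childName, position);
    }

    /** Returns the indexed child's position, or -1 if not indexed. */
    int find(String childName) {
        Integer pos = byName.get(childName);
        return pos != null ? pos : -1;
    }

    int size() { return byName.size(); }
}
```

Because only uniquely named (non-array) children enter the map, its size stays proportional to the number of distinct property names rather than the total child count.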