kwhopper closed this 6 years ago
Thanks Kevin, this looks great.
Before merging it I wanted to ask one question. Given that the performance issue surfaces while we are building up the XMP tree, and applies to any node that contains many children, I wonder if it's possible to explore a solution that adds all children in a single call, rather than adding them one at a time. That is, using something like `AddRange(IEnumerable<XmpNode>)` instead of `Add(XmpNode)`, and whether this new method could perform the validation more efficiently.
The motivation for such a solution would be to avoid the memory overhead of an additional field on every `XmpNode`, and an additional dictionary for every `XmpNode` with children.
By now you are more familiar with this code than me, so I will defer to your judgement.
Adding the local lookup per node isn't a structural change to how the tree is created, whereas something like an AddRange function does change how the original code worked. The ParseRdf class does a lot of checking (and even changes those rdf:li node names) after the fact on a per-node basis - and some of that checking is done by the new node itself. We would have to see if that exact sequence could be done over a range.
One other thought I had was to keep a similar dictionary all the way up at the root, with string paths (or something like them) as keys pointing at child nodes. In that scenario, there would only ever be one dictionary.
I'm leery of most options that fundamentally change the original Java code, mainly because I'm no XMP expert. Let me think about it (shortly) if that's ok.
Drew - I looked a bit, and messing with the overall flow feels a bit dangerous. However, another compromise idea is to only new up the Dictionary once there are 'enough' nodes in the children List that LINQ search performance suffers. This would likely drop the number of Dictionary instances in most cases to nearly zero. I've seen a number of performance posts but couldn't find a consensus on where the line is. It sounds like it might be as few as four or five, but there is overhead to create the Dictionary in the first place. I'm thinking around 10 is a decent compromise.
If you want, I can work that up as either a new branch commit or modify this one. The "Find" method inputs will go back to List
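To make the compromise concrete, here's a minimal sketch of the lazy-lookup idea in the style of the original Java XMPCore. All names (`Node`, `findChild`, the threshold of 10) are illustrative assumptions, not the actual XmpCore API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: keep a plain list for small child counts, and build a
// name->node map lazily once the list grows past a threshold. Most nodes
// stay below the threshold and never pay the dictionary's memory cost.
class Node {
    private static final int LOOKUP_THRESHOLD = 10; // rough cutover point

    final String name;
    private final List<Node> children = new ArrayList<>();
    private Map<String, Node> lookup; // null until threshold reached

    Node(String name) { this.name = name; }

    void addChild(Node child) {
        children.add(child);
        if (lookup != null) {
            lookup.put(child.name, child);
        } else if (children.size() >= LOOKUP_THRESHOLD) {
            // Build the map once, covering all children added so far.
            lookup = new HashMap<>();
            for (Node c : children)
                lookup.put(c.name, c);
        }
    }

    Node findChild(String childName) {
        if (lookup != null)
            return lookup.get(childName);
        for (Node c : children)          // linear scan is fine for small lists
            if (c.name.equals(childName))
                return c;
        return null;
    }

    int childCount() { return children.size(); }
}
```

The trade-off is that small nodes keep the old linear scan (cheap at that size), while large nodes get O(1) lookups without every node carrying a map.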
That sounds like a great solution, and an excellent trade-off. The non-generic `System.Collections.Specialized.HybridDictionary` does something like this, though I'm not sure it maintains order. Its source code suggests it cuts over at 9 elements:

```csharp
private const int CutoverPoint = 9;
```
If you're happy to explore this in code, I don't mind how you do it re branches/commits. Feel free to just push more commits on this branch if that's easiest.
It's null-check heavy, but seems to get the job done. Feel free to comment or modify. Thanks
Very nice!
Pushed 5.1.3.1 to NuGet. Will test it out in MetadataExtractor now and verify the performance issue has been solved.
Great! The Dictionary lookup is cool, but it might be deceptive for this particular image. Ignoring the "rdf:li" elements in the lookup is what makes it fast. However, any XMP documents in the future that have a good number of non-array children should also perform much better.
drewnoakes/metadata-extractor-images#18
Please review; copied from a working version but not tested after the copy. If I read the intention of other XmpCore code correctly, there is no reason to keep "rdf:li" array nodes in a search list. Ignoring them in the internal Dictionary makes a search very fast. At least for child nodes, this should keep a search very fast even if there are many non-array child nodes.
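The rdf:li filtering idea can be sketched roughly like this. This is a simplified illustration with hypothetical names (`ChildIndex`, `register`, `find`), not the actual XmpCore implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: RDF array items all share the name "rdf:li", so indexing
// them by name is useless -- callers address array items by position instead.
// Keeping them out of the map means a name lookup stays fast even when a node
// has thousands of array children.
class ChildIndex {
    private static final String ARRAY_ITEM_NAME = "rdf:li";
    private final Map<String, Integer> byName = new HashMap<>();

    /** Index a child by name, unless it is an rdf:li array item. */
    void register(String childName, int position) {
        if (!ARRAY_ITEM_NAME.equals(childName))
            byName.put(childName, position);
    }

    /** Returns the indexed child's position, or -1 if not indexed. */
    int find(String childName) {
        Integer pos = byName.get(childName);
        return pos != null ? pos : -1;
    }

    int size() { return byName.size(); }
}
```

Because only uniquely named (non-array) children enter the map, its size stays proportional to the number of distinct property names rather than the total child count.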