bpsm / edn-java

a reader for extensible data notation
Eclipse Public License 1.0
100 stars 24 forks source link

Keywords should be interned during parsing #4

Closed bpsm closed 12 years ago

bpsm commented 12 years ago

"If the target platform supports some notion of interning, it is a further semantic of keywords that all instances of the same keyword yield the identical object."

See also https://github.com/edn-format/edn/blob/180328d96e0e48176618e2a92044bb23c5528593/README.md#keywords

mikera commented 12 years ago

I've implemented this before in another library so know what to do - want a patch for edn-java for this one?

bpsm commented 12 years ago

fixed. Also made interning configurable.

bpsm commented 12 years ago

@mikera I just comitted a fix before I saw you message. Please feel free to make suggestions on how it could be improved.

mikera commented 12 years ago

Just had a look, main comments would be:

Happy to fix up some of this if needed, let me know

mikera commented 12 years ago

Made a couple of quick commits in my branch which make the interning static. Hopefully this helps!

I think this is incompatible with the idea of configurable interning, since the interning code needs to be global. My view would be that it is more important to get a solid interning implementation working by default rather than having it configurable - it's one of those things where there is a "right way to do it" that should serve 99.99% of users.

bpsm commented 12 years ago

I considered global (static interning), but decided against it. I'll have a look at your commits and see if they change my mind.

As you've noted, global interning introduces concurrency issues, which Parser-scoped interning does not. A single Parser is not intended for use by more than one thread, nor would it make sense to support this.

Global interning can, in principle, use unbounded amounts of memory in a way that depends entirely on the peculiarities of the input data. I'd rather not encourage that.

Parser-local interning has a chance of staying the young generation. The maps and their content can be collected as soon as we're done with the parser. This seems like a potential win.

It's not that I'm barring the door to doing interning globally, it's just that it's not clear to me yet that it is needed.

This depends on what the aim is of the interning, and that's not clear to me yet. Right now, I just consider it a slight memory savings and efficiency hack, but that's surely influenced by my use case: I tend to parse a single file containing potentially thousands of homogenous maps. So, the same keyword occurs thousands of times. Clearly interning at the parser level is already a win here.

On the other hand, if I were parsing each of these same maps with a separate parser because I was retrieving each with an individual request, then I wouldn't see any gain from interning, but would that be such a loss? I'm not sure.

If interning means that we can always safely do this:

if (someKeyword == SOME_KEYWORD) { ... }

Then we'll need really global interning. (No public constructor, just a factory method for the class Keyword!)

If, on the other hand we're just hoping for less memory usage for large data sets and occasionally more efficient execution of:

if (someKeyword.equals(SOME_KEYWORD)) { ... }

Then we don't need global interning.

bpsm commented 12 years ago

@mikera I've reworked the Keyword interning such that it's global. I borrowed Clojure's approach to this, which you described briefly.

I've been thinking about it all afternoon, and decided that the most likely intended benefit for interning is to facilitate fast map lookups. To get that benefit, all keywords need to be interned and the equals method needs to know that it can rely on reference equality.

I've got an idea cooking for hooking into the scanner such that client code can exert some influence when token values are created by the scanner. This should allow clients who would benefit from interning of symbols or short strings to add this in themselves. (My application would benefit from interning symbols.)

mikera commented 12 years ago

Cool I think global interning is the right way to go - as you note the value of global interning is that it allows you to make map lookups.

The fast map lookup is actually a special case of a more general advantage of global interning - it allows you to make some stronger assumptions about keyword identity (because an equality check can become an identity check). So you get benefits for set membership tests, sorting etc.