bpsm / edn-java

a reader for extensible data notation
Eclipse Public License 1.0
100 stars 24 forks source link

EDN List, Vector types indistinguishable due to common RandomAccess interface #32

Closed abernard closed 11 years ago

abernard commented 11 years ago

This may be a design decision, but it is not clear so here goes.

The documentation states "Lists "(...)" and vectors "[...]" are both mapped to implementations of java.util.List. A vector maps to a List implementation that also implements the marker interface java.util.RandomAccess."

However, due to this commit which changes the backing of the List type to ArrayList instead of LinkedList, there is no way to distinguish between Lists or Vectors when "roundtripped" through the parser.

This means that "(1, 2, 3)" and "[1, 2, 3]" will parse to identical values. Printing back the parsed List "(1, 2, 3)" to EDN via the printer will yield the Vector "[1, 2, 3]".

This is not caught in the unit tests because the method assertEquals merely tests that the parsed objects are equal (they are, as Vectors), but does not compare a second round of printing with the original string.

bpsm commented 11 years ago

Yes, that's a bug, though what this commit says is true: ArrayLists are indeed better than LinkedLists in that they use less memory and are faster to iterate.

Hanging the whole List/Vector distinction on the presence or absence of the RandomAccess marker interface is a bit of a hack. I'm not very happy with it and find myself wishing for Clojure-like meta-data. This strategy also presents a problem for edn-java-guava, since Guava's immutable list classes all implement RandomAccess.

Two alternatives that might be better than reverting the commit above to return java.util.LinkedList implementations:

  1. Return a java.util.List which merely wraps a java.util.ArrayList, but does not itself implement RandomAccess.
  2. Return a custom implementation of java.util.List which chunks by actually being a list of fixed-size chunks under the hood.

The downside of doing nothing is what this issue is about. The downside of 1 is the additional indirection introduced for delegation. (Though Iteration can be direct.) The downside of 2 is the implementation effort and probability of introducing bugs.

I'm thinking the first approach sounds like it's worth it; the second is unlikely to be worth it.

What do you think?

bpsm commented 11 years ago

See https://github.com/bpsm/edn-java/commits/feature/issue32

mikera commented 11 years ago

Why not just do a lightweight custom implementation of ArrayList? I don't think chunking actually buys us anything here, indeed it's probably going to add more overhead than an extra layer of indirection.

If you are interested in that approach, I think I've got some old code hanging around that would get us 80% there....

abernard commented 11 years ago

Ben,

I agree with your rationale. LinkedList has poor performance characteristics and option 2 seems like asking for trouble. Your provided solution is quite clean.

On the issue of this handling Guava collections, I wonder if the code to determine List or Vector should be separated out into an interface. This would be something like:

interface SequenceTypeSelector {
    boolean isVector(Object o);
}

The selector could be attached to the ProtocolBuilder for the Printer (with a default implementation provided of course). Extending SequenceTypeSelector would allow custom dispatch for types, allowing the simple if-else select on java.util.RandomAccess, or a Map lookup for more complex type hierarchies.

bpsm commented 11 years ago

I'm going to release https://github.com/bpsm/edn-java/commits/feature/issue32 as 0.4.2. Making distinguishing between list and vector pluggable as been moved to issue #33 and tentatively assigned to milestone 0.5.0.