GlenKPeterson / Paguro

Generic, Null-safe, Immutable Collections and Functional Transformations for the JVM
Other
312 stars 24 forks source link

Make Collections Implement Serializable #10

Closed sblommers closed 8 years ago

sblommers commented 8 years ago

First of all, awesome library! I played with it for a while and really like it. We currently have a library just like yours also inspired by Clojure, PCollections, Scala and of course ADHD Spiewak. We would like to freeze that library and take another route because of our upgrade to Java8.

Maybe be a strange question but are there any plans to make the collections Serializable?

GlenKPeterson commented 8 years ago

Great idea! I think... Wait; why weren't they Serializable already?

We could just slap the word Serializable on there and add the serialVersionUID and hope for the best! But I think we should write some tests to prove that they work as expected. Some details deserve a second look, like how the empty collections are implemented (they might be better made as enums because they automatically serialize and deserialize as singletons - thank you Josh Bloch).

Is there anything special about the way you are (de)serializing these things that might influence what makes a good test?

Functions

When Java 8 first came out, I implemented some Comparators as lambdas. But lambdas are not serializable, so when you supply such a Comparator to a TreeSet or whatever, then serialize and deserialize the collection, the Comparator comes back as null. What a surprise! Maybe the function interfaces in this project should be serializable too?

Equator

The Equator interface needs to be Serializable. Oops. I just looked at it and the default comparator in there is a lambda. Test case! Enum!

Tuples

I did not make these serializable because I wanted people to extend them and not have to make their resulting classes serializable (it would have forced all subclasses to be serializable). I think I like that decision, but I'm open to constructive argument.

Hmmm... Tuple2 to implements Map.Entry. That requires some thought. Subclass? Entirely separate class? This might be 50% of the changes in the project!

Serialization Format

Do you care about the format? Should they persist as edn? http://stackoverflow.com/questions/3301439/clojure-data-structure-serialization

Also might look at how this project handles serialization (it excludes TreeMap): https://github.com/sritchie/carbonite

Caveats

I feel like I'm more than half-way done with a RRB-Tree which I've been spending all my free time on (which isn't much). I really want to finish that. But your suggestion is in the top 3, the other one being a Java 7 version for Android devs. I guess I'd want to make the serialization changes first, so this suggestion is currently in the top 2.

Conclusion

We should do this! I don't think there will be that many changes (outside the tests) but I'm not sure how much more we'll uncover after more thinking. I don't expect anything from you, just if you have time to do it before I do... If you want me to merge your code, please match the style, strive for absolute minimal changes, and write tests for every meaningful line of code you add.

sblommers commented 8 years ago

Thank you for your explanation and enthusiasm.

In theory the collections should be Serializable but without marking them they probably won't be allowed to be (de/)serialized through Java the default way. I know that at least the members of the class that is being serialized should be marked Serializable as well. Function (if not being a member of the class) or Tuple (also not being a member of the class) do not need to be Serializable. If they are a member they need to be marked. It should be fine if lambda's are used in functions but I see that for example PersistentHashMap has a member Equator. But I think that with minimal work this could be done.

Format I don't care that much about serialization format as long as it is thin which most of the times they are when not in XML or JSON. But our focus is Java8 and the use of standard Serialization markers for use in GWT (browser)/EhCache/Hazelcast relying on default Serialization. In GWT you would need to write Field_Serializer classes cluttering the project.

Custom format The main reason I am agains a custom serialization format is that I really that UncleJim has no dependencies to other frameworks.

Example unit-test code To fast-check if an object serializes and deserializes correctly I use this generic test function:

protected <T> T roundTripEquals(T obj) throws IOException, ClassNotFoundException {
    // write
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(baos);
    oos.writeObject(obj);

    final byte[] data = baos.toByteArray();

    ByteArrayInputStream baip = new ByteArrayInputStream(data);
    ObjectInputStream ois = new ObjectInputStream(baip);
    final T read = (T) ois.readObject();

    assertTrue(obj.equals(read));

    return read;
}

I will try and spend some time on this with minimal work. If I come up with something sooner I will definitely share with you what I have and match your code-styles, without tests it's incomplete.

GlenKPeterson commented 8 years ago

OK, my bad: this is a bug, not an enhancement. APersistentMap, APersistentSet, and APersistentVector in the Clojure source all implement Serializable. My copies don't. Thanks to Robert Harvey and kdgregory on Programmers StackExchange (and to you!) for pointing this out.

In Clojure AMapEntry extends APersistentVector. I think in type-safe Java land, Tuple2 should not implement UnmodMap.UnEntry. Instead, a new class should implement UnmodMap.UnEntry<A,B>, extend Tuple2<A,B> and override the tuple equals implementation. That cleans up all the tuple equals implementations too, and might clarify some weirdness with Transform.toImMap().

As a bug, this feels like it trumps my hobby project of the RRB-Tree. I'll make an effort to see how much of this I can get done this weekend if you don't do it before then. I'm probably going to have to do the tuple stuff anyway due to the complexity of the tuple generator. I also want to add StaticImports.xform(String). I'll use your test code though - thank you for that!

GlenKPeterson commented 8 years ago

It is done, but it's in a separate branch for now with a SNAPSHOT build number (1.1.0-SNAPSHOT). If you follow that link I made a first pass at release notes. Thank you very much for your unit test - I used most of that verbatim and gave you a credit in the code comments.

I incorporated the new version into 2 of my work projects this weekend. I ran all the tests on the one project and they passed. I also tried out the other and it worked fine. You will not be the first person trying this code. :-)

Are you able to try this out and let me know if it fixes your issue(s)? If so, I might be able to merge it to the main branch and post it to Maven Central as early as next week. Thank you again for reporting this issue.

sblommers commented 8 years ago

But that was pretty fast! :+1: Good to know you did not run into too many problems. I will give it a go as soon as I have time. Fortunately for me I'm on a holiday trip but will report back soon. I should be able to easily swap collection api-s for I did that already.

Update: I couldn't wait and quickly tested PersistentHashMap vs our custom Binary Int Tree based HashMap implementation, and it seems to work like a charm!

Thank you !!

GlenKPeterson commented 8 years ago

Glad to know that PersistentHashMap passed!

As you've seen, the basics seem to work, but I'm still hunting down dribs and drabs. I've been going through some of the default methods on the interfaces to find anything else that I can reasonably make serializable. I just found out that, keySet() on UnmodSortedMap is not serializable (yet - maybe tomorrow or this weekend). Most of the unmod...() methods on FunctionUtils are not serializable.

I learned today that java.util.TreeMap.entrySet().iterator() is not serializable. I can't fix that. But I wanted you to know that like the java.util collections, these collections will probably have limits to what's serializable. I'll try to document that before doing a release, but it's going to be another weekend anyway.

GlenKPeterson commented 8 years ago

Do you need any Iterators to be serializable? That might be where I draw the line because none of these seem to be Serializable in the java.util collections, which makes my interface changes really hard to test. entrySet() and keySet() are serializable in Paguro.

You already know that anonymous functions and classes are not serializable. If your Comparators are declared as anonymous functions or classes (which is common) they are not serializable and they either quietly disappear, or throw an exception on deserialization. That's true with java.util.TreeMap and TreeSet as well. Comparators are usually singletons, and enums are serializable, so implementing them as enums works around this about as well as I know how to do in Java.

GlenKPeterson commented 8 years ago

I've merged all changes to the master branch and released version 2.0.7 to Sonatype/MavenCentral. Each of the 5 main collections (Hash/Tree-Map/Set and Vector) now has a custom serialized form which is completely separate from it's internal representation and designed for minimal storage size. If you save a collection, and upgrade UncleJim/Paguro after I change how that collection is stored in memory, it will still deserialize correctly. All the map.keySet(), .entrySet(), and .values() should be serializable too (verified with tests). There's a complete list in the change log of this project of exactly what's serializable and what's not.

I did not do the FunctionUtils.unmodifiable... methods (yet?). Do you need them?

sblommers commented 8 years ago

Hi Glen, this is absolutely amazing and well implemented. It exactly fits the bill for our use case as long as the 5 main collections are serializable we are good to go. I'll fiddle with it and report any issues back here. Thank you very much.

And I like the name Paguro

GlenKPeterson commented 8 years ago

FYI, there is a class FunctionUtils that has all kinds of methods: unmodList(List<E> ls), unmodSet(Set<E> ss), etc. for converting a very mutable project to an immutable one. These have all been modified to return serializable classes in v.2.0.13. I'm still not planning to make Functions, Iterators, or Mutable collections serializable, but as far as I know, everything else is serializable now.