apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Implement standard Serialization across Lucene versions [LUCENE-1473] #2547

Closed asfimport closed 13 years ago

asfimport commented 15 years ago

To maintain serialization compatibility between Lucene versions, serialVersionUID needs to be added to classes that implement java.io.Serializable. java.io.Externalizable may be implemented in classes for faster performance.


Migrated from LUCENE-1473 by Jason Rutherglen, 1 vote, resolved Jan 24 2011 Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch (versions: 4), lucene-contrib-remote.patch

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Would it take any more lines of code to remove Serializeable from the core classes and re-implement RemoteSearchable in a separate layer on top of the core APIs? That layer could be a contrib module and could get all the externalizeable love it needs. It could support a specific popular subset of query and filter classes, rather than arbitrary Query implementations. It would be extensible, so that if folks wanted to support new kinds of queries, they easily could. This other approach seems like a slippery slope, complicating already complex code with new concerns. It would be better to encapsulate these concerns in a layer atop APIs whose back-compatibility we already make promises about, no?

asfimport commented 15 years ago

Wolf Siberski (migrated from JIRA)

This seems to be the right way to go. The patch attached removes all dependencies to Serializable and Remote from the core and moves it to contrib/remote. I introduced a new interface RemoteSearcher (not RemoteSearchable because I didn't want to pass Weights around), implemented by DefaultRemoteSearcher. An adapter realizing Searchable and delegating to RemoteSearcher is also included (RemoteSearcherAdapter. Encoding/Decoding of Lucene objects is delegated to the org.apache.lucene.remote.Serializer. For a sample serialization, I employed XStream which offers XML serialization (nearly) out-of-the-box. Everything is rather undocumented and would need a lot of cleanup, but as proof-of-concept it should be ok. Core and remote tests pass, with one exception: it is not possible anymore to serialize a RAMDirectory. What I don't like with the current patch is that a lot of different objects are passed around to keep the Searchable interface alive. Would it be possible to refactor such that Searchable represents a higher-level interface (or introduce a new alternative abstraction)?

asfimport commented 15 years ago

Wolf Siberski (migrated from JIRA)

This patch removes all dependencies to Serializable and Remote from the core and adds contrib/remote as replacement

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Thanks Wolf, +1 on the change. This issue proposes to do the same thing: #2481

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Thanks, Wolf, this looks like a promising approach.

Jason, John: would this sort of thing meet your needs?

I'm not sure we can remove everything from trunk immediately. Rather we should deprecate things and remove them in 3.0. The removal of Serializeable will break compatibility, so must be well-advertised.

HitCollector-based search should simply not be supported in distributed search. The Searchable API was designed for remote use and does not include HitCollector-based access.

Weighting, and hence ranking, does not appear to be implemented correctly by this patch. An approach that might work would be to:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

To Wolf: Your patch looked like it was quite a bit of work, nice job! Restricting people to XML will probably not be suitable though. Some may want JSON or something that more directly encodes the objects.

General: It seems the alternative solutions to serialization simply shift the problem around but do not really solve the underlying issues (speed, versioning, writing custom serialization code, and perhaps dynamic classloading). The externalizable code will not be too lengthy and should be more convenient than alternatives to implement (with the code necessary being roughly equivalent to an equals method). For example protocol buffers requires maintaining files that remind me of IDL files from CORBA to describe the objects.

Deprecating serialization entirely needs to be taken to the java-user mailing list as there are quite a number of installations relying on it. If this is something that overlaps with SOLR then it would be good for the SOLR folks to separate it out as a serialization library that could be used outside of the SOLR server. This would be a good idea for most of the SOLR functionality otherwise there would seem to be redundant development occurring.

I'll finish up the Externalizable patch once #2391 is completed (IndexReader.clone) as it is something that needs feedback and testing to ensure it's workable for 2.9, whereas Externalizable is somewhat easier.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> shift the problem around but do not really solve the underlying issues

That's the idea, actually, to shift it out of the core into contrib. We could use Externalizeable there, with no XML.

> Deprecating serialization entirely needs to be taken to the java-user mailing list as there are quite a number of installations relying on it.

No, we make decisions on the java-dev mailing list. Also, it won't go away, folks might just have to update their code to use different APIs if and when when they upgrade to 3.0.

asfimport commented 15 years ago

Wolf Siberski (migrated from JIRA)

Thanks to Doug and Jason for your constructive feedback. Let me first clarify the purpose and scope of the patch. IMHO, the discussion about Serialization in Lucene is not clear-cut at all. My opinion is that moving all distribution-related code out of the core leads to a cleaner separation of concerns and thus is better design. On the other hand with removing Serializable we limit the Lucene application space at least a bit (e.g., no support for dynamic class loading), and abandon the advantages default Java serialization offers. Therefore the patch is to be taken as contribution to explore the design space (as Michaels patch on custom readers explored the Serializable option), and not as a full-fledged solution proposal.

> [Doug] The removal of Serializeable will break compatibility, so must be well-advertised. Sure. I removed Serializable to catch all related errors; this was not meant as proposal for a final patch.

> [Doug] The Searchable API was designed for remote use and does not include HitCollector-based access. Currently Searchable does include a HitCollector-based search method, although the comment says that 'HitCollector-based access to remote indexes is discouraged'. The only reason to provide an implementation is that I wanted to keep the Searchable contract. Is remote access the only purpose of Searchable/MultiSearcher? Is it ok to break compatibility with respect to these classes? IMHO a significant fraction of the current clumsiness in the remote package stems from my attempt to fully preserve the Searchable API.

> [Doug] Weighting, and hence ranking, does not appear to be implemented correctly by this patch. True, I was a bit too fast here. We could either solve it along the line you propose, or revert to pass the Weight again instead of the Query. The issue IMHO is orthogonal to the Serializable discussion and more related to the question how a good remote search interface and protocol should look like.

> [Jason] Restricting people to XML will probably not be suitable though. The patch does not limit serialization to XML. It just requires that encoding to and decoding from String is implemented, no matter how. I used XML/XStream as proof-of-concept implementation, but don't propose to make XML mandatory. The main reason for introduction of the Serializer interface was to emphasize that XML/XStream is just one implemantation option. Actually, the current approach feels like at least one indirection more than required; for a final solution I would try to come up with a better design.

> [Jason] It seems the alternative solutions to serialization simply shift the problem around but do not really solve > the underlying issues (speed, versioning, writing custom serialization code, and perhaps dynamic classloading). In a sense, the problem is indeed 'only' shifted around and not yet solved. The good thing about this shift is that Lucene core becomes decoupled from these issues. The only real limitation I see is that dynamic classloading can't be realized anymore.

With respect to speed, I don't think that encoding/decoding is a significant performance factor in distributed search, but this would need to be benchmarked. With respect to versioning, my patch still keeps all options open. What is more important, Lucene users can now decide if they need compatibility between different versions, and roll their own encoding/decoding if they need it. Of course, if they are willing to contribute and maintain custom serializers which preserve back compatibility, they can do it in contrib as well as they could have done it in the core. Custom serialization is still possible although the standard Java serialization framework can't be used anymore for that purpose, and I admit that this is a disadvantage.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Therefore the patch is to be taken as contribution to explore the design space [ ... ]

Yes, and it is much appreciated for that. Thanks again!

> Currently Searchable does include a HitCollector-based search method [ ... ]

You're right. I misremembered. This dates back to the origin of Searchable.

http://svn.apache.org/viewvc?view=rev&revision=149813

Personally, I think it would be reasonable for a distributed implementation to throw an exception if one tries to use a HitCollector.

> We could either solve it along the line you propose, or revert to pass the Weight again instead of the Query.

Without using an introspection-based serialization like Java serialization it would be difficult to pass a Weight over the wire using public APIs, since most implementations are not public. But, since Weight's are constructed via a standard protocol, the method I outlined could work.

asfimport commented 13 years ago

Jason Rutherglen (migrated from JIRA)

Won't be working on these and they're old