apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Implement standard Serialization across Lucene versions [LUCENE-1473] #2547

Closed asfimport closed 13 years ago

asfimport commented 15 years ago

To maintain serialization compatibility between Lucene versions, a serialVersionUID needs to be added to classes that implement java.io.Serializable. Classes may additionally implement java.io.Externalizable for faster performance.


Migrated from LUCENE-1473 by Jason Rutherglen, 1 vote, resolved Jan 24 2011 Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch (versions: 4), lucene-contrib-remote.patch

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

Term implements Externalizable. Added serialVersionUID handling in the read/writeExternal methods. The long needs to be variable-length encoded to reduce the size of the resulting serialized bytes.

If it looks ok, I will implement Externalizable in other classes.
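To illustrate the approach described here (the actual patch is attached to the JIRA issue and not shown), a minimal sketch of an Externalizable Term-like class with a variable-length version marker might look like the following. The class name, encoding, and layout are assumptions for illustration, not the patch itself:

```java
import java.io.*;

// Sketch: a simplified Term-like class implementing Externalizable.
// A variable-length long version marker costs one byte while the
// version stays below 128, instead of a full 8-byte long.
public class SimpleTerm implements Externalizable {
  private static final int VERSION = 0;

  private String field;
  private String text;

  // Externalizable deserialization calls the public no-arg constructor.
  public SimpleTerm() {}

  public SimpleTerm(String field, String text) {
    this.field = field;
    this.text = text;
  }

  public String field() { return field; }
  public String text() { return text; }

  @Override
  public void writeExternal(ObjectOutput out) throws IOException {
    writeVLong(out, VERSION);   // 1 byte for small versions
    out.writeUTF(field);
    out.writeUTF(text);
  }

  @Override
  public void readExternal(ObjectInput in) throws IOException {
    long version = readVLong(in);
    if (version > VERSION) {
      throw new IOException("unknown serialized version: " + version);
    }
    field = in.readUTF();
    text = in.readUTF();
  }

  // Variable-length long: 7 bits per byte, high bit = "more bytes follow".
  public static void writeVLong(DataOutput out, long v) throws IOException {
    while ((v & ~0x7FL) != 0) {
      out.writeByte((byte) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.writeByte((byte) v);
  }

  public static long readVLong(DataInput in) throws IOException {
    long v = 0;
    int shift = 0;
    byte b;
    do {
      b = in.readByte();
      v |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return v;
  }
}
```

A round-trip through ObjectOutputStream/ObjectInputStream exercises both the version marker and the variable-length encoding.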

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Do we really need to write the serialVersionUID? That's adding 8 bytes to the storage of each term.

The term storage is not particularly efficient when storing many terms in the same field, because eg the String field is not written as an intern'd string.

Also I see many tests failing with this, eg TestBoolean2 – I think we'll have to add:

public Term() {}

so deserialization can work? Which is then sort of annoying because it means it's possible to create a Term with null field & text (though, you can do that anyway by passing in "null" for each).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The serialVersionUID needs to be written if the class is going to evolve. It's written now, and currently in default serialization the field names are also written. We'll need empty constructors.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I share Michael's concerns. What's the motivation for core Lucene classes supporting serialization, and is it strong enough to warrant these changes? It comes with a cost even without the mentioned annoyances, right? (which are only for the current class; there may be more?)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Lucene supports serialization explicitly by implementing Serializable.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The serialVersionUID needs to be written if the class is going to evolve.

Can we use a byte, not long, for serialVersionUID? And maybe change its name to SERIAL_VERSION? (I think serialVersionUID is completely unused once you implement Externalizable?).

This brings up another question: what's our back compat policy here? For how many releases after you've serialized a Term can you still read it back? This is getting complicated... I'm wondering if we shouldn't even go here (ie, make any promise that something serialized in release X will be deserializable on release Y).

I also think serialization is better done "at the top" where you can do a better job encoding things. EG this is the purpose of the TermsDict (to serialize many Terms, in sorted order).

Jason, what's the big picture use case here; what are you serializing?

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Lucene supports serialization explicitly by implementing Serializable.

Right, but we don't really support it (like many/most I would guess). There is a pain in the butt cost of support. Since this patch seems to push that pain around, I'm just wondering if the motivation for it is worth the cost (not knowing the motivation).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Currently serialization between 2.3 and 2.4 is broken; backwards compatibility is broken.

Saying "we don't really support it" means Serializable will not be implemented in any classes.

This is really simple Java stuff, and I'm surprised it is raising concern here.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

The attached patch optimizes java serialization. Also, if we want java serialization to work cross-version, it gives us a leg to stand on. It doesn't change anything for most Lucene users, since Lucene doesn't use java serialization except for RMI. So, today, if you're using RemoteSearcher, and the local and remote versions have different versions of Term.java, things will probably fail, since, by default, the serialVersionUID is a hash of the method signatures and fields. If we want RemoteSearcher to work cross-version, then we need to explicitly manage the serialVersionUID and readExternal/writeExternal implementations. But is that really a priority?

As with all optimizations, the performance improvement should be measured and demonstrated to be significant. Also, an invasive change like this is hard to justify when so little of Lucene depends on java serialization.
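Doug's point that the default serialVersionUID is a hash of the class's structure can be demonstrated directly with the JDK's ObjectStreamClass. The two toy classes below stand in for successive versions of a class like Term (one gains a convenience constructor, as happened to Term); they are illustrations, not Lucene code:

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Demonstration: without a declared serialVersionUID, the JVM derives one
// by hashing the class's fields, methods, and constructors. Adding a
// constructor changes the computed value, so streams written by the old
// class can no longer be read by the new one.
public class SuidDemo {

  static class TermV1 implements Serializable {
    String field;
    String text;
  }

  // "Next release": same data, one extra convenience constructor.
  static class TermV2 implements Serializable {
    String field;
    String text;
    TermV2() {}
    TermV2(String field) { this.field = field; }
  }

  static long suidOf(Class<?> c) {
    return ObjectStreamClass.lookup(c).getSerialVersionUID();
  }

  public static void main(String[] args) {
    // The two computed SUIDs differ even though the serialized field
    // data would be byte-for-byte identical.
    System.out.println("V1: " + suidOf(TermV1.class));
    System.out.println("V2: " + suidOf(TermV2.class));
  }
}
```

Declaring `private static final long serialVersionUID = 1L;` in both versions would pin the value and avoid the mismatch, which is the one-line fix discussed in this issue.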

asfimport commented 15 years ago

John Wang (migrated from JIRA)

The fact that an object implements Serializable implies the object can be serialized. It is known good Java programming practice to include a suid (as a static variable) when a class declares itself Serializable. If it is not meant to be serialized, why did it implement Serializable? Furthermore, what is the reason to avoid it being serialized? I find the cost-of-support reason kind of ridiculous; it could be applied to any bug fix, because at the end of the day this is a bug.

I don't understand the issue of "extra bytes" to the term dictionary if the Term instance is not actually serialized to the index (at least I really hope that is not done)

The serialVersionUID (suid) is a long because it is a java thing. Here is a link to some information on the subject: http://java.sun.com/developer/technicalArticles/Programming/serialization/

Use case: deploying Lucene in a distributed environment, we have a broker/server architecture (standard stuff). We want to roll out search servers with Lucene 2.4 instance by instance. The problem is that the broker is sending a Query object to the searcher via Java serialization at the server level, and the broker is running 2.3. Because of specifically this problem, 2.3 brokers cannot talk to 2.4 search servers even when the Query object was not changed.

To me, this is a very valid use-case. The problem was two different people did the release with different compilers.

At the risk of pissing off the Lucene powerhouse, I feel I have to express some candor. I am growing more and more frustrated with the lack of the open source nature of this project and its unwillingness to work with the developer community. This is a rather trivial issue, and it is taking 7 back-and-forth's to reiterate some standard Java behavior that has been around for years.

Lucene is a great project and has enjoyed great success, and I think it is to everyone's interest to make sure Lucene grows in a healthy environment.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The Spring framework http://www.springframework.org/ is a good example of a widely used open source Java project that implements and uses serialization in most of its classes. If serialization will not be fixed in Lucene, then perhaps it's best to implement a serializable wrapper for Lucene in the Spring project.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> At the risk of pissing off the Lucene powerhouse, I feel I have to express some candor. I am growing more and more frustrated with the lack of the open source nature of this project and its unwillingness to work with the developer community. This is a rather trivial issue, and it is taking 7 back-and-forth's to reiterate some standard Java behavior that has been around for years.

Whoa! I'm sorry if my questions are giving this impression. I don't intend to.

But I do have real questions, still, because I don't think serialization is actually so simple. I too was surprised: what started as a simple patch uncovered some real challenges once I dug into it.

>Use case: deploying lucene in a distributed environment, we have a broker/server architecture. (standard stuff), we want roll out search servers with lucene 2.4 instance by instance. The problem is that the broker is sending a Query object to the searcher via java serialization at the server level, and the broker is running 2.3. And because of specifically this problem, 2.3 brokers cannot to talk to 2.4 search servers even when the Query object was not changed.

OK that is a great use case – thanks. That helps focus the many questions here.

> It is a known good java programming practice to include a suid to the class (as a static variable) when the object declares itself to be Serializable.

But that alone gives a too-fragile back-compat solution because it's too coarse. If we add field X to a class implementing Serializable, and must bump the SUID, that's a hard break on back compat. So really we need to override read/writeObject() or read/writeExternal() to do our own versioning.

Consider this actual example: RangeQuery, in 2.9, now separately stores "boolean includeLower" and "boolean includeUpper". In versions <= 2.4, it only stores "boolean inclusive". This means we can't rely on the JVM's default versioning for serialization.

> The serialVersionUID (suid) is a long because it is a java thing.

But that's only if you rely on the JVM's default serialization. If we implement our own (overriding read/writeObject or read/writeExternal), we don't have to use a "long SUID".

> The problem was two different people did the release with different compilers.

I think it's more likely the addition of a new ctor to Term (that takes only String field), that changed the SUID.

> If it is not meant to be serialized, why did it implement Serializable?

Because there are two different things it can "mean" when a class implements Serializable, and I think that's the core disconnect/challenge to this issue.

The first meaning (let's call it "live serialization") is: "within the same version of Lucene you can serialize/deserialize this object".

The second meaning (let's call it "long-term persistence") is: "you can serialize this object in version X of Lucene and later deserialize it using a newer version Y of Lucene".

Lucene, today, only guarantees "live serialization", and that's the intention when "implements Serializable" is added to a class.

But, what's now being asked for (expected) with this issue is "long-term persistence", which is really a very different beast and a much taller order. With it comes a number of challenges that warrant scrutiny.

These issues are the same issues we face with the index file format, because that is also long-term persistence.
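The RangeQuery example above suggests the kind of hand-rolled stream versioning Michael describes: old streams carried a single "inclusive" flag, newer ones carry separate lower/upper flags, and readExternal must branch on a format marker. A hedged sketch, with an invented class name and stream layout (not Lucene's actual code):

```java
import java.io.*;

// Sketch of version-aware Externalizable handling, modeled on the
// RangeQuery change: format 0 had one "inclusive" flag for both ends,
// format 1 stores includeLower and includeUpper separately.
public class VersionedRange implements Externalizable {
  private static final byte FORMAT = 1;   // current stream format

  private boolean includeLower;
  private boolean includeUpper;

  public VersionedRange() {}              // required for Externalizable

  public VersionedRange(boolean includeLower, boolean includeUpper) {
    this.includeLower = includeLower;
    this.includeUpper = includeUpper;
  }

  public boolean includeLower() { return includeLower; }
  public boolean includeUpper() { return includeUpper; }

  @Override
  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeByte(FORMAT);                // a byte, not an 8-byte SUID
    out.writeBoolean(includeLower);
    out.writeBoolean(includeUpper);
  }

  @Override
  public void readExternal(ObjectInput in) throws IOException {
    byte format = in.readByte();
    switch (format) {
      case 0:
        // Old stream: one flag applied to both ends.
        boolean inclusive = in.readBoolean();
        includeLower = inclusive;
        includeUpper = inclusive;
        break;
      case 1:
        includeLower = in.readBoolean();
        includeUpper = in.readBoolean();
        break;
      default:
        throw new IOException("unknown stream format: " + format);
    }
  }
}
```

New releases keep the old `case` branches alive, which is exactly the ongoing maintenance cost being debated in this thread.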

asfimport commented 15 years ago

robert engels (migrated from JIRA)

I don't see why you can't just break compatibility between versions when dealing with Serialization. Just have it continue to mean live (or close to live) persistence.

Even the JDK does this (e.g. Swing serialization makes no guarantees). Just do the same - bound to change between releases...

Also, different compilers will generate different SUID... usually due to synthetic methods. It's kind of a problem...

asfimport commented 15 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

For the record: i have limited understanding of java serialization issues...

At the risk of pissing off the Lucene powerhouse, I feel I have to express some candor. I am growing more and more frustrated with the lack of the open source nature of this project and its unwillingness to work with the developer community.

The developer community consists of hundreds (possibly thousands) of people, who participate at various levels. At the time the above quoted comment was posted, 4 members of the community had expressed an opinion on this issue: 1 clearly in favor, and 3 questioning the advantages and disadvantages as they affect the whole community, both in terms of the performance impacts for existing use cases, and the long term support issues that might come from a change like this.

How anyone could consider these comments and questions "unwillingness to work with the developer community" blows my mind ... i do not see an overwhelming push by the community at large for a feature, i do not see a "Lucene powerhouse" arguing against the will of the masses ... I see two people arguing in favor of a change, and three questioning whether this change is a good idea (i'm not sure if i understand robert's post fully, i believe he's suggesting we maintain the status quo such that serialization is supported but no claims are made about back-compatible serialization).

I would define that as healthy discussion.

This, to me, seems to be the crux of the issue...

Lucene, today, only guarantees "live serialization", and that's the intention when "implements Serializable" is added to a class.

But, what's now being asked for (expected) with this issue is "long-term persistence", which is really a very different beast and a much taller order. With it comes a number of challenges that warrant scrutiny.

...this jibes with even my limited experience with java serialization, and i share Michael's concerns. The current behavior does not appear to be a bug; it appears to be the correct effect of a serializable class changing between two versions w/o any explicit policy on serialization compatibility. The changes being discussed seem to be a request for a new "feature": a back-compat commitment on the serialization of one (or more) classes. However small the patch may be, it's still a significant change that should not be made lightly ... I certainly think all 4 of Michael's bullet points should be clearly addressed before committing anything, and I agree with Doug's earlier observation regarding performance: we need to test that a new serialization strategy won't adversely affect existing users who rely on (what Michael referred to as) "live serialization".

This is my opinion. I voice it not as a member of any sort of mythical "Lucene powerhouse" but as a member of the Lucene community who is concerned about making sure that decisions are made towards "everyone's interest to make sure Lucene grows in a healthy environment" – not just the interests of two vocal people.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

> the performance improvement should be measured and demonstrated to be significant

The initial concern was the incompatibility of serialized objects between Lucene versions. The performance improvements created by using Externalizable are secondary, so providing tests would be a waste of time if the "community" believes it is too much effort to add 1 line of code to a handful of classes. Implementing Externalizable is a way to reduce the size of the serialized objects, manage the serialized object versions, and provide performance improvements. Externalizable provides the most benefits and is very similar to the system Hadoop uses with Writable. Externalizable works seamlessly with native object serialization and Serializable-implementing classes, meaning it works with a number of existing Java classes in addition to Externalizable classes.

Using distributed serialized objects for search in Lucene is a natural Java based way to run a Lucene system. In many cases it is ideal because Java provides something C++ does not, dynamic in-process class loading. In a large grid based search system that requires 100% uptime this feature can be particularly useful.

Adding a serialVersionUID to the classes is one option, adding Externalizable is another option.

If the decision is to not support Serialization in Lucene then I recommend removing Serializable from all classes in Lucene 3.0 so that users do not mistakenly expect the search library to behave the way other Java libraries such as ICU4J, JDK class libraries, Spring, etc do.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> But, what's now being asked for (expected) with this issue is "long-term persistence", which is really a very different beast and a much taller order.

That's the crux, alright. Does Lucene want to start adding cross-version guarantees about the durability of its objects when serialized by Java serialization? This is a hard problem. Systems like Thrift and ProtocolBuffers offer support for this, but Java Serialization itself doesn't really provide much assistance. One can roll one's own serialization compatibility story manually, as proposed by this patch, but that adds a burden to the project. We'd need, for example, test cases that keep serialized instances from past versions, so that we can be sure that patches do not break this.

The use case provided may not use RMI, but it is similar: it involves transmitting Lucene objects over the wire between different versions of Lucene. Since Java APIs, like Lucene, do not generally provide cross-version compatibility, it would be safer to architect such a system so that it controls the serialization of transmitted instances itself and can thus guarantee their compatibility as the system is updated. Thus it would develop its own representations for queries independent of Lucene's Query, and map this to Lucene's Query. Is that not workable in this case?
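The regression tests Doug mentions might take roughly this shape: a blob serialized by an earlier release is checked into the test sources (typically as a base64 string or a resource file), and every new release must still deserialize it. The sketch below generates the blob in-process purely to show the shape; the class name and helper names are invented:

```java
import java.io.*;
import java.util.Base64;

// Sketch of a back-compat serialization test: "capture" produces the kind
// of base64 blob that would be recorded when release X ships, and "thaw"
// is what every later release must still be able to do with it.
public class BackCompatSerializationTest {

  static String capture(Serializable obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(obj);
    }
    return Base64.getEncoder().encodeToString(bos.toByteArray());
  }

  static Object thaw(String base64) throws IOException, ClassNotFoundException {
    byte[] bytes = Base64.getDecoder().decode(base64);
    try (ObjectInputStream ois =
             new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return ois.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    // In a real test, this constant would have been captured when the old
    // release shipped; here it is created in-process for illustration.
    String blobFromOldRelease = capture("field:value");
    Object query = thaw(blobFromOldRelease);
    if (!"field:value".equals(query)) throw new AssertionError();
  }
}
```

The maintenance burden Doug describes lives in tests like this: one recorded blob per serializable class per released version.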

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

If it is not meant to be serialized, why did it implement Serializable? Furthermore, what is the reason to avoid it being serialized? I find the cost-of-support reason kind of ridiculous; it could be applied to any bug fix, because at the end of the day this is a bug.

The "implements Serializable" was added to support RemoteSearchable. If we believe this creates a bug, then perhaps we should remove this and implement RemoteSearchable in another way. As it stands, Lucene does not support Java serialization across Lucene versions. That seems to me more like a limitation than a bug, no?

Every line of code added to Lucene is a support burden, so we must carefully weigh the costs and benefits of each line. This issue proposes to add many lines, and to add a substantial new back-compatibility requirement. Back-compatibility is something that Lucene takes seriously. We make promises about both API back-compatibility and file-format back-compatibility. These already significantly constrain development. Adding a new back-compatibility requirement should not be done lightly, but only after broad consensus is reached through patient discussion.

asfimport commented 15 years ago

robert engels (migrated from JIRA)

In regards to Doug's comment about an alternate form... doesn't SOLR already have a XML based query format?

If so, just persist the queries using this. You will be immune to serialization changes (provided the SOLR parser remains backwards compatible).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The "implements Serializable" was added to support RemoteSearchable. If we believe this creates a bug, then perhaps we should remove this and implement RemoteSearchable in another way. As it stands, Lucene does not support Java serialization across Lucene versions. That seems to me more like a limitation than a bug, no?

There will be complaints no matter what. GWT tried getting around people having to implement Serializable by providing an interface with fewer promises: IsSerializable. Many complained right away, as they had other classes that perhaps they were making Serializable simply for Apache XMLRpc or something. So now you can use either Serializable or IsSerializable.

Personally, I think it's fine to do as we are. I'm not against supporting more, though.

If we choose not to go further (and from what I can tell, that decision has not been made yet, for or against), we should add to the javadocs what we support, as I don't think it's a bug myself. The Serializable interface indicates that the class and its subclasses will be serializable; my reading of the javadoc does not indicate what cross-version compatibility must be supported. I believe that is up to the implementor.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

"This is a hard problem."

I disagree. It's completely manageable. Doesn't Hadoop handle versioning inside of Writable classes?

ScoreDocComparator javadoc "sortValue(ScoreDoc i) Returns the value used to sort the given document. The object returned must implement the java.io.Serializable interface. This is used by multisearchers to determine how to collate results from their searchers."

This kind of statement in the code leads one to believe that Lucene supports Serialization. Maybe it should be removed from the Javadocs.

"Thrift and ProtocolBuffers" don't support dynamic class loading. If one were to create one's own Query class with custom code, serializing is the only way to represent the Query object and have Java load the additional implementation code. One easy-to-see use case: if Analyzer were made Serializable, then indexing over the network and trying different analysis techniques could be accomplished with ease in a grid computing environment.

"representations for queries independent of Lucene's Query, and map this to Lucene's Query. Is that not workable in this case?"

Mike wrote "if we add field X to a class implementing Serializable, and must bump the SUID, that's a hard break on back compat. "

There need to be "if statements" in readExternal to handle backwards compatibility. Given the number of classes, and the number of fields, this isn't very much work. Neither are the test cases. I worked on RMI and Jini at Sun and elsewhere. I am happy to put forth the effort to maintain and develop this functionality. It is advantageous to place this functionality directly into the classes because, in my experience, many of the Lucene classes do not make all of the field data public, and things like dedicated serialization such as the XML query code are voluminous. Also, the half-support for serialization right now seems to indicate there really isn't support for it.

Hoss wrote: "any sort of mythical 'Lucene powerhouse'". Lucene seems to run itself quite differently than other open source Java projects. Perhaps it would be good to spell out the reasons for the reluctance to move ahead with features that developers work on, that work, but do not go in. Developer contributions seem to be quite low right now, especially compared to neighbor projects such as Hadoop. Is this because fewer people are using Lucene? Or is it due to the reluctance to work with the developer community? Unfortunately, in the eyes of some people who work on search-related projects, the perception is the latter.

Many developers seem to be working outside of Lucene and choosing not to open source in order to avoid going through the current hassles of getting code committed to the project.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

"In regards to Doug's comment about an alternate form... doesn't SOLR already have a XML based query format? If so, just persist the queries using this. You will be immune to serialization changes (provided the SOLR parser remains backwards compatible)."

SOLR does not have an XML based query format. XML is not ideal for distributed search because it is slow and verbose. There are many ways to serialize things, the issue is not in choosing one, but in supporting what most Java libraries do today which is native to the Java platform.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Mark: "There will be complaints no matter what. GWT tried getting around people having to implement Serializable by providing an interface with fewer promises: IsSerializable. Many complained right away, as they had other classes that perhaps they were making Serializable simply for Apache XMLRpc or something. So now you can use either Serializable or IsSerializable.

Personally, I think it's fine to do as we are. I'm not against supporting more, though."

Externalizable and Serializable work interchangeably, a nice feature of Java. For classes that no one submits an Externalizable patch for, the serialVersionUID needs to be added. For ones that implement Externalizable, there is slightly more work, but not something someone with a year of Java experience can't maintain.
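The interchangeability Jason describes can be seen in a small example: a plain Serializable container holding an Externalizable member, with a single ObjectOutputStream handling both mechanisms in the same stream. Both classes below are toys invented for illustration, not Lucene code:

```java
import java.io.*;

// Demonstration: default (Serializable) and hand-written (Externalizable)
// serialization mix freely in one object graph and one stream.
public class MixedSerialization {

  public static class ExternalPart implements Externalizable {
    String value;
    public ExternalPart() {}                     // required no-arg constructor
    public ExternalPart(String value) { this.value = value; }
    public void writeExternal(ObjectOutput out) throws IOException {
      out.writeUTF(value);                       // hand-written format
    }
    public void readExternal(ObjectInput in) throws IOException {
      value = in.readUTF();
    }
  }

  public static class Container implements Serializable {
    private static final long serialVersionUID = 1L;
    ExternalPart part;                           // Externalizable member
    String label;                                // default serialization
  }

  // Serialize a Container and deserialize it back, returning its contents.
  static String roundTrip() throws Exception {
    Container c = new Container();
    c.part = new ExternalPart("payload");
    c.label = "mixed";

    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(c);
    }
    Container back = (Container) new ObjectInputStream(
        new ByteArrayInputStream(bos.toByteArray())).readObject();
    return back.label + ":" + back.part.value;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(roundTrip());             // prints "mixed:payload"
  }
}
```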

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Doesn't Hadoop handle versioning inside of Writeable classes?

Currently, yes, but this is probably insufficient for a Hadoop 1.0 release. Hadoop is a distributed system and would like to provide RPC back-compatibility across minor versions after we go 1.0. This is an explicit decision that's in the process of discussion within the Hadoop project. RPCs within Hadoop daemons will probably require identical versions – all daemons in a cluster must be upgraded in lockstep. We'll thus probably limit back-compatibility to client RPCs, so that a given client can talk to multiple clusters that are not running identical versions of Hadoop. Lucene has made no such explicit policy decision. Lucene is not an inherently distributed system.

Hadoop has not yet decided what mechanism to use to support back-compatible RPC, but Writable versioning is not sufficient, since it does not handle RPC protocols, and it's lower-level than we'd prefer. We'll probably go with something more like Thrift or ProtocolBuffers. Hadoop does not use Java serialization and makes no promises about that.

> The developer contributions seem to be quite low right now, especially compared to neighbor projects such as Hadoop.

As one who monitors both projects, I don't see a marked difference. In both there are sometimes patches that unfortunately languish, because, while they're important to the contributor, they fail to sufficiently engage a committer. For example, HADOOP-3422 took over 6 months to get committed, probably because not many committers use Ganglia.

There is a difference in quantity: hadoop-dev has over 4x the traffic of lucene-dev. But, normalized for that, the number of patches from non-committers feels comparable. If anything I'd guess Lucene commits more patches from non-committers than does Hadoop.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

The documentation should probably be fixed to state that Lucene's use of Serializable currently assumes that all parties are using the exact same version of Lucene. That's the default for Serializable, but it probably bears stating explicitly. Then we should decide, going forward, whether this should change, and, if so, for which classes and how. Such a policy should be agreed on before code is written, no?

asfimport commented 15 years ago

robert engels (migrated from JIRA)

Jason, you are only partially correct.

SOLR has XML definitions for updates/commits/deletes. It also supports string based queries. It also supports using XML for queries if you provide a handler, but the string syntax is simpler.

As for the serialization performance, you are mistaken.

For things like searches, the parsing time is extremely minor compared with the search time, so this additional overhead would be a fraction of the cost.

When returning result sets, using XML can make a huge difference, as the overhead is paid on every item. Even with modern XML processors, the search time is still going to be the overriding performance factor by a huge margin. Typically "paged" results are also used, so again, the XML parsing, compared to the network overhead, is going to be minor.

Still, if it is only for temporary serialization, binary works best - as long as it is Java to Java.

We have a search server that uses Java serialization for the message passing, including the results. It can be done without any changes to Lucene - again, the overriding performance factor is the search itself (unless the query returns 100k+ documents and all are being returned - then the Java serialization time can be more than the search itself).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Robert: > using XML for queries if you provide a handler,

That doesn't sound like query serialization.

SOLR has a binary protocol due to criticisms about XML being slow.

I'm not sure why you and Doug are focusing on performance when that is not really the main issue I brought up.

Also I'm confused as to why dynamic classloading is being ignored by you folks as a Java feature that a Java search library could take advantage of to differentiate itself in the search (closed and open source) marketplace.

asfimport commented 15 years ago

robert engels (migrated from JIRA)

The reason XML is not needed is that the string format is robust enough, and simpler...

I am not focused on performance. It is just that Java serialization works well for temporary persistence. Other formats are better for long-term persistence. If you are only doing temporary persistence, you don't need backwards compatibility.

Also if the API is exposed to the world (i.e. non-Java), a human readable (or even binary) format that is not based on Java serialization is going to work much better.

If you use a non-binary protocol, it is far easier to extend it in the future and retain the ability to easily read older versions. This is why Swing uses XML for serialization, and not binary.

You could certainly use specialized class loaders to load old versions of classes in order to maintain the ability to read old now incompatible classes... it is just a lot of work (maintenance too, need to keep the old code around... etc.) for not a lot of benefit.

As for SOLR's binary protocol, fine, but it is probably for a fringe use case, or the submitter didn't do real-world tests... The XML parsing is just not that much greater than binary (at least in Java, since it is the object creation they both use that affects it). The search time is going to be far greater.

For large updates a binary loader can be more efficient than XML, but if you test it using real-world examples, I doubt you will see a huge difference - at least for the types of applications TYPICALLY written using Lucene.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> I'm not sure why you and Doug and focusing on performance when that is not really the main issue I brought up.

In the description of this issue you claim, "This will make Serialization faster". I only added that, if this is a motivation, then it should be benchmarked and quantified. If it's not a motivation, and cross-version compatibility is the only motivation, then that should be clarified.

> dynamic classloading is being ignored by you folks

Perhaps most Lucene developers are not using dynamic class loading, just as most Lucene developers seem not to be relying on java serialization. Perhaps you can convince them to start using it, but if not, you may need to find a way to get Lucene to support it that minimally impacts other users of Lucene. Adding methods to Term and Query that must be changed whenever these classes change adds a cost. If folks don't see much benefit, then the cost outweighs the benefit. Perhaps you can better enlighten us to the benefits rather than assert willful ignorance? Jini promised great things with dynamic class loading, but the list of folks that use Jini is not long (http://www.jini.org/wiki/Who_uses_Jini%3F).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The discussion has evolved out of scope. Cross-version compatibility is the main goal. We have multiple versions of Lucene using the Spring RPC protocol. The standard way to solve this is to add a serialVersionUID, as HashMap, String, and the other Java classes do.
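Pinning a serialVersionUID, as described above, might look like the following sketch. The class is hypothetical (not Lucene's actual Term); the point is only where the declaration goes and what it buys:

```java
import java.io.Serializable;

// Hypothetical class illustrating a pinned serialVersionUID. Without the
// explicit declaration, the JVM derives a UID from the class shape, so
// even a binary-compatible change (e.g. adding a method) would make
// streams written by the old class fail with InvalidClassException.
public class SimpleTerm implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String field;
    private final String text;

    public SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    public String field() { return field; }
    public String text() { return text; }
}
```

With the UID pinned, streams remain readable across class revisions as long as the serialized field layout itself stays compatible.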

Performance is a topic of concern for Lucene users and existing RMI/Serialization users would transparently benefit by Externalizable being used.

I would like to implement this as a separate project, however performing reflection on each of the objects is not an efficient approach. Writing wrapper code for each and every variation of Query is a waste of time when it implements Serializable and the communication is between Java systems.

It seems best to remove Serialization from Lucene so that users are not confused and create a better solution.

> Perhaps most Lucene developers are not using dynamic class loading,

Dynamic classloading is popular and accepted in J2EE servers. If done properly it is a very convenient way of deploying Java based systems. Jini did not make this convenient enough. Jini did not have a specific problem it was trying to solve and so was too complex and open ended. Spring and J2EE make use of Java serialization for distributed objects. People may not be using Lucene for this today but this is due largely to lack of support for things like standard Serialization. With Lucene it is possible to make dynamic search classloading convenient in a search grid environment.

When J2EE was designed, no one was using Java on the server side. A framework was composed, a handful of companies implemented the specification, and it then found its way into projects. If you wait for users to ask for something like this that does not yet exist, it will not happen.

The interface Lucene exposes is relatively static and known. All of the top-level classes, Query, Analyzer, Document, Similarity, do not change very often. In a search-based grid computing environment, the ability to execute arbitrary code against the cloud saves time and effort in deploying new code to servers. Restarting processes in a production environment is always expensive. If one is implementing, for example, mutating genetic algorithms in Java over Lucene, then it would be advantageous to dynamically load the classes that implement this. They would be modifications of several classes such as Query and Similarity.

It is peculiar that so much effort goes into backwards compatibility of the index while for Serialization it is ignored. This is and will be very confusing to users, especially ones who use J2EE.

asfimport commented 15 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

The contrib section of Lucene contains an XML-based query parser which aims to provide full-coverage of Lucene queries/filters and provide extensibility to support 3rd party classes. I use this regularly in distributed deployments and this allows both non-Java clients and long-term persistence of queries with good stability across Lucene versions. Although I have not conducted formal benchmarks I have not been drawn to XML parsing as a bottleneck - search execution and/or document retrieves are normally the main bottlenecks.

Maintaining XML parsing code is an overhead but ultimately helps decouple requests from the logic that executes requests. In serializing Lucene Query/Filter objects we are dealing with the classes which combine both the representation of the request criteria (what needs to be done) and the implementation (how things are done). We are forever finessing the "how" bit of this equation, e.g. moving from RangeQuery to RangeFilters to TrieRangeFilter. The criteria however remain relatively static ("I just want to search on a range") and so it is dangerous to build clients that refer directly to query implementation classes. The XML parser provides a language-independent abstraction for clients to define what they want to be done without being too tied to how this is implemented.

Cheers Mark

asfimport commented 15 years ago

robert engels (migrated from JIRA)

Even better. Thanks Mark.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> The contrib section of Lucene contains an XML-based query parser which aims to provide full-coverage of Lucene queries

Thanks for the reminder... Solr has pluggable query parsers now, and I've been meaning to check this out as a way to provide a more programmatic query specification.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> It seems best to remove Serialization from Lucene so that users are not confused and create a better solution.

I don't think that's the case. If we choose to only support "live serialization" then we should add "implements Serializable" but spell out clearly in the javadocs that there is no guarantee of cross-version compatibility ("long term persistence") and in fact that often there are incompatibilities.

I think "live serialization" is still a useful feature.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> For classes that no one submits an Externalizable patch for, the serialVersionUID needs to be added.

The serialVersionUID approach would be too simplistic, because we can't simply bump it up whenever we make a change since that then breaks back compatibility. We would have to override write/readObject or write/readExternal, and serialVersionUID would not be used.
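The override approach described here could be sketched as follows. The class, fields, and version constants are illustrative, not Lucene code: an explicit format version is written first, and readObject branches on it rather than relying on serialVersionUID alone.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical sketch: serialVersionUID stays fixed forever, while an
// explicit format version inside the stream handles evolution.
public class VersionedQuery implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final int FORMAT_VERSION = 2;

    private float boost = 1.0f;       // present since format version 1
    private transient int cachedHash; // derived state, never serialized

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.writeInt(FORMAT_VERSION);
        out.writeFloat(boost);
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        int version = in.readInt();
        if (version > FORMAT_VERSION) {
            throw new IOException("stream written by a newer version: " + version);
        }
        boost = in.readFloat();
        // fields added in later format versions would be read
        // conditionally here, with defaults for older streams
    }

    public float getBoost() { return boost; }
}
```

Because the serialization machinery invokes the private readObject/writeObject hooks, callers use plain ObjectOutputStream/ObjectInputStream and get the versioned format transparently.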

asfimport commented 15 years ago

John Wang (migrated from JIRA)

Mike:

   If you have class A implements Serializable, with a defined suid, say 1.

   Let A2 be a newer version of class A, with the suid unchanged, still 1.

   Let's say A2 has a new field.

   Imagine A is running in VM1 and A2 is running in VM2. Serialization between VM1 and VM2 of class A is ok, just that A will not get the new field. Which is fine, since VM1 does not make use of it.

   You can argue that A2 will not get the needed field from serialized A, but isn't that better than crashing?

   Either way, I think the behavior is better than it is currently. (Maybe that's why Eclipse and FindBugs both report the lack of an suid definition in Lucene code as a warning.)

   I agree adding an Externalizable implementation is more work, but it would make the serialization story correct.

-John

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Serialization between VM1 and VM2 of class A is ok, just that A will not get the new fields. Which is fine since VM1 does not make use of it.

But VM1 might require an older field that the new field replaced, and VM1 may then crash in an unpredictable way. Not defining explicit suid's is more conservative: you get a well-defined exception when things might not work. Defining suid's but doing nothing else about compatibility is playing fast-and-loose: it might work in many cases, but it also might cause strange, hard-to-diagnose problems in others. If we want Lucene to work reliably across versions, then we need to commit to that goal as a project, define the limits of the compatibility, implement Externalizeable, add tests, etc. Just adding suid's doesn't achieve that, so far as I can see.

asfimport commented 15 years ago

robert engels (migrated from JIRA)

Even if you changed SUIDs based on version changes, there is the very real possibility that the new code CAN'T be instantiated in any meaningful way from the old data. Then what would you do?

Even if you had all of the old classes, and their dependencies available from dynamic classloading, it still won't work UNLESS every new feature is designed with backwards compatibility with previous versions - a burden that is just too great when required of all Lucene code.

Given that, as has been discussed, there are other formats that can be used where isolated backwards persistence is desired (like XML based query descriptions). Even these won't work if the XML description references explicit classes - which is why designing such a format for a near limitless query structure (given user defined query classes) is probably impossible.

So strive for a decent solution that covers most cases, and fails gracefully when it can't work.

Using standard serialization (with proper transient fields) seems to fit this bill, since in a stable API most core classes should remain fairly constant, and those that are bound to change may take explicit steps in their serialization (if deemed needed).
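The transient-field pattern mentioned above can be sketched like this (the class and fields are hypothetical): derived or heavy state is excluded from the stream and rebuilt lazily after deserialization.

```java
import java.io.Serializable;

// Hypothetical sketch: only the essential fields travel over the wire;
// cachedString is transient and is recomputed on first use after
// deserialization, so it never needs to stay format-compatible.
public class ScoredDoc implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int doc;
    private final float score;
    private transient String cachedString; // rebuilt lazily, never sent

    public ScoredDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }

    public String describe() {
        if (cachedString == null) {
            cachedString = "doc=" + doc + " score=" + score;
        }
        return cachedString;
    }
}
```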

asfimport commented 15 years ago

John Wang (migrated from JIRA)

The discussion here is whether it is better to fail 100% of the time vs. 10% of the time (these are just illustrative numbers to express a point). I do buy Doug's comment about getting into a weird state due to data serialization, but this is something Externalizable would solve. This discussion has digressed into general Java serialization design, where it was originally scoped to only several Lucene classes.

If it is documented that Lucene only supports serialization of classes from the same jar, is that really enough? Doesn't it also depend on the compiler, if someone were to build their own jar?

Furthermore, in a distributed environment with lots of machines, it is always ideal to upgrade bit by bit. Is taking this functionality away by imposing this restriction a good trade-off compared to just implementing Externalizable for a few classes, if Serializable is deemed to be dangerous - which I am not so sure of, given the Lucene classes we are talking about?

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> This discussion has digressed to general Java serialization design, where it originally scoped only to several lucene classes.

Which classes? The existing patch applies to one class. Jason said, "If it looks ok, I will implement Externalizable in other classes." but never said which. It would be good to know how wide the impact of the proposed change would be.

asfimport commented 15 years ago

John Wang (migrated from JIRA)

For our problem, it is Query and all its derived and encapsulated classes. I guess the title of the bug is too generic.

As far as my comment about other Lucene classes, one can just go to the Lucene javadoc, click on "Tree", and look for Serializable. If you want me to, I can go and fetch the complete list, but here are some examples:

1) Document (Field etc.)
2) OpenBitSet, Filter ...
3) Sort, SortField
4) Term
5) TopDocs, Hits etc.

For the top level API.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

serialVersionUID added to the relevant classes manually. Defaulted to 10 because the value does not matter, as long as it is different between versions. Thought of writing some code to go through the Lucene JAR, do an instanceof check on the classes for Serializable, and then verify that the serialVersionUID is 10.

Term implements Externalizable.

SerializationUtils was adapted from WriteableUtils of Hadoop for writing VLong.
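A variable-length long encoding in the spirit of Hadoop's WritableUtils might look like the sketch below. This is a simplified unsigned base-128 varint for illustration, not necessarily the exact format in the attached patch:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical varint helper: small non-negative longs take one byte
// instead of eight. Each byte carries 7 payload bits; the high bit is a
// continuation flag.
public final class VLong {
    private VLong() {}

    public static void write(DataOutput out, long v) throws IOException {
        while ((v & ~0x7FL) != 0) {
            out.writeByte((int) ((v & 0x7F) | 0x80)); // low 7 bits + continue flag
            v >>>= 7;
        }
        out.writeByte((int) v); // final byte, high bit clear
    }

    public static long read(DataInput in) throws IOException {
        long value = 0;
        int shift = 0;
        while (true) {
            byte b = in.readByte();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value; // no continuation flag: done
            shift += 7;
        }
    }
}
```

A serialVersionUID written this way costs one byte for typical values instead of the eight bytes Mike objected to earlier in the thread.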

TestSerialization use case does term serialization and serializes an arbitrary query to a file and compares them.

TODO:

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

SerializeUtils is missing from the patch.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> How to write a unit test for multiple versions?

We can save, in files, serialized instances of each query type from the oldest release we intend to support. Then read each of these queries and check that it is equal to a current query that's meant to be equivalent (assuming all queries implement equals well). Something similar would need to be done for each class that is meant to be transmitted cross-version.

This tests that older queries may be processed by newer code. It does not test that newer queries can be processed by older code. Documentation is a big part of this effort that should be completed first. What guarantees do we intend to provide? Once we've documented these, then we can begin writing tests. For example, we may only guarantee that older queries work with newer code, and that newer hits work with older code. To test that we'd need to have an old jar around that we could test against. This will be a trickier test to configure.
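The golden-file scheme sketched above could look like this in outline (the helper class and file path are illustrative; in a real test the bytes would come from files checked in from an old release):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical back-compat check: deserialize a golden object (in a real
// test, bytes saved by an older release) and compare it with an
// equivalent object built by the current code, relying on equals().
public class BackCompatSerializationCheck {

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    // In a real test the golden bytes would come from a checked-in file,
    // e.g. Files.readAllBytes(Path.of("src/test/data/termquery-2.4.bin")).
    public static void check(Object golden, Object current)
            throws IOException, ClassNotFoundException {
        Object roundTripped = deserialize(serialize(golden));
        if (!roundTripped.equals(current)) {
            throw new AssertionError("deserialized object differs from current");
        }
    }
}
```

Testing the reverse direction (newer objects read by older code) would, as noted, require running an old jar, which this sketch does not cover.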

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

Added Externalizable to Document, Field, AbstractField (as compared to the previous patch). SerializationUtils is included.

TODO:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Doug wrote: We can save, in files, serialized instances of each query type from the oldest release we intend to support. Then read each of these queries and check that it is equal to a current query that's meant to be equivalent (assuming all queries implement equals well). Something similar would need to be done for each class that is meant to be transmitted cross-version.

This tests that older queries may be processed by newer code. It does not test that newer queries can be processed by older code. Documentation is a big part of this effort that should be completed first. What guarantees do we intend to provide? Once we've documented these, then we can begin writing tests. For example, we may only guarantee that older queries work with newer code, and that newer hits work with older code. To test that we'd need to have an old jar around that we could test against. This will be a trickier test to configure.


Makes sense. I guarantee 2.9 and above classes will be backward compatible with the previous classes. I think that for 3.0 we'll start to create new replacement classes that will not conflict with the old classes. I'd really like to redesign the query, similarity, and scoring code to work with flexible indexing and allow new algorithms. This new code will not create changes in the existing query, similarity, and scoring code which will remain serialization compatible with 2.9. The 2.9 query, similarity, and scoring should leverage the new query, similarity and scoring code to be backwards compatible.

asfimport commented 15 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> I guarantee 2.9 and above classes will be backward compatible with the previous classes.

It sounds like you are personally guaranteeing that all serializable classes will be forever compatible. That's not what we'd need. We'd need a proposed policy for the project to consider in terms of major and minor releases, specifying forward and/or backward compatibility guarantees. For example, we might say, "within a major release cycle, serialized queries from older releases will work with newer releases, however serialized queries from newer releases will not generally work with older releases, since we might add new kinds of queries in the course of a major release cycle". Similarly detailed statements would need to be made for each Externalizable, no?

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

LUCENE-1473.patch

Added some more Externalizables.

o.a.l.util.Parameter is peculiar in that it implements readResolve to override the serialization and return a local object to emulate enums. I haven't figured out the places this is used and what the best approach is to externalize them.

TODO:

Doug wrote: "'within a major release cycle, serialized queries from older releases will work with newer releases, however serialized queries from newer releases will not generally work with older releases, since we might add new kinds of queries in the course of a major release cycle.' Similarly detailed statements would need to be made for each Externalizable, no?"

Serialized objects in minor releases will work. Serialized objects of older versions starting with 2.9 will be compatible with newer versions. New versions will be compatible with older versions on a class-by-class basis defined in the release notes. It could look something like this:

Serialization notes: BooleanQuery added a scoreMap variable that does not have a default value in 3.0 and is now not backwards compatible with 2.9.
PhraseQuery added a ultraWeight variable that defaults to true in 3.0 and is backwards compatible with 2.9.

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

I really wouldn't want to add another backwards-compatibility requirement to Lucene, as the others already stated. In the past, ensuring backwards-compatibility was often the part of writing patches that took the longest and involved the most discussions.

But maybe we can come up with a fair compromise here. What if we change the classes that currently implement Serializable, so that they implement Externalizable, and add a suid as Jason suggests. But we make clear in the javadocs that we don't support backwards-compatibility here, so e.g. a Term externalized with Lucene 2.9 can't be read with 3.0, only with 2.9. Then we add a new class CustomExternalizableReader to util:

public abstract class CustomExternalizableReader {
  public abstract void readExternal(Object obj, ObjectInput in)
      throws IOException, ClassNotFoundException;
} 

We then add a package-private, static variable of this type to each class that implements Externalizable and implement the deserialization code in a default instance of such a reader. This could look like this:

public class SomeClass implements Externalizable {
  private int one;
  private int two;

...

  static CustomExternalizableReader extReader = new CustomExternalizableReader() {
    public void readExternal(Object obj, ObjectInput in) throws IOException,
        ClassNotFoundException {
      SomeClass s = (SomeClass) obj;
      long uid = in.readLong();
      if (uid != serialVersionUID) {
        throw new IOException("Wrong serialVersionUID: " + uid);
      }
      int one = in.readInt();
      int two = in.readInt();
      s.init(one, two);
    }
  };

  // initialization method for readExternal
  void init(int one, int two) {
    this.one = one;
    this.two = two;
  }
}

Note that I also specified an init() method. Since init() and extReader are both package-private, they are not protected by our backwards-compatibility policy and we can change them in any release.

Now if in the next version of this class we add a new variable 'three' we have to change init() and the reader:

public class SomeClassNewVersion implements Externalizable {
  private int one;
  private int two;
  private int three;

  static final long serialVersionUID = 2L;

  public void readExternal(ObjectInput in) throws IOException,
      ClassNotFoundException {
    extReader.readExternal(this, in);
  }

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeLong(serialVersionUID);
    out.writeInt(one);
    out.writeInt(two);
    out.writeInt(three);
  }

  /**
   * This reader can only read the externalized format created with the same
   * version of this class. If backwards-compatibility is desired, a custom
   * reader has to be implemented.
   */
  static CustomExternalizableReader extReader = new CustomExternalizableReader() {
    public void readExternal(Object obj, ObjectInput in) throws IOException,
        ClassNotFoundException {
      SomeClassNewVersion s = (SomeClassNewVersion) obj;
      long uid = in.readLong();
      if (uid != serialVersionUID) {
        throw new IOException("Wrong serialVersionUID: " + uid);
      }
      int one = in.readInt();
      int two = in.readInt();
      int three = in.readInt();
      s.init(one, two, three);
    }
  };

  void init(int one, int two, int three) {
    this.one = one;
    this.two = two;
    this.three = three;
  }
}

Now if someone tries to deserialize an object that was written with an old Lucene version, an exception will be thrown.

But the user can simply implement an own, backwards-compatible reader:

    // Now the user implements their own backwards compatible reader
    SomeClassNewVersion.extReader = new CustomExternalizableReader() {
      public void readExternal(Object obj, ObjectInput in) throws IOException,
          ClassNotFoundException {
        SomeClassNewVersion c_new = (SomeClassNewVersion) obj;
        long uid = in.readLong();
        int one = in.readInt();
        int two = in.readInt();
        int three;
        if (uid == 1) {
          // old version - initialize with default value
          three = -3;
        } else {
          // new version
          three = in.readInt();
        }
        c_new.init(one, two, three);
      }
    };

With this approach we have to clearly document in the javadocs the externalization format. Also if externalizable classes contain private inner classes that need to be serialized, then those inner classes have to be made package-private.

The nice thing here is that we allow backwards-compatibility, but push the burden of maintaining it to the user.

I coded this all up as an example that I'm attaching here. Let me know what you think, please. The patch file contains a Demo.java with a main method that demonstrates what I'm proposing here.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> In the past, ensuring backwards-compatibility was often the part of writing patches that took the longest and involved the most discussions.

It very much still is, as I'm learning with #2532!

Your first example is missing the read/writeExternal methods.

I think the proposed approach is rather heavy-weight – we will have implemented readExternal, writeExternal, this new CustomExternalizableReader, package-private init methods, made private inner classes package-private, plus the need to javadoc specifically the current externalization format written for each of our classes, the future need to help users understand how they can achieve back compatibility by subclassing CustomExternalizableReader, etc.

I guess my feeling is all of that is a good amount more work than just deciding to directly implement back compatibility, ourselves.

E.g., to do your example in a future world where we do support back compat of serialized classes (NOTE – none of the code below is compiled/tested):

First a util class for managing versions:

public class Versions {
  private int current;

  int add(String desc) {
    // TODO: do something more interesting with desc
    return current++;
  }

  void write(ObjectOutput out) throws IOException {
    // TODO: writeVInt
    // current counts registered versions, so the latest version id is current-1
    out.writeByte((byte) (current - 1));
  }

  int read(ObjectInput in) throws IOException {
    // TODO: readVInt
    final byte version = in.readByte();
    if (version > current - 1)
      throw new IOException("this object was serialized by a newer version of Lucene (got " + version + " but expected <= " + (current - 1) + ")");
    return version;
  }
}

Then, someone creates SomeClass:

public class SomeClass implements Externalizable {
  private int one;
  private int two;

  private static final Versions versions = new Versions();
  private static final int VERSION0 = versions.add("start");

  public SomeClass() {};

  public void writeExternal(ObjectOutput out) throws IOException {
    versions.write(out);
    out.writeInt(one);
    out.writeInt(two);
  }

  public void readExternal(ObjectInput in) throws IOException {
    versions.read(in);
    one = in.readInt();
    two = in.readInt();
  }

  ...
}

Then on adding field three:

public class SomeClass implements Externalizable {
  private int one;
  private int two;
  private int three;

  private static final Versions versions = new Versions();
  private static final int VERSION0 = versions.add("start");
  private static final int VERSION1 = versions.add("the new field three");

  public SomeClass() {};

  public void writeExternal(ObjectOutput out) throws IOException {
    versions.write(out);
    out.writeInt(one);
    out.writeInt(two);
    out.writeInt(three);
  }

  public void readExternal(ObjectInput in) throws IOException {
    int version = versions.read(in);
    one = in.readInt();
    two = in.readInt();
    if (version >= VERSION1)
      three = in.readInt();
    else
      // default
      three = -3;
  }

  ...
}

In fact I think we should switch to a Versions utility class for writing/reading our index files...

asfimport commented 15 years ago

Michael Busch (migrated from JIRA)

> Your first example is missing the read/writeExternal methods.

Oops, I forgot to copy & paste it. It's in the attached patch file though.

> I think the proposed approach is rather heavy-weight

Really? If we go the Externalizable way anyway, then I think this approach doesn't add too much overhead. You only need to add init() and move the deserialization code from readExternal() to the reader's readExternal. It's really not too much more code.

And, the code changes are straightforward when the class changes. No need to worry about how to initialize newly added variables if an old version is read for example.

What I think will be the most work is documenting and explaining this. But this would be an expert API, so probably people who really need to use it are most likely looking into the sources anyway.

But for the record: I'm totally fine with using Serializable and just adding the serialVersionUID. Just if we use Externalizable, we might want to consider something like this to avoid new backwards-compatibility requirements.