apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add runnable index upgrader [LUCENE-5956] #7018

Open asfimport opened 10 years ago

asfimport commented 10 years ago

As a spinoff from the discussion in #7002, I'd like to add a new module, "lucene-upgrader", move IndexUpgrader into it, and embed older versions of Lucene (just enough to upgrade indexes) in the module's built jar. This would be runnable from the command line with something like: java -jar lucene-upgrader-4.11.0.jar /path/to/index


Migrated from LUCENE-5956 by Ryan Ernst (@rjernst), updated Jul 10 2015

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi, a few questions:

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

> should it really be separate from the current 'backwards-codecs' module? Not saying if it's right or wrong, but that's where the tests for the upgrader are.

Sure, it could be in the same module. I just thought "running" the backwards codecs sounded weird from a user perspective (not obvious at all).

> how valuable is it to support older than N-1 indexes, when the analysis chains used to create them have now changed?

Yeah, I'm not sure. It's a good point... how would analysis be upgraded at all except by reindexing? Implying really that to go across major versions you need to reindex?

> how will SPI etc work here, will the older versions somehow have their own SPI lists in different classloader or something?

Well, I think the only overlapping SPI would be with the last minor release? e.g. 4.11 upgrader and 5.0 upgrader would both be able to read 4.11. But yeah...need to figure it out.

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

Regarding analysis, I think this is already a problem, since you may be using the newest analysis chain without having rewritten those ancient 3.x segments, which contain terms produced by the old analysis chain. But in reality, unless drastic changes happen in analysis, this is usually "ok"? You will have some terms that are never found in either direction, but it shouldn't break all queries. So to be safe, you probably want to reindex your data at least every major version, but if the data is ancient (for example, very old time-based indexes), perhaps just being able to search it at all is good enough? This tool would make that possible, as opposed to trying to bring the search up to date as far as analysis goes.

asfimport commented 9 years ago

Trejkaz (migrated from JIRA)

> how valuable is it to support older than N-1 indexes, when the analysis chains used to create them have now changed?

I'm currently writing a library to migrate from version 2 to 5. For most active indices it will be from 3 to 5. Which is to say, 100% of our existing users are on N-2. Nobody is using N-1 because we never released a supported version using that version. I'm sure there will be other companies in the same position.

Maintaining backwards compatibility for analysis is certainly possible and we're doing it. It's more work, particularly when the analysis API makes big changes, but as far as I know, we don't have the option to reindex because people complain about reindexing when their initial index took a month to create in the first place.

> how will SPI etc work here, will the older versions somehow have their own SPI lists in different classloader or something?

What I'm trying to do, which should work, and appears to be working, is to rename all org.apache.lucene for a given version to some.other.place.luceneN. At the moment, at least classname lookups appear to be working. (Other weird things are going on, but I don't have the foggiest idea what's going on there yet.)

asfimport commented 9 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> What I'm trying to do, which should work, and appears to be working, is to rename all org.apache.lucene for a given version to some.other.place.luceneN. At the moment, at least classname lookups appear to be working. (Other weird things are going on, but I don't have the foggiest idea what's going on there yet.)

You have to load them in different classloaders, see my reply on user mailing list:

Hi,

It could be the reason for this is your classpath:

If you load all Lucene versions into the same classloader (but with different package names - I assume you use the Maven Shade plugin to do this), Lucene 3 will load perfectly, yes; Lucene 4 will also load perfectly, yes! But when it tries to load Lucene 5, it will fail to load all shipped codecs. Codecs are not identified by their Java package name, but by the symbolic name (like "Lucene50") as written into the index. The SPI interface of Lucene will load all codecs from the classpath and save them in a lookup map based on the symbolic name. If the Lucene 4 JAR files are placed before the Lucene 5 JARs, the "slots" for codec names are already taken (because the Lucene 5 loader will see the Lucene 4 codecs first), so loading the Lucene 5 variants of old codecs is a no-op. This may cause those problems, because Lucene 5 ships with "modified" versions of the old Lucene 4 codecs - but they are not identical.
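
The name collision described above can be modelled with a small first-wins registry. This is only an illustrative sketch, not Lucene's actual SPI loader: `CodecRegistry`, its methods, and the shaded class names used with it are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a lookup map keyed by symbolic codec name.
// It is NOT Lucene's real loader; it only mimics "first one on the classpath wins".
final class CodecRegistry {
    private final Map<String, String> byName = new LinkedHashMap<>();

    // Registers an implementation under its symbolic name.
    // Returns false (and changes nothing) if the name is already taken.
    boolean register(String symbolicName, String implementationClass) {
        return byName.putIfAbsent(symbolicName, implementationClass) == null;
    }

    // Returns the implementation registered for the name, or null if none.
    String lookup(String symbolicName) {
        return byName.get(symbolicName);
    }
}
```

If a made-up `shaded.lucene4.Lucene40Codec` is registered under "Lucene40" first, a later attempt to register `shaded.lucene5.Lucene40Codec` under the same name returns false, and lookups keep returning the Lucene 4 class, regardless of the differing package names.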

You can only work around this by loading the Lucene JARs into completely different classloaders (don't forget to also set the context classloader!). In that case you would not even need to change package names!
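
A minimal sketch of that workaround, using only JDK classes, might look like the following. `IsolatedLoader` and `runIsolated` are hypothetical names; the sketch assumes each Lucene version's JARs are available as ordinary URLs and that the task loads the old Lucene classes reflectively through the loader it is handed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Function;

final class IsolatedLoader {
    // Runs the task with a classloader that sees only the given JARs plus the
    // bootstrap classes, then restores the previous context classloader.
    static <T> T runIsolated(URL[] jars, Function<ClassLoader, T> task) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        // A null parent means the application classpath is not visible,
        // so another Lucene version on it cannot leak into this loader.
        try (URLClassLoader isolated = new URLClassLoader(jars, null)) {
            current.setContextClassLoader(isolated);
            return task.apply(isolated);
        } catch (IOException e) {
            // Thrown only if closing the loader fails.
            throw new UncheckedIOException(e);
        } finally {
            current.setContextClassLoader(previous);
        }
    }
}
```

Code compiled against the application classpath still resolves classes through its own loader, so inside the task the old classes would have to be reached with something like `Class.forName(name, true, loader)`.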

asfimport commented 9 years ago

Trejkaz (migrated from JIRA)

The class loader approach is definitely turning out to be problematic. It all worked in testing, but now that it's in a jar, it looks like I have found a bug in JarURLConnection on the Oracle JVM where it's calling indexOf instead of lastIndexOf to split the jar URL, which results in failing to parse the URL correctly and throwing a MalformedURLException.
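
To make the indexOf versus lastIndexOf difference concrete, here is a small sketch that splits a nested jar spec at the first and at the last "!/" separator. It does not reproduce the JDK's code, and the URL form in the usage note is a made-up example, since the thread does not show the exact URL that failed; `JarUrlSplit` and its methods are hypothetical.

```java
final class JarUrlSplit {
    static final String SEPARATOR = "!/";

    // Splits at the FIRST separator: the left part stops at the outer jar.
    static String[] splitAtFirst(String spec) {
        return splitAt(spec, spec.indexOf(SEPARATOR));
    }

    // Splits at the LAST separator: the left part includes the nested jar.
    static String[] splitAtLast(String spec) {
        return splitAt(spec, spec.lastIndexOf(SEPARATOR));
    }

    private static String[] splitAt(String spec, int index) {
        if (index < 0) {
            throw new IllegalArgumentException("no !/ separator in: " + spec);
        }
        return new String[] {
            spec.substring(0, index),
            spec.substring(index + SEPARATOR.length())
        };
    }
}
```

For the spec `file:/app/upgrader.jar!/lib/lucene3.jar!/META-INF/MANIFEST.MF`, splitting at the first separator yields the entry name `lib/lucene3.jar!/META-INF/MANIFEST.MF`, while splitting at the last yields `META-INF/MANIFEST.MF` inside `file:/app/upgrader.jar!/lib/lucene3.jar`.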

URLClassLoader uses java.net.URL exclusively to load resources, so it can't be used.

I'm not sure how other people work around this but I guess the options are:

(a) Unpack the embedded jar to a normal file path.
(b) Write a completely new URLClassLoader which avoids using URL to do any loading, except of course the jar file itself has to be accessible via URL, because that's the only thing ClassLoader will give me to work with.
(c) Avoid using nested jars and just nest the actual class files, but off in a hidden directory. But still treat them as resources, so the original names can be used.
(d) Give up and just go back to renaming the Lucene packages and suffer the autocomplete encouraging completing the wrong class from time to time.
(e) Other ideas I haven't considered.
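
As a rough sketch of option (a), the nested jar's bytes could be copied to a temporary file so that it gets a plain file: URL, which a stock URLClassLoader can then read. `EmbeddedJars`, `unpack`, and `loaderFor` are hypothetical names, and how the nested jar's stream is obtained is left to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class EmbeddedJars {
    // Copies the nested jar's bytes to a temp file that is deleted on JVM exit.
    static Path unpack(InputStream nestedJar, String prefix) throws IOException {
        Path target = Files.createTempFile(prefix, ".jar");
        target.toFile().deleteOnExit();
        // createTempFile already created the file, so it must be replaced.
        Files.copy(nestedJar, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    // Creates a parentless loader over the unpacked jar, so classes on the
    // application classpath are not visible through it.
    static URLClassLoader loaderFor(Path jar) throws IOException {
        return new URLClassLoader(new URL[] { jar.toUri().toURL() }, null);
    }
}
```

The input stream might come from something like `getResourceAsStream("/lib/lucene3.jar")`, where that resource path is only an assumption for illustration.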