apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

change index backwards compatibility policy. [LUCENE-5940] #7002

asfimport opened this issue 10 years ago

asfimport commented 10 years ago

Currently, our index backwards compatibility is unmanageable. The length of time in which we must support old indexes is simply too long.

The index back compat works like this: everyone wants it, but there are frequently bugs, and when push comes to shove, it's not a very sexy thing to work on/fix, so it's hard to get any help.

Currently our back compat "promise" is just a broken promise, because we cannot actually guarantee it for these reasons.

I propose we scale back the length of time for which we must support old indexes.


Migrated from LUCENE-5940 by Robert Muir (@rmuir), updated Dec 31 2020.

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

Big +1. Our current policy has us supporting indexes 4 years old, and given how long 4x is lasting, that will just keep stretching. Obviously there needs to be an upgrade path, but I don't think it needs to be so easy for someone who hasn't upgraded in 4 years.

My concrete proposal is supporting the current major release, plus the last minor release of the previous major release. That should provide an upgrade path by first updating to the last minor release of the major release you are using, followed by the latest of the next major release. Given the 4.x architecture with codecs, this should be much easier than it has been to maintain 3x index formats.

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

My concrete proposal is supporting the current major release, plus the last minor release of the previous major release.

That is what I was thinking about as well when reading the issue description. Having to keep bw compat for all 4.x codecs once 5.0 is released would be a nightmare.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

i understand the desire for changing the policy here. i wish i didn't have to care about backwards compat support, but it's just the nature of things. people have large indexes that can take a significant amount of time to reindex (due to a slow source, or complex processing)

the current proposal here would be problematic for any lucene users who do not release versions in lock step with lucene versions. Solr obviously would have limited issues here since a user could just upgrade to solr 4.99 (assuming 4.99 is the final 4.x version) and then solr 5.0, with no problems.

however, if product X released with lucene 4.88 and the last minor version in 4.x line was 4.99, then the upgrade process to get to a lucene 5.0 index is now convoluted and will require creation of custom offline tools to provide an upgrade path. This backwards compatibility requirement is now just shifted from the lucene devs to the lucene users and can no longer be a seamless transition.

the current policy does not have these issues since all that i would need to do is fire up the next version, do a forceMerge, and everything is up to date on latest codecs. (no offline processes required, search can continue to work during upgrade)
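
For illustration, a minimal sketch of the in-place upgrade described above, assuming a plain FSDirectory index and the 5.x-style constructors (the 4.x IndexWriterConfig and StandardAnalyzer also take a Version argument); forceMerge(1) rewrites every segment with the codec of the running version:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeUpgrade {
  public static void main(String[] args) throws Exception {
    // Open the existing index with the newer Lucene version on the classpath.
    try (Directory dir = FSDirectory.open(Paths.get(args[0]));
         IndexWriter writer =
             new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // Merging down to a single segment rewrites every segment with the
      // current default codec, so the whole index ends up in the new format.
      writer.forceMerge(1);
    }
  }
}
```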

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Tim but the "policy" is really a joke. it just locks things in with releases.

Currently if you are on lucene 2, then in order to get to lucene 4, you have to move to 3.x first.

If we released 5.0 right now, we would not have to deal with 3.x indexes anymore. We could release 6.0 e.g. within a year of that, and we'd contain the problem.

I think if i actually proposed 5.0 and took it seriously, no one would really complain. But it's bogus to do this and issue releases with not so many features just because it makes everyone feel better, when it's really the policy that is broken. That is what we should fix.

This is from someone who has spent the last 2 days doing nothing but fight back compat in lucene.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Having to keep bw compat for all 4.x codecs once 5.0 is released would be a nightmare.

Right, the fallback plan is to release 5.0, then rapidly release 6.0 (maybe just a few days after) so we can drop all the shit. That doesn't require a change to the backwards compatibility policy. But i hope everyone understands how ridiculous that is when we can just be reasonable instead.

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

however, if product X released with lucene 4.88 and the last minor version in 4.x line was 4.99, then the upgrade process to get to a lucene 5.0 index is now convoluted and will require creation of custom offline tools to provide an upgrade path. the current policy does not have these issues since all that i would need to do is fire up the next version, do a forceMerge, and everything is up to date on latest codecs. (no offline processes required, search can continue to work during upgrade)

We have a tool that does this without forceMerge. It just upgrades those segments that need upgrade and writes a new commit point. It is called IndexUpgrader and has a main method.

My idea would be to provide that tool, including everything it needs, as a self-executing JAR file, so you just need: java -jar lucene-indexupgrader-4.10.0.jar indexdir (basically it is already like that, but you need to build the classpath and command line manually).
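
A minimal sketch of what IndexUpgrader already does, assuming the 5.x-style APIs (the 4.x constructor additionally takes a Version argument, and the 4.x FSDirectory.open takes a File rather than a Path):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class UpgradeIndex {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
      // Rewrites only the segments that are not already in the current
      // format and writes a new commit point; no forceMerge is needed.
      new IndexUpgrader(dir).upgrade();
    }
  }
}
```

The equivalent command line today is roughly java -cp lucene-core-4.10.0.jar org.apache.lucene.index.IndexUpgrader indexdir; the proposed self-executing JAR would just fold the classpath into java -jar.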

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

5.0 should not be saddled with supporting 3.x indexes. 100% agree there

however, 5.0 should ideally continue to support 4.0-4.99 indexes (at least from the codec/index reading perspective)

the best place to handle backwards compat is in the core of lucene. otherwise, you are just going to have users all over the place doing their own interpretation of "backwards compat", getting it wrong, broken, etc. and will subsequently result in lots of irate users filing tickets.

if you only support the last minor version from the previous release, it makes it difficult for everyone who was not at that exact minor release.

also, to uwe's point, the "indexupgrader" tool is an offline process. also, in my situation, i would need custom packaging of that tool in order to provide ease of use/proper codec usage, etc. vs just firing up the index on 5.0 and doing a forceMerge. the custom packaging would also require including an "old" version of lucene in my project that would be packaged separately, and would just be a nightmare to maintain.

alternatively, i would just grab the source for all removed 4.x codecs i need and pull them into my project (this is not ideal since they are no longer maintained by lucene devs and may have dependency issues that would require porting)

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

however, 5.0 should ideally continue to support 4.0-4.99 indexes (at least from the codec/index reading perspective)

Who will do the work? Who will maintain this?

Won't be me.

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Maybe another option would be to have a policy that is purely time-based? Eg. codecs would be removed, even in minor releases, when they have not been the default codec for more than one year?

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

"firefox" release policy is the other option. We can just release new major versions every few months and keep things contained.

We can do this now, without any change to the "policy" :)

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

time based would be much more reasonable

as long as people are on a 4.x release that is less than 1-2 years old, they should be able to move directly to 5.0

supporting indexes 4+ years old is asking a bit much, but assuming an external release cycle of 1 year, a 1-2 year cutoff is manageable

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

"firefox" does not need to worry about an upgrade path for terabytes worth of data

they only need to worry about upgrading bookmarks and that's about it

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I would also agree with a time-based policy.

About the proposed "indexupgrader" JAR file: The idea was to make it self-contained, meaning it contains all classes that are needed for the upgrade, ideally repackaged with jarjar (or the Maven Shade plugin) so the classes live under a different package name containing the version. You could then also bundle it with a project and start the upgrade from there.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The way i see it, back compat is just like any other feature. If people don't step up to contribute to make it happen, then we drop it.

I'm done wasting days and days on it when i don't care about it.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

i fully understand the pain associated with maintaining back compat

i guess it would be good if you (and others) could enumerate all the issues involved here for full perspective (the description does not list them)

also, it should be on the developer who removes write support (or removes a codec) to add the backwards compat support/testing.

creating a new codec that supplants an old codec should not inherently require removal of write support for old codec.

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

Maybe another option would be to have a policy that is purely time-based?

I had thought about this before making my suggestion, but I think this has the problem of being very arbitrary, and hard to know what upgrade path is needed. For example, if the policy is 1 year, and I am at 4.3, and the latest is 5.6, how do I know what I need to upgrade to in order to get to 5.6? Is it 5.3.1 or 5.2.4? I think maintaining this table as old versions are dropped would be difficult in itself.

My idea would be to provide that tool, including everything it needs, as a self-executing JAR file

This is a great idea! In fact, I think we can do one better. We could provide this tool, as well as a "meta" tool, which knows how to download those tools for each release. It could then output something like:

Found index version 4.3.2
Latest version is 6.7.0
Upgrading index to 4.99.0...done
Upgrading index to 5.99.0...done
Upgrading index to 6.7.0...done
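
A purely hypothetical sketch of such a meta tool, assuming the self-contained lucene-indexupgrader-<version>.jar artifacts described above actually exist (they do not today) and hard-coding an illustrative upgrade chain rather than detecting the index version:

```java
import java.util.List;

public class MetaUpgrader {
  public static void main(String[] args) throws Exception {
    String indexDir = args[0];
    // The real tool would read the index version and compute this chain;
    // these version numbers are illustrative only.
    List<String> upgraderJars = List.of(
        "lucene-indexupgrader-4.99.0.jar",
        "lucene-indexupgrader-5.99.0.jar",
        "lucene-indexupgrader-6.7.0.jar");
    for (String jar : upgraderJars) {
      System.out.println("Upgrading index with " + jar + "...");
      Process p = new ProcessBuilder("java", "-jar", jar, indexDir)
          .inheritIO()
          .start();
      if (p.waitFor() != 0) {
        throw new RuntimeException("Upgrade step failed: " + jar);
      }
    }
    System.out.println("done");
  }
}
```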

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

the problem with the upgrade tool approach is that it doesn't scale to clusters with large numbers of indexes.

for instance, a cluster that has 50 indexes spread across a bunch of machines. this is now an involved manual task put in the hands of system administrators who don't really know what's going on under the hood.

that's just asking for trouble

it seems like the whole power of codecs is that you can avoid all this and allow for seamless transitions by having read only codecs for previous index formats.

are there technical issues here i'm unaware of beyond creating and maintaining the backwards compat tests? something outside of the codec mechanism that causes problems?

if not, just dump the read only codecs for old versions in a contrib module and let people upgrade at their leisure (and let the community find/fix bugs as they are encountered)

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

if not, just dump the read only codecs for old versions in a contrib module and let people upgrade at their leisure (and let the community find/fix bugs as they are encountered)

Already done in Lucene trunk: There is a new backwards module. In trunk you can read previous indexes only with this jar on the classpath (it is loaded via SPI).
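
As a sketch (assuming the trunk/5.x APIs): application code does not change at all; with the backward-codecs jar on the classpath, the per-segment codec names recorded in an old index are resolved via SPI and the index opens normally.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class OpenOldIndex {
  public static void main(String[] args) throws Exception {
    // Requires lucene-backward-codecs on the classpath so the old
    // per-segment codecs can be looked up through SPI (Codec.forName).
    try (Directory dir = FSDirectory.open(Paths.get(args[0]));
         DirectoryReader reader = DirectoryReader.open(dir)) {
      System.out.println("segments=" + reader.leaves().size()
          + " maxDoc=" + reader.maxDoc());
    }
  }
}
```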

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

are there technical issues here i'm unaware of beyond creating and maintaining the backwards compat tests? something outside of the codec mechanism that causes problems?

There are plenty. First of all, maintaining back compat codecs has a real cost to improving lucene in the future, because if e.g. I want to make a change to the codec API, i have to deal with tons of medieval index formats. The same goes for structural changes like making docvalues updatable (shai had to fight a lot here). Even stuff like simple code refactoring is expensive because it's just a ton of code.

Also, the old codecs hang behind on features. They might not support various features like offsets in the postings, payloads in the term vectors, missing bitsets for docvalues, or whole datastructure types (SORTED_SET/SORTED_NUMERIC), or even whole parts of the index (3.x has no docvalues at all). They are missing various useful statistics, etc. These are just ones i've worked on myself recently; there are more, and there are more coming (like Mike's range prefix feature). This makes things like testing difficult.

Backwards compat drags around a lot of stuff for a long time (see the packed ints api) that makes it more complex and hard to work with and make changes to. It prevents and discourages real improvements to lucene.

There are plenty of bugs in the back compat; the last few indexes have been riddled with them, some of them bad. It's undertested, overcomplex, and undermaintained. Again, not sexy stuff to work on, nobody wants to improve it.

Finally, users want to have more options, but until we can minimize this backwards compat, i'm personally going to push back very hard on any "options", because we simply cannot take on more back compat. So the codec API goes mostly wasted. Maybe we should rename it the "backcompat" api, because that's all it's currently good for. Backcompat hurts the users here in this case. If we didn't have so many ancient formats, we could instead provide (and actually support) "breadth", such as various options for the way to encode data, so users really can take advantage of it.

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

the problem with the upgrade tool approach is that it doesn't scale to clusters with large numbers of indexes.

Can you elaborate more? Your example of 50 indexes spread across many machines doesn't help me understand how it would be difficult to run this tool. I see the steps as:

  1. Install the newest lucene (you would already have to do this)
  2. Run the meta tool. This will download the necessary self-contained indexupgrader jars for previous releases, and follow the upgrade path to get to the current release.

are there technical issues here i'm unaware of beyond creating and maintaining the backwards compat tests?

I'd just like to reiterate what Robert said. Have you looked at how much code is involved in maintaining backcompat? Just for the current 3x and 4x, it is enormous. And you can't assume the codec API will stay the same. Changing the codec api means updating old codecs in some way so that they still work as expected (Robert's example with updateable DV). Minimizing that effort for a developer allows more rapid experimentation and iteration.

The advantage of the indexupgrader tool Uwe described is that it is completely self-contained. All the old codecs are there, and when that jar was created, it was tested thoroughly with the upgrade paths it supports. But those old codecs and upgrade paths don't have to be in the current codebase, which makes changing the current code easier.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree, that's the worst part of all. "trunk" should not be burdened with this stuff, but it's already completely overwhelmed with back compat.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

i would not consider old indexes not containing support for new features an issue. if you want to use new options/features/structures, you need to reindex, no problem here.

you don't have to convince me that supporting back compat sucks. i agree, but lucene is used by a lot of people for a lot of disparate use cases. removing support for back compat will drive people away since it removes seamless upgrade paths.

think what would have happened if microsoft had released 64-bit windows with no support for running old 32-bit programs. people still want to run old dos programs on windows (go figure, but they want/need it)

it hurts adoption of new versions if you don't provide the back compat. this just leaves a bunch of people running ancient versions of lucene because they don't have any good upgrade path other than complete reindexing.

if there is a bug in "feature x", a possible solution is to just remove "feature x", but this is gonna piss off everyone who relies on it, regardless of how much you may personally hate "feature x"

the main thing i see as a challenge that you mention here is that you want (or new features may require) refactoring the codec api.

this is an engineering challenge and would just require some thought out design to decide what "final api refactors" should be needed to support flexibility, addition of new features, and growth without requiring mucking with old codecs in the future.

right now, the IndexWriter and codecs are pretty muddled together in some cases. cleaning up these interfaces and making the codecs self contained should be a goal for any refactors to allow future innovation/addition of features.

as a lucene user, if back compat is yanked and not provided in 5.0 for all 4.x indexes, i will be extremely resistant to upgrade. I would be more inclined to fork the latest 4.x and ditch 5.0. 5.0 would have to offer something REALLY compelling to get me to adopt it.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

i would not consider old indexes not containing support for new features an issue. if you want to use new options/features/structures, you need to reindex, no problem here.

Because you are not even considering the developer pain. The tests man, maintaining the tests.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

Can you elaborate more? Your example of 50 indexes spread across many machines doesn't make me understand how it would be difficult to run this tool. I see the steps as:

here are the issues i would have with an "upgrade tool" approach:

  1. external network connectivity is not guaranteed
  2. i have special metadata written in the segment metadata that is important
  3. i use custom codec configuration that the upgrade tool would need to use
  4. replicated indexes need a lot of care
  5. this tool would need to be run once for each directory containing an index, for every node that contains indexes
    • this is an ops nightmare since i won't personally be running the tool. this leaves lots of room for user error that is avoided completely if the index upgrade is seamless (via read only codecs for old versions)
  6. custom directory implementations may muck up the works

in general, i don't see any way this "upgrade tool" would be useful to me without repackaging and adding a ton of extra code to do all the things i need to ensure a consistent index is emitted

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No offense Tim, but your comments exactly fit my description of this issue.

The index back compat works like this: everyone wants it, but there are frequently bugs, and when push comes to shove, it's not a very sexy thing to work on/fix, so it's hard to get any help.

I don't care what happens on this issue, personally, I'm done working on back compat completely until the policy changes. That includes the current in-progress 4.10.1 release. I've done more than my fair share of fighting it, and it just causes me endless frustration.

If people care about back compat, then they can go do things like regenerate indexes from previous lucene versions to ensure they aren't buggy like #7001 and that it's actually working. They can try to refactor out old cruft in some way and work on improving the APIs of "dead index formats".

But thats not for me.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

Because you are not even considering the developer pain. The tests man, maintaining the tests.

the pain will continue to exist, you are just shifting who feels it. again, i get how painful it is, but it's best to have that pain felt at the source (and handled properly and consistently by people who fully understand it) as opposed to pushing it all downstream, polluting the waters

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No, that's not correct. what you are saying there is "fuck you man, you do the work".

I will not.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

I don't care what happens on this issue, personally, I'm done working on back compat completely until the policy changes. That includes the current in-progress 4.10.1 release. I've done more than my fair share of fighting it, and it just causes me endless frustration.

fully your prerogative, this is a volunteer community.

i'm just putting in my 2 cents here since a change here will really be painful to me personally

of course i'm not a committer, so i have no final say

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to relax the policy

+1 for the ".99" approach: I think it's easier to "grok" than the time-based approach.

But if we do relax the policy I think we should also improve IndexUpgrader (or make a new top-level tool, which is what we expose to users, hiding the current IndexUpgrader, i.e. @rjernst's idea) to do this upgrade from any 4.x to any 5.x (or across more than 1 major release).

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

of course i'm not a committer, so i have no final say

Tim, please don't think that we are trying to ignore your concerns. While I understand your frustration (more work), I don't think the pain you could feel is really any different from today. There is no specific measurement that goes into what constitutes enough work for a release, just community sway. Technically, if someone is willing to do the work (#7006), and there are 3 +1's, and more +1's than -1's, a release can happen. I don't mean this as a threat, I only mean it to demonstrate how arbitrary the process can be, not guaranteeing you any kind of time between major releases. Because of this, you could be in the same situation you described with the shorter BWC policy.

The suggested policy would greatly simplify the work needed on the development side, and give us a clean slate for each major release. And at the same time, I think this could theoretically extend the ability to upgrade old indexes over a longer span. The meta tool I have proposed could be the link between all major versions. All it needs to do is be able to read what version an index was written with, so it knows the major version (and this ability can be confined to that tool, as it should be relatively simple to update if how to do that changes). I think this is much more powerful than today's policy, while at the same time allowing the API to be improved in significant ways across major releases, compared to now, where it cannot really change without enormous effort because of the need to continue reading the entire previous major version.

So from a user perspective, we want to make this work; it is not just for developers. Your main concerns seem to be about the tool being offline, the special segment metadata you write, and the network connectivity needed to grab the old upgraders.

First, I don't see a way around it being offline; the apis between major versions could differ in significant ways. But it is no different than if you had a 3x index today, and we released 5.0 tomorrow: you would first have to upgrade to a 4x index, why wouldn't you upgrade to 4.99? And that process would have to be offline, so adding an additional step of first going to 3.99 doesn't seem unreasonable.

Regarding special metadata, I think most users are just using the default codec as written. When you use a non-default setup, it will (most likely always) require additional work. I understand this pain, but it is pain you have put upon yourself. But if you already have code for 4x, then upgrading to 4.99 before changing your code to work with 5.0 should not be difficult, since within a major release the APIs should be stable.

As for network connectivity, it seems like this could just be a packaging issue? Would it help if each release had the metatool containing the necessary subjars for each previous release, so that it would not have to download anything (it would just make it a bit bigger)?

As developers we need this to happen, to maintain any kind of sanity in our ability to guarantee compatibility. As users you want backward compatibility to work as long as possible. I think this would actually serve both purposes, in a way that is advantageous for both sides.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

i fully understand the reasons for wanting to change the policy here. i absolutely hate maintaining backwards compat myself. it's just a nightmare that leaves lots of rotting code lying around waiting to wreak havoc and makes it dicey to add new functionality. i'm fully on board with that sentiment

but, i have to support it, and do so in a seamless online manner that is not prone to user error.

i also get the feeling a lot of the lucene devs in general don't think "full reindexing" is an issue and can just be done at any point with minimal cost (just a vibe i've picked up). my experience is that this can be a many months long process (slow sources). this seems to influence support for backwards compatibility, as well as support for changing configuration/schema options, for existing fields, etc

by all means, create a good upgrade tool people can use. however, it won't be useful for me and i will need to find a different solution (which will likely result in slowing my adoption of 5.0 when it is released)

i am in no way advocating that 5.0 should support reading 3.x indexes.

again, i'm just adding my perspective here so informed people can make a decision based on all points of view

if the policy changes, i will just have to adapt as necessary

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Actually 5.0 doesn't even need to read 4.x indexes. I had forgotten when I opened this JIRA issue that we already voted on this in 2010. (this vote passed).

[VOTE] Take 2: Open up a separate line for unstable Solr/Lucene development

This is a vote for the proposal discussed on the 'Proposal about
Version API "relaxation"' thread.  This thread replaces the first
VOTE thread!

The vote is to open up a separate parallel line of development, called
unstable (on trunk), where non-back-compatible changes, slated for the
next major release, may be safely developed.

But it's not a free for all: the back compat break must still be
carefully tracked in detail (maybe in CHANGES, maybe in a separate
more detailed "guide" -- tbd), including migration instructions, so
that this becomes the "migration guide" on how users can move to the
new major release.  If there are changes that break the index, we will
try very hard to create an index migration tool.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The meta tool I have proposed could be the link between all major versions.

I agree. In fact it's the current "policy", we voted it in 4 years ago, Uwe even wrote the tool, but everyone forgot :)

A few notes:

asfimport commented 10 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

i also get the feeling a lot of the lucene devs in general don't think "full reindexing" is an issue and can just be done at any point with minimal cost (just a vibe i've picked up).

You're definitely not wrong here. When first getting an index into production, and later when the application needs change, the user must make changes to the code (Lucene) or schema (Solr, elasticsearch, or other product) that are incompatible with the existing index, often daily or even more frequently. When a user obtains help from a mailing list or other support resource, such a change is VERY likely.

Reindexing is part and parcel of search. Users who are unable to efficiently perform a reindex will usually find themselves without the search capabilities that they really need, because they made incorrect early assumptions that can't be fixed without reindexing. This can be the case even if they go years without upgrading their Lucene libraries. If the actual source data is difficult to obtain, it's strongly recommended that it be gathered into an intermediate store with excellent retrieval characteristics, such as a database.

asfimport commented 10 years ago

Tim Smith (migrated from JIRA)

Reindexing is part and parcel of search

i think the general goal should be that this is not the case, especially as search is adopted more and more as replacements for systems that do not have these limitations/requirements (databases). obviously this is an ambitious goal that can likely never be fully realized.

also, "reindexing" comes in 2 distinct flavors:

live reindexing does have lots of pitfalls and may not always be viable. for instance, right now it is not possible to add offsets to an index using this approach. as soon as a new segment is merged with an old one, the offsets are blown away. i had filed a ticket for this. i'm not looking to reopen old wounds here, just pointing out an issue i had with this and had to work around.

live reindexing is the goal i strive to achieve when reindexing is required (always comes with a caveat to backup your index first for safety). some smart choices when designing the internal schema can reduce or eliminate many prospective issues here even without any core changes to lucene.

it's strongly recommended that it be gathered into an intermediate store

these recommendations are always valid to make (and i will make them), however this adds an entirely new system to the mix, as well as new hardware, services, maintenance, security, etc. also, given the scale and perhaps complexity of the documents, this may not even be enough and will still require a large amount of processing hardware to process these documents as fast as the index can index them, in a reasonable amount of time (days vs months). in general, this is just extra complexity that will be dropped due to the higher price tag and maintenance cost. then, when it finally is time to upgrade, the end-user expectation is that "oh, we already have the data indexed, why can't we just use that with the new software". this expectation is set due to the fact that many customers/users are used to working with databases. i do not have this expectation myself, however i have people downstream that do have these expectations and i need to do my best to accommodate them whether i like it or not.

note, i'm not trying to force any requirements on lucene devs, or soliciting advice on specific functionality, just pointing out some real world use cases i encounter related to discussion here.

asfimport commented 10 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

also, "reindexing" comes in 2 distinct flavors:

  • cold reindexing - rm -rf the index dir, re feed
    • requires 2x hardware or downtime
  • live reindexing - change config, restart system, re feed all docs, change is "live" once all docs have been reindexed
    • obviously a good idea to snapshot any previous index and config so you can restore later on error
    • minimal downtime (just restart)
    • minimal search interruption (some queries related to the change may not match old documents until reindex is complete)
    • old content can be replaced slowly over time to receive full functionality

I use Solr. My reindexing method is actually a combination of the two you've mentioned. For every shard, I have a live core and a build core. When a reindex is required, I start importing from my database into the build cores. In the meantime, the live cores are still being updated once a minute with new data and deletes. When the full import is done, I apply all relevant changes to the build cores, then swap them with the live cores. Once that copy of my index is rebuilt, I re-enable it so that the load balancer can use it again.
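
As an illustration of the swap step only (the host and core names below are placeholders, not Shawn's actual setup), the Solr CoreAdmin SWAP action exchanges which core answers under the live name once the rebuild is finished:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SwapCores {
  public static void main(String[] args) throws Exception {
    // Placeholder host and core names; SWAP atomically exchanges the two
    // cores so the freshly built index starts serving live traffic.
    URL url = new URL("http://localhost:8983/solr/admin/cores"
        + "?action=SWAP&core=shard1_live&other=shard1_build");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    System.out.println("SWAP returned HTTP " + conn.getResponseCode());
    try (InputStream in = conn.getInputStream()) {
      in.transferTo(System.out); // print the CoreAdmin response body
    }
  }
}
```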

asfimport commented 3 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

@rmuir Can we close this? I ran across it while searching for something else....

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I'll echo Erick's question, and close soon if there isn't any further comment. As I understand it, we did this. The current (de facto, at least) policy is to support the last major release with backcompat, no? I found this documented here: https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility maybe it's elsewhere too?