apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.03k forks source link

Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter [LUCENE-2960] #4034

Closed asfimport closed 13 years ago

asfimport commented 13 years ago

In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think).


Migrated from LUCENE-2960 by Shay Banon (@kimchy), resolved Mar 18 2011 Attachments: LUCENE-2960.patch

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Pulling back into 3.1 as blocker.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If you call IWC.setRAMBufferSizeMB even after the IW is init'd, this almost works; I think the only problem is that DocumentsWriter has a private copy of the RAM buffer size. We can simply fix that (it can pull from config, too), and then re-enable the test we used to have that asserts that live changes to the RAM buffer take effect.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

IMO, this is such an obscure usecase that we shouldn't introduce any extra complexity to implement it. For example, we should not guarantee exactly when a change to setRAMBufferSizeMB will be seen... that could introduce the need for extra synchronization / memory barriers.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

my opinion (thanks for setting to 3.1) is solely based on how we do releases:

I just don't think we should deprecate things in 3.1, then un-deprecate in 3.2...

Really this is our mistake, we shouldn't have committed IWC and left TestIndexWriter.testChangingRamBufferSize or whatever using the deprecated setter without hashing this out further: because this test would fail if it tried to use IWC.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

IMO, this is such an obscure usecase that we shouldn't introduce any extra complexity to implement it.

That this comes from Shay makes it not an "obscure" use case, IMO.

Ie, Shay (Elastic Search) is doing awesome things with Lucene, so if some change in Lucene is adversely impacting Elastic Search, we should listen.

We already don't hear from Shay nearly as often as we should ;)

For example, we should not guarantee exactly when a change to setRAMBufferSizeMB will be seen... that could introduce the need for extra synchronization / memory barriers.

That's really an impl. detail. Ie, what really matters is whether or not this change results in any real change to index throughput. If a volatile read per indexed doc (w/ rare volatile write) really costs too much (unlikely) then we can make this "best effort".

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Really this is our mistake, we shouldn't have committed IWC and left TestIndexWriter.testChangingRamBufferSize or whatever using the deprecated setter without hashing this out further: because this test would fail if it tried to use IWC.

I agree. Really we should not commit API changes w/o also cutting over all but a few (so we still test the deprecated path) usages of the deprecated API to the new one (and addressing whatever issues that uncovers).

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

For the depreciated path we should have backwards tests...

asfimport commented 13 years ago

Earwin Burrfoot (migrated from JIRA)

As I said on the list - if one needs to change IW config, he can always recreate IW with new settings. Such changes cannot happen often enough for recreation to affect indexing performance.

The fact that you can change IW's behaviour post-construction by modifying unrelated IWC instance is frightening. IW should either make a private copy of IWC when constructing, or IWC should be made immutable.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

As I said on the list - if one needs to change IW config, he can always recreate IW with new settings.

That's not really true, in general. If you have a large merge running then closing the IW can take an unpredictable amount of time. You could abort the merges on close, but that's obviously not great.

Furthermore, closing the IW also forces you to commit, and I don't like tying changing of configuration to forcing a commit.

In fact, it doesn't make sense to me to arbitrarily prevent settings from being live, just because we've factored out IWC as a separate class. Many of these settings were "naturally" live before the IWC cutover, and have no particular reason not to be (besides this API change).

We could also rollback the IWC change. I'm not saying that's a great option, but, it should be on the table.

InfoStream, for example, should remain live: eg, maybe I'm having trouble w/ optimize, so, I turn on infoStream and then call optimize.

The flushing params (maxBufferedDocs/Deletes/RAM) should also remain live, since we have a very real user/data point (Shay) relying on this.

But take MergedSegmentWarmer (used to be live but is now unchangeable). This is a setting that obviously can easily remain live; there's no technical reason for it not to be. So why should we force it to be unchangeable? That can only remove freedom, freedom that is perhaps valuable to an app somewhere.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

InfoStream, for example, should remain live

Agree - it's logging.

But take MergedSegmentWarmer (used to be live but is now unchangeable). This is a setting that obviously can easily remain live; there's no technical reason for it not to be.

Anyone's implementation can be live (i.e. the impl could change it's behavior over time for whatever reason).

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Anyone's implementation can be live (i.e. the impl could change it's behavior over time for whatever reason).

Well, that's really cheating. I mean, yes, technically it's an out, so it's certainly possible that an app can do the switching inside its class... but that's not nice :)

EG if an app has LoadsAllDocsWarmer and VisitsAllPostingsWarmer (say) and they want to switch between (for some reason)... they'd like have to make a SegmentWarmerSwitcher class or something.

Seems silly because IW could care less if you switch up your warmer. It just needs to get the current warmer every time it goes and warms a segment...

asfimport commented 13 years ago

Earwin Burrfoot (migrated from JIRA)

Furthermore, closing the IW also forces you to commit, and I don't like tying changing of configuration to forcing a commit.

Like I said, one isn't going to change his configuration five times a second. It's ok to commit from time to time?

So why should we force it to be unchangeable? That can only remove freedom, freedom that is perhaps valuable to an app somewhere.

Each and every live reconfigurable setting adds to complexity. At the very least it requires proper synchronization. Take your SegmentWarmer example - you should make the field volatile. While it's possible to chicken out on primitive fields (except long/double), as Yonik mentioned earlier, making nonvolatile mutable references introduces you to a world of hard-to-catch unsafe publication bugs (yes, infoStream is currently broken!). For more complex cases, certain on-change logic is required. And then you have to support this logic across all possible code rewrites and refactorings.

asfimport commented 13 years ago

Shay Banon (@kimchy) (migrated from JIRA)

Heya,

If I had to choose between being able to change things in real time to better concurrency thanks to immutability, I would definitely go with better concurrency. I have no problems with closing the writers and reopening them, though, as Mike said, this can come with a big cost.

The funny thing is that a lot of the setters that were already there on the IndexWriter are still exposed, basically, through settings on the relevant MergePolicy, so I don't think we are talking about that many setter to begin with (I don't think we should bring those back to the IndexWriter).

I think that the notion of IWC is a good one, and should remain, but only to provide construction time parameters to IW. It should not be consulted once the construction phase of IW is done. If explicit real time parameters are to be set, then IW should expose it as a setter. Now, the question is which, if any, setters should be exposed.

Going through the list of current setters on IW, my vote is for the setRAMBufferSizeMB one. I am not sure that its that obscure use case. I believe Solr for example has a notion of cores (or something like that), so it can also be adaptive in terms of indexing buffer size dependent on the number of cores running in the VM. Also, one can easily run a system where it does bulk indexing, and then lowers the indexing buffer size for more "streamline" work. Its just a shame to close the writer for that (and having to pause all indexing work while this happens).

The term interval and divisor, I agree, are such obscure (funnily, I use the divisor quite a lot), that closing the writer and opening it again make sense.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If I had to choose between being able to change things in real time to better concurrency thanks to immutability, I would definitely go with better concurrency.

We're not talking about better concurrency here – making these fields volatile will be in the noise.

If ever it were not in the noise, I agree better concurrency should win.

I don't think we are talking about that many setter to begin with (I don't think we should bring those back to the IndexWriter).

True, but we are setting a precedent here. Ie this will apply to further settings we add to IWC, apply to any other classes that we pull config out of (eg IR), etc. I don't like that factoring out config means loss of functionality.

It should not be consulted once the construction phase of IW is done.

Why such purity? What do we gain?

I'm all for purity, but only if it doesn't interfere w/ functionality. Here, it's taking away freedom...

I would prefer to have the config you passed into IW remain "live". You can also do IW.getConfig().setXXX later too.

In fact it should be fine to share an IWC across multiple writers; you can change the RAM buffer for all of them at once.

If explicit real time parameters are to be set, then IW should expose it as a setter.

But then whenever we change our mind about liveness, we have to change the API? I'd like to decouple liveness of a setter (really a semantic aspect of that API, documented in the jdocs) from which API is used to set it.

I think a config param should be live by default and only if there's some hardship / reason to not have it so, should we make an exception. Most of these params are trivial to be live (they were before the IWC change).

making nonvolatile mutable references introduces you to a world of hard-to-catch unsafe publication bugs (yes, infoStream is currently broken!).

Well, in theory, yes, in practice, no. This is like stating your HTML is buggy because it fails to put a closing </html> tag and so some browser could fail to render it ;)

I doubt there's any JVM out there where our lack-of-volatile infoStream causes any problems.

But, of course, we should make them volatile to be safe...

Each and every live reconfigurable setting adds to complexity.

That's the exception not the rule. Most of these settings are used at certain times – on flushing a new seg, on warming a seg, etc. – and there's no added complexity to simply pulling their current value from the config.

For more complex cases, certain on-change logic is required.

For such cases we can state that changes to the config do not take effect; eg IndexingChain is a good example I think. But I think the default should be that changes are live, unless otherwise stated

I don't think we should go out of our way to be making settings live, if there's any hair involved. But for the settings where there's no hair involved, they should be live.

asfimport commented 13 years ago

Shay Banon (@kimchy) (migrated from JIRA)

Just a note regarding the IWC and being able to consult it for live changes, it feels strange to me that settings something on the config will affect the IW in real time. Maybe its just me, but it feels nicer to have the "live" setters on IW compared to IWC.

I also like the ability to decouple construction time configuration through IWC, and live settings through setters on IW. It is then very clear what can be set on construction time, and what can be set on a live IW. It also allows for compile time / static check for the code what can be changed at what lifecycle phase.

asfimport commented 13 years ago

Earwin Burrfoot (migrated from JIRA)

Why such purity? What do we gain?

I'm all for purity, but only if it doesn't interfere w/ functionality. Here, it's taking away freedom...

We gain consistency and predictability. And there are a lot of freedoms dangerous for developers.

In fact it should be fine to share an IWC across multiple writers; you can change the RAM buffer for all of them at once.

You've brought up a purrfect example of how NOT to do things. This is called 'action at a distance' and is a damn bug. Very annoying one. I've thoroughly experienced it with previous major version of Apache HTTPClient - they had an API that suggested you can set per-request timeouts, while these were actually global for a single Client instance. I fried my brain trying to understand why the hell random user requests timeout at hundred times their intended duration. Oh! It was an occasional admin request changing the global.

<irony> You know, you can actually instantiate some DateRangeFilter with a couple of Dates, and then change these Dates (they are writeable) before each request. Isn't it an exciting kind of programming freedom? Or, back to our current discussion - we can pass RAMBufferSizeMB as an AtomicDouble, instead of current double, then we can use .set() on an instance we passed, and have our live reconfigurability. What's more - AtomicDouble protects us from word tearing! </irony>

I doubt there's any JVM out there where our lack-of-volatile infoStream causes any problems.

Er.. While I have never personally witnessed unsynchronized long/double tearing, I've seen the consequence of unsafely publishing a HashMap - an endless loop on get(). It happened on your run off the mill Sun 1.6 JVM. So the bug is there, lying in wait. Maybe nobody ever actually used the freedom to change infoStream in-flight, or the guy was lucky, or in his particular situation the field was guarded by some unrelated sync.

While I see banishing live reconfiguration from IW as a lost cause, I ask to make IWC immutable at the very least. As Shay said - this will provide a clear barrier between mutable and immutable properties.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Maybe its just me, but it feels nicer to have the "live" setters on IW compared to IWC.

+1.

I agree that live settings on IWC feels too much 'action at a distance'. It's not a great pattern.

FWIW, live config on the IW and a static IWC feels right to me. Earwin makes a strong argument for dropping live altogether - though I'm still liking the idea of allowing it where it's easy and especially where it has existed.

And I've always felt unsafe publication was scary...

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Er.. While I have never personally witnessed unsynchronized long/double tearing, I've seen the consequence of unsafely publishing a HashMap - an endless loop on get().

I've also seen JMM strike too – it caused one of our unit tests to spin forever, because a "volatile" was missing.

But this will never impact rarely used fields (infoStream, termIndexInterval, segmentWarmer, etc.), in practice.

Really we need an anal Java impl. (or, maybe, CPU) that randomly asserts its "rights" under JMM, to hold a cached copy of any field that's not volatile for unusual/random lengths of time (basically an "adversary" yet still playing by the JMM rules). Such an impl would find TONS of JMM bugs in Lucene (and I imagine any other Java app/library tested).

Yet, no "real" Java impl out there will ever do this since doing so will simply make that Java impl appear buggy. (Well, and, it'd be bad for perf. – obviously the Java impl, CPU cache levels, should cache only frequently used things).

It's exactly why all web browsers today are tolerant to a missing </html> tag and no browser could afford to suddenly refuse to render because you're missing the </html> tag.

I'm not saying we shouldn't put in our </html> tags in Lucene; we definitely should... we have no choice. But, in practice, these missing </html> tags all throughout Lucene are not a problem.

I ask to make IWC immutable at the very least

IWC cannot be made immutable – you build it up incrementally (new IWC(...).setThis(...).setThat(...)). Its fields cannot be final. (Well, one field can and is: analyzer).

How about this as a compromise: IW continues cloning the incoming IWC on init, as it does today. This means any changes to the IWC instance you passed to IW will have no effect on IW.

But, if you want to change something live, you can IW.getConfig().setFoo(...). The config instance is a private clone to that IW.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

How about this as a compromise: IW continues cloning the incoming IWC on init, as it does today. This means any changes to the IWC instance you passed to IW will have no effect on IW.

+1

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch, implementing the proposed compromise.

I brought back the two prior unit tests, verifying changes to ram buffer take effect live; they pass.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Hmmm, infoStream is just for debugging... should we really make it volatile? Although volatile reads are cheap on current x86, they do prevent optimizations/reorderings across them. Too many volatile reads in inner loops can add up.

infoStream is a PrintStream, which synchronizes anyway, so it should be safe to omit the volatile.

asfimport commented 13 years ago

Earwin Burrfoot (migrated from JIRA)

infoStream is a PrintStream, which synchronizes anyway, so it should be safe to omit the volatile

You're absolutely right here.

Yet, no "real" Java impl out there will ever do this since doing so will simply make that Java impl appear buggy.

Sorry, but "real" Java impls do this. The case with endless get() happened on a map that was never modified after being created and set. Just one of the many JVM instances on many machines got unlucky after restart.

Well, and, it'd be bad for perf. – obviously the Java impl, CPU cache levels, should cache only frequently used things

Java impls don't cache things. They do reorderings, they also keep final fields on registers, omitting reloads that happen for non-final ones, but no caching in JMM-related cases. Caching here is done by CPU, and it caches all data read from memory.

IWC cannot be made immutable – you build it up incrementally (new IWC(...).setThis(...).setThat(...)). Its fields cannot be final.

Setters can return modified immutable copy of 'this'. So you get both incremental building and immutability.

How about this as a compromise: IW continues cloning the incoming IWC on init, as it does today. This means any changes to the IWC instance you passed to IW will have no effect on IW.

What about earlier compromise mentioned by Shay, Mark, me? Keep setters for 'live' properties on IW. This clearly draws the line, and you don't have to consult Javadocs for each and every setting to know if you can change it live or not.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmmm, infoStream is just for debugging... should we really make it volatile?

I'll remove its volatile...

IWC cannot be made immutable – you build it up incrementally (new IWC(...).setThis(...).setThat(...)). Its fields cannot be final.

Setters can return modified immutable copy of 'this'. So you get both incremental building and immutability.

Oh yeah. But then we'd clone the full IWC on every set... this seems like overkill in the name of "purity".

What about earlier compromise mentioned by Shay, Mark, me? Keep setters for 'live' properties on IW. This clearly draws the line, and you don't have to consult Javadocs for each and every setting to know if you can change it live or not.

I really don't like that this approach would split IW configuration into two places. Like you look at the javadocs for IWC and think that you cannot change the RAM buffer size.

IWC should be the one place you go to see which settings you can change about the IW. That some of these settings take effect "live" while others do not is really an orthogonal (and I think, secondary, ie handled fine w/ jdocs) aspect/concern.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I really don't like that this approach would split IW configuration into two places. Like you look at the javadocs for IWC and think that you cannot change the RAM buffer size.

IWC should be the one place you go to see which settings you can change about the IW. That some of these settings take effect "live" while others do not is really an orthogonal (and I think, secondary, ie handled fine w/ jdocs) aspect/concern.

You can just as easily argue that the javadocs for IWC could explain that live settings are on the IW.

That pattern just smells wrong.

But, if you want to change something live, you can IW.getConfig().setFoo(...). The config instance is a private clone to that IW.

This is better than nothing.

Another thought is to offer all settings on the IWC for init convenience and exposure and then add javadoc about updaters on IW for those settings that can be changed on the fly - or one update method and enums...

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

How about an IWC base class, extended by IWCinit and IWClive. IWCinit has setters for everything, and IW.getConfig() returns IWClive, which has no setters for things you can't set on the fly.

asfimport commented 13 years ago

Earwin Burrfoot (migrated from JIRA)

Oh yeah. But then we'd clone the full IWC on every set... this seems like overkill in the name of "purity".

So what? What exactly is overkill? Few wasted bytes and CPU ns for an object that's created a couple of times during application lifetime? There are also builders, which are very similar to what Steven is proposing.

Another thought is to offer all settings on the IWC for init convenience and exposure and then add javadoc about updaters on IW for those settings that can be changed on the fly

That's exactly how I'd like to see it.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Its exactly the lack of consensus we see here, thats why I am 100% against having the setter approach.

I'm totally against some deprecation/undeprecation loop because we in future releases another setting wants to be "live".

It seems the only way we can avoid this, is for javadoc to be the only specification as to whether a setting does or does not take effect "live".

asfimport commented 13 years ago

Earwin Burrfoot (migrated from JIRA)

You avoid deprecation/undeprecation and binary incompatibility, while incompatibly changing semantics. What do you win?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

You win the fact that this is such an expert thing, and it should not confuse 99% of users who won't need to change these settings in a live way.

This is a central API to using lucene, sorry i would rather see IWConfig be reverted completely than see this deprecation/undeprecation loop, it would just cause too much confusion.

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

How about an IWC base class, extended by IWCinit and IWClive. IWCinit has setters for everything, and IW.getConfig() returns IWClive, which has no setters for things you can't set on the fly.

I tried to implement this, but couldn't figure out a way to avoid code and javadoc duplication and/or separation for the live setters, which need to be on both the init and live versions. Duplication/separation of this sort would be begging for trouble. (The live setters can't be on the base class because the init and live versions would have to return different types to allow for proper chaining.)

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

How about an IWC base class, extended by IWCinit and IWClive. IWCinit has setters for everything, and IW.getConfig() returns IWClive, which has no setters for things you can't set on the fly.

I tried to implement this, but couldn't figure out a way to avoid code and javadoc duplication and/or separation for the live setters, which need to be on both the init and live versions.

An annotation processor that looks for @Live annotations on setters, then generates source for a LiveIWC class, an instance of which would be returned by IW.getConfig(), would solve the duplication/separation problem. No extension required: LiveIWC could forward all getters and the live setters to a cloned IWC.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Oh yeah. But then we'd clone the full IWC on every set... this seems like overkill in the name of "purity".

So what? What exactly is overkill? Few wasted bytes and CPU ns for an object that's created a couple of times during application lifetime? There are also builders, which are very similar to what Steven is proposing.

I don't like that this'd be an O(N*M) cost API when you use it. Sure, N and M are tiny, and you use this API very rarely, but I still don't like it ;)

In addition, because this is all in the name of "purity" which as far as I can see has no real value besides "purity". It's... incestuous. And, I'm a pragmatist, I guess.

An annotation processor that looks for @Live annotations on setters, then generates source for a LiveIWC class, an instance of which would be returned by IW.getConfig(), would solve the duplication/separation problem. No extension required: LiveIWC could forward all getters and the live setters to a cloned IWC.

I think this is overkill? (Ie to have @Live compile to LiveIWC vs InitIWC). Though, @Live would be nice for jdocs?

You win the fact that this is such an expert thing, and it should not confuse 99% of users who won't need to change these settings in a live way.

Right – simple things should be simple and complex things should be possible.

The current patch achieves this – the 99% of "simple" users that just want to config IW and create it find all of the config in one place. The 1% complex cases (need to change live settings) are able to do so, but must read the jdocs for each setter to verify it's supported. The API should be designed around the simple users not the complex ones, as the current patch does.

So... I think the current patch is ready to commit (except, I'll remove the </html> tag for infoStream & defaultInfoStream).

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The current patch achieves this – the 99% of "simple" users that just want to config IW and create it find all of the config in one place. The 1% complex cases (need to change live settings) are able to do so, but must read the jdocs for each setter to verify it's supported.

The proposed alternatives sound just as good as this? In the proposed compromises, the 99% of simple users will see have one place to config IW as well for the avg 'set up front' use case. Perhaps the complex users could also have an API with a better pattern and it doesn't have to be either or as you seem to lead...

An IWC that is 'partially' live and can be changed externally after passing to the IW is just an inferior pattern plain and simple, and doesn't gain you much.

The API should be designed around the simple users not the complex ones, as the current patch does.

This really depends. If the simple users can be satisfied, and the API can also be decent for complex users, win/win.

I guess I would place my bets that there will not be a ton of deprecations loops of settings wanting to be live.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Though don't take that I don't agree as a hold up to finishing 3.1.

asfimport commented 13 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Though don't take that I don't agree as a hold up to finishing 3.1.

+1

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1