apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Refactoring of IndexWriter [LUCENE-2026] #3101

Open asfimport opened 14 years ago

asfimport commented 14 years ago

I've been thinking for a while about refactoring the IndexWriter into two main components.

One could be called a SegmentWriter and, as the name says, its job would be to write one particular index segment. The default one, just as today, would provide methods to add documents and would flush when its buffer is full. Other SegmentWriter implementations would do things like appending or copying external segments [what addIndexes*() currently does].

The second component's job would be to manage writing the segments file and merging/deleting segments. It would know about DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would provide hooks that allow users to manage external data structures and keep them in sync with Lucene's data during segment merges.

API-wise there are things we have to figure out, such as where the updateDocument() method would fit in, because its deletion part affects all segments, whereas the new document is only being added to the new segment.

Of course these should be lower-level APIs for things like parallel indexing and related use cases. That's why we should still provide easy-to-use APIs like today for people who don't need to care about per-segment ops during indexing. So the current IndexWriter could probably keep most of its APIs and delegate to the new classes.
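
To make the proposal concrete, here is a loose sketch of how the two components might look. This is purely illustrative -- the names and method signatures are hypothetical, not an actual Lucene API:

```java
// Hypothetical sketch only -- names and signatures are illustrative, not real Lucene APIs.
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.SegmentInfo;

// Writes one particular index segment (the default impl buffers docs and flushes when full;
// other impls could append/copy external segments, as addIndexes*() does today).
interface SegmentWriter {
  void addDocument(Document doc) throws IOException;
  SegmentInfo flush() throws IOException;
}

// Manages the segments file plus merging/deleting segments; knows about DeletionPolicy,
// MergePolicy and MergeScheduler, and could expose hooks for external data structures.
interface SegmentManager {
  void publish(SegmentInfo newSegment) throws IOException;
  void maybeMerge() throws IOException;
  void commit() throws IOException;
}
```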


Migrated from LUCENE-2026 by Michael Busch, 1 vote, updated May 09 2016 Linked issues:

asfimport commented 14 years ago

John Wang (migrated from JIRA)

+1

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1! IndexWriter has become immense.

I think we should also pull out ReaderPool?

asfimport commented 14 years ago

Michael Busch (migrated from JIRA)

I think we should also pull out ReaderPool?

+1!

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

We need the ability to see a segment write (and probably a deleted-doc-list write) as a discernible atomic operation. Right now it looks like several file writes, and we can't, say, redirect all files belonging to a certain segment to another Directory (well, not in a simple manner). 'Something' should sit between a Directory (or several Directories) and IndexWriter.

If we could do this, the current NRT search implementation would be largely obsoleted, innit? Just override the default impl of 'something' and send smaller segments to RAM, bigger ones to disk, copy RAM segments to disk asynchronously if we want to. Then we can use your grandma's IndexReader and IndexWriter, totally decoupled from each other, and have a blazing fast addDocument-commit-reopen turnaround.
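
For what it's worth, later Lucene releases shipped a Directory wrapper very much along these lines: NRTCachingDirectory keeps small, freshly flushed segments in RAM and writes larger files through to the delegate. A minimal usage sketch, assuming a recent Lucene version (the index path is hypothetical):

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

public class NrtCachingExample {
  public static void main(String[] args) throws IOException {
    Directory fsDir = FSDirectory.open(Paths.get("/path/to/index"));  // hypothetical path
    // Segments expected to be smaller than 5 MB are cached in RAM (up to 60 MB total);
    // everything bigger goes straight to the on-disk delegate.
    Directory dir = new NRTCachingDirectory(fsDir, 5.0, 60.0);
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // index documents as usual; small flushed segments stay in RAM until merged/committed
    }
  }
}
```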

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Oh, forgive me if I just said something stupid :)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think what you're describing is in fact the approach that #2390 is taking; it's doing the switching internally between the main Dir & a private RAM Dir.

But in my testing so far (#3137), it doesn't seem like it'll help performance much. Ie, the OS generally seems to do a fine job of putting those segments in RAM itself, by maintaining a write cache. The weirdness is: that only holds true if you flush the segments when they are tiny (once per second, every 100 docs, in my test) – not yet sure why that's the case. I'm going to re-run perf tests on a more mainstream OS (my tests are all OpenSolaris) and see if that strangeness still happens.

But I think you still need to not do commit() during the reopen.

I do think refactoring IW so that there is a separate component that keeps track of segments in the index may simplify NRT, in that you can go to that source for your current "segments file" even if that segments file is uncommitted. In such a world you could do something like IndexReader.open(SegmentState) and it would be able to open (and reopen) the real-time reader. It's just that it's seeing changes to the SegmentState done by the writer, even if they're not yet committed.
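
A rough sketch of what such a reader-from-writer-state API looks like in practice, written against the NRT reader support that exists in later releases (DirectoryReader.open(IndexWriter); IndexWriter.getReader() in the code under discussion) -- the helper names here are hypothetical:

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

public final class NrtReopen {
  private NrtReopen() {}

  /** Opens a reader over the writer's current (possibly uncommitted) segment state. */
  public static DirectoryReader openNrt(IndexWriter writer) throws IOException {
    return DirectoryReader.open(writer);
  }

  /** Refreshes the reader against the writer's latest state, still without a commit(). */
  public static DirectoryReader refresh(DirectoryReader current) throws IOException {
    DirectoryReader newer = DirectoryReader.openIfChanged(current);
    if (newer != null) {
      current.close();
      return newer;
    }
    return current;
  }
}
```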

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call. That sync call on memory-based Directory is noop.

And no, you really should commit() to be able to see stuff on reopen() :) My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

The original purpose of current NRT code was.. well.. let me remember.. NRT search! :) With per-segment caches and sync lag defeated you get the delay between doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled?? Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

*: Okay, there remains an issue of merges that piggyback on commits, so writing and committing one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go, fix your mergepolicies and have a thread that merges asynchronously.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If I understand everything right, with current uberfast reopens (thanks per-segment search), the only thing that makes index/commit/reopen cycle slow is the 'sync' call.

I agree, per-segment searching was the most important step towards NRT. It's a great step forward...

But the fsync call is a killer, so avoiding it in the NRT path is necessary. It's also very OS/FS dependent.

That sync call on memory-based Directory is noop.

Until you need to spillover to disk because your RAM buffer is full?

Also, if IW.commit() is called, I would expect any changes in RAM should be committed to the real dir (stable storage)?

And, going through RAM first will necessarily be a hit on indexing throughput (Jake estimates 10% hit in Zoie's case). Really, our current approach goes through RAM as well, in that OS's write cache (if the machine has spare RAM) will quickly accept the small index files & write them in the BG. It's not clear we can do better than the OS here...

And no, you really should commit() to be able to see stuff on reopen() My god, seeing changes that aren't yet committed - that violates the meaning of 'commit'.

Uh, this is an API that clearly states that its purpose is to search the uncommitted changes. If you really want to be "pure" transactional, don't use this API ;)

The original purpose of current NRT code was.. well.. let me remember.. NRT search! With per-segment caches and sync lag defeated you get the delay between doc being indexed and becoming searchable under tens of milliseconds. Is that not NRT enough to introduce tight coupling between classes that have absolutely no other reason to be coupled?? Lucene 4.0. Simplicity is our candidate! Vote for Simplicity!

In fact I favor our current approach because of its simplicity.

Have a look at #2390 (adds RAMDir as you're discussing), or, Zoie, which also adds the RAMDir and backgrounds resolving deleted docs – they add complexity to Lucene that I don't think is warranted.

My general feeling at this point is with per-segment searching, and fsync avoided, NRT performance is excellent.

We've explored a number of possible tweaks to improve it – writing first to RAMDir (#2390), resolving deletes in the foreground (#3122), using paged BitVector for deletions (#2600), Zoie (buffering segments in RAM & backgrounds resolving deletes), etc., but, based on testing so far, I don't see the justification for the added complexity.

*: Okay, there remains an issue of merges that piggyback on commits, so writing and committing one smallish segment suddenly becomes a time-consuming operation. But that's a completely separate issue. Go, fix your mergepolicies and have a thread that merges asynchronously.

This already runs in the BG by default. But warming the reader on the merged segment (before lighting it) is important (IW does this today).

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Until you need to spillover to disk because your RAM buffer is full?

No, the buffer is there only to decouple indexing from writing. It can be spilled over asynchronously without waiting for it to be filled up.

Okay, we agree on a zillion things, except the simplicity of the current NRT, and the approach to commit().

Good commit() behaviour consists of two parts:

  1. Everything commit()ed is guaranteed to be on disk.
  2. Until commit() is called, reading threads don't see new/updated records.

Now we want more speed, and are ready to sacrifice something if needed. You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

I say it's better to sacrifice the write guarantee. In the rare case the process/machine crashes, you can reindex the last few minutes' worth of docs. Now you don't have to hack into IW and write specialized readers. Hence, simplicity. You have only one straightforward writer, you have only one straightforward reader (which is nicely immutable and doesn't need any synchronization code).

In fact you don't even need to sacrifice the write guarantee. What was the reason for it? The only one I can come up with is that the thread that does writes and sync() is different from the thread that calls commit(). But commit() can return a Future. So the process goes as:

  • You index docs, nobody sees them, nor deletions.
  • You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
  • Background thread does write-to-disk+sync (NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, RAM cache or not, OS/RAID controller cache or not.

For back-compat purposes we can use another name for that Future-returning commit(), and the current commit() will just call this new method and wait on the Future returned.
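
A minimal sketch of that back-compat shape, assuming a simple wrapper around IndexWriter (the class and method names are hypothetical, not Lucene API):

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.index.IndexWriter;

// Hypothetical wrapper: commitAsync() returns a Future; the old-style commit() just waits on it.
class AsyncCommitter {
  private final IndexWriter writer;
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  AsyncCommitter(IndexWriter writer) {
    this.writer = writer;
  }

  /** Kicks off the commit (write + sync) in the background and returns a Future to wait on. */
  Future<?> commitAsync() {
    return executor.submit(() -> {
      writer.commit();
      return null;
    });
  }

  /** Back-compat style commit(): block until the background commit has finished. */
  void commit() throws Exception {
    commitAsync().get();
  }
}
```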

Okay, with that I'm probably shutting up on the topic until I can back myself up with code. Sadly, my current employer is happy with update lag in tens of seconds :)

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> I say it's better to sacrifice write guarantee.

I don't grok why sync is the default, especially given how sketchy hardware drivers are about obeying fsync:

But, beware: some hardware devices may in fact cache writes even during fsync, and return before the bits are actually on stable storage, to give the appearance of faster performance.

IMO, it should have been an option which defaults to false, to be enabled only by users who have the expertise to ensure that fsync() is actually doing what it advertises. But what's done is done (and Lucy will probably just do something different.)

With regard to Lucene NRT, though, turning sync() off would really help. If and when some sort of settings class comes about, an enableSync(boolean enabled) method seems like it would come in handy.

asfimport commented 14 years ago

Jake Mannix (migrated from JIRA)

Now we want more speed, and are ready to sacrifice something if needed.

You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

Chiming in here that of course, you don't need (ie there is a choice) to hack into the IW to do this. Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping speedy indexing and query performance. It just keeps the RAMDir outside encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

The API for this kind of thing doesn't have to be tightly coupled, and I would agree with you that it shouldn't be.
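
Not Zoie's actual code, but a very rough sketch of that user-land pattern: a small RAM index for fresh documents alongside the main on-disk index, searched together via MultiReader. It assumes both directories already contain an index, and the index path is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RamPlusDiskSearch {
  public static IndexSearcher openSearcher(Directory ramDir, Directory fsDir) throws IOException {
    // Fresh docs live in the RAM index; the bulk of the data is on disk.
    IndexReader ramReader = DirectoryReader.open(ramDir);
    IndexReader fsReader = DirectoryReader.open(fsDir);
    // Search both as one logical index; periodically the RAM index is flushed into
    // the disk index and the readers are reopened.
    return new IndexSearcher(new MultiReader(ramReader, fsReader));
  }

  public static void main(String[] args) throws IOException {
    Directory ramDir = new ByteBuffersDirectory();  // RAMDirectory in the Lucene of this era
    Directory fsDir = FSDirectory.open(Paths.get("/path/to/index"));  // hypothetical path
    IndexSearcher searcher = openSearcher(ramDir, fsDir);
    // ... run queries against searcher ...
  }
}
```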

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Until you need to spillover to disk because your RAM buffer is full?

No, buffer is there only to decouple indexing from writing. Can be spilt over asynchronously without waiting for it to be filled up.

But this is where things start to get complex... the devil is in the details here. How do you carry over your deletes? This spillover will take time – do you block all indexing while that's happening (not great)? Do you do it gradually (start spillover when half full, but still accept indexing)? Do you throttle things if index rate exceeds flush rate? How do you recover on exception?

NRT today lets the OS's write cache decide how to use RAM to speed up writing of these small files, which keeps things a lot simpler for us. I don't see why we should add complexity to Lucene to replicate what the OS is doing for us (NOTE: I don't really trust the OS in the reverse case... I do think Lucene should read into RAM the data structures that are important).

You decide to sacrifice new record (in)visibility. No choice, but to hack into IW to allow readers see its hot, fresh innards.

Now you don't have to hack into IW and write specialized readers.

Probably we'll just have to disagree here... NRT isn't a hack ;)

IW is already hanging onto completely normal segments. Ie, the index has been updated with these segments, just not yet published so outside readers can see it. All NRT does is let a reader see this private view.

The readers that an NRT reader exposes are normal SegmentReaders – it's just that rather than consult a segments_N on disk to get the segment metadata, they're pulled from IW's uncommitted in-memory SegmentInfos instance.

Yes we've talked about the "hot innards" solution – an IndexReader impl that can directly search DW's ram buffer – but that doesn't look necessary today, because performance of NRT is good with the simple solution we have now.

NRT reader also gains performance by carrying over deletes in RAM. We should eventually do the same thing with norms & field cache. No reason to write to disk, then right away read again.

  • You index docs, nobody sees them, nor deletions.
  • You call commit(), the docs/deletes are written down to memory (NRT case)/disk (non-NRT case). Right after calling commit() every newly reopened Reader is guaranteed to see your docs/deletes.
  • Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), and fires up the Future returned from commit(). At this point all data is guaranteed to be written and braced for a crash, ram cache or not, OS/raid controller cache or not.

But this is not a commit, if docs/deletes are written down into RAM? Ie, commit could return, then the machine could crash, and you've lost changes? Commit should go through to stable storage before returning? Maybe I'm just missing the big picture of what you're proposing here...

Also, you can build all this out on top of Lucene today? Zoie is a proof point of this. (Actually: how does your proposal differ from Zoie? Maybe that'd help shed light...).

I say it's better to sacrifice write guarantee. In the rare case the process/machine crashes, you can reindex last few minutes' worth of docs.

It is not that simple – if you skip the fsync, and OS crashes/you lose power, your index can easily become corrupt. The resulting CheckIndex -fix can easily need to remove large segments.

The OS's write cache makes no guarantees on the order in which the files you've written find their way to disk.

Another option (we've discussed this) would be journal file approach (ie transaction log, like most DBs use). You only have one file to fsync, and you replay to recover. But that'd be a big change for Lucene, would add complexity, and can be accomplished outside of Lucene if an app really wants to...

Let me try turning this around: in your componentization of SegmentReader, why does it matter who's tracking which components are needed to make up a given SR? In the IndexReader.open case, it's a SegmentInfos instance (obtained by loading the segments_N file from disk). In the NRT case, it's also a SegmentInfos instance (the one IW is privately keeping track of and only publishing on commit). At the component level, creating the SegmentReader should be no different?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I say it's better to sacrifice write guarantee.

I don't grok why sync is the default, especially given how sketchy hardware drivers are about obeying fsync:

But, beware: some hardware devices may in fact cache writes even during fsync, and return before the bits are actually on stable storage, to give the appearance of faster performance.

It's unclear how often this scare-warning is true in practice (scare warnings tend to spread very easily without concrete data); it's in the javadocs for completeness sake. I expect (though have no data to back this up...) that most OS/IO systems "out there" do properly implement fsync.

IMO, it should have been an option which defaults to false, to be enabled only by users who have the expertise to ensure that fsync() is actually doing what it advertises. But what's done is done (and Lucy will probably just do something different.)

I think that's a poor default (trades safety for performance), unless Lucy eg uses a transaction log so you can concretely bound what's lost on crash/power loss. Or, if you go back to autocommitting I guess...

If we did this in Lucene, you can have unbounded corruption. It's not just the last few minutes of updates...

So, I don't think we should even offer the option to turn it off. You can easily subclass your FSDir impl and make sync() a no-op if you really want to...
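
A sketch of that subclass-and-no-op-sync() idea, written here against the FilterDirectory wrapper from later Lucene releases (exact signatures vary a bit by version), with the obvious caveat that it trades crash safety for speed:

```java
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;

// Wraps any Directory and silently skips fsync. Unsafe by design: an OS crash or power
// loss can corrupt the index, exactly as discussed above.
class NoSyncDirectory extends FilterDirectory {
  NoSyncDirectory(Directory in) {
    super(in);
  }

  @Override
  public void sync(Collection<String> names) throws IOException {
    // Intentionally a no-op: commits return without waiting for stable storage.
  }
}
```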

With regard to Lucene NRT, though, turning sync() off would really help. If and when some sort of settings class comes about, an enableSync(boolean enabled) method seems like it would come in handy.

You don't need to turn off sync for NRT – that's the whole point. It gives you a reader without syncing the files. Really, this is your safety tradeoff – it means you can commit less frequently, since the NRT reader can search the latest updates. But, your app has complete control over how it wants to trade safety for performance.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Zoie is a completely user-land solution which modifies no IW/IR internals and yet achieves millisecond index-to-query-visibility turnaround while keeping speedy indexing and query performance. It just keeps the RAMDir outside encapsulated in an object (an IndexingSystem) which has IndexReaders built off of both the RAMDir and the FSDir, and hides the implementation details (in fact the IW itself) from the user.

Right, one can always not use NRT and build their own layers on top.

But, Zoie has a lot of code to accomplish this – the devil really is in the details of "simply write first to a RAMDir". This is why I'd like Earwin to look @ Zoie and clarify his proposed approach, in contrast...

Actually, here's a question: how quickly can Zoie turn around a commit()? Seems like it must take more time than Lucene, since it does extra stuff (flush RAM buffers to disk, materialize deletes) before even calling IW.commit.

At the end of the day, any NRT system has to trade safety for performance (bypass the sync call in the NRT reader)....

The API for this kind of thing doesn't have to be tightly coupled, and I would agree with you that it shouldn't be.

I don't consider NRT today to be a tight coupling (eg, the pending refactoring of IW would nicely separate it out). If we implement the IR that searches DW's RAM buffer, then I'd agree ;)

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> I think that's a poor default (trades safety for performance), unless Lucy eg uses a transaction log so you can concretely bound what's lost on crash/power loss. Or, if you go back to autocommitting I guess...

Search indexes should not be used for canonical data storage – they should be built on top of canonical data storage. Guarding against power failure induced corruption in a database is an imperative. Guarding against power failure induced corruption in a search index is a feature, not an imperative.

Users have many options for dealing with the potential for such corruption. You can go back to your canonical data store and rebuild your index from scratch when it happens. In a search cluster environment, you can rsync a known-good copy from another node. Potentially, you might enable fsync-before-commit and keep your own transaction log. However, if the time it takes to rebuild or recover an index from scratch would have caused you unacceptable downtime, you can't possibly be operating in a single-point-of-failure environment where a power failure could take you down anyway – so other recovery options are available to you.

Turning on fsync is only one step towards ensuring index integrity; others steps involve making decisions about hard drives, RAID arrays, failover strategies, network and off-site backups, etc, and are outside of our domain as library authors. We cannot meet the needs of users who need guaranteed index integrity on our own.

For everybody else, what turning on fsync by default achieves is to make an exceedingly rare event rarer. That's valuable, but not essential. My argument is that since the search indexes should not be used for canonical storage, and since fsync is not testably reliable and not sufficient on its own, it's a good engineering compromise to prioritize performance.

> If we did this in Lucene, you can have unbounded corruption. It's not just the last few minutes of updates...

Wasn't that a possibility under autocommit as well? All it takes is for the OS to finish flushing the new snapshot file to persistent storage before it finishes flushing a segment data file needed by that snapshot, and for the power failure to squeeze in between.

In practice, locality of reference is going to make the window very very small, since those two pieces of data will usually get written very close to each other on the persistent media.

I've seen a lot more messages to our user lists over the years about data corruption caused by bugs and misconfigurations than by power failures.

But really, that's as it should be. Ensuring data integrity to the degree required by a database is costly – it requires far more rigorous testing, and far more conservative development practices. If we accept that our indexes must never go corrupt, it will retard innovation.

Of course we should work very hard to prevent index corruption. However, I'm much more concerned about stuff like silent omission of search results due to overzealous, overly complex optimizations than I am about problems arising from power failures. When a power failure occurs, you know it – so you get the opportunity to fsck the disk, run checkIndex(), perform data integrity reconciliation tests against canonical storage, and if anything fails, take whatever recovery actions you deem necessary.

> You don't need to turn off sync for NRT - that's the whole point. It gives you a reader without syncing the files.

I suppose this is where Lucy and Lucene differ. Thanks to mmap and the near-instantaneous reader opens it has enabled, we don't need to keep a special reader alive. Since there's no special reader, the only way to get data to a search process is to go through a commit. But if we fsync on every commit, we'll drag down indexing responsiveness. Finishing the commit and returning control to client code as quickly as possible is a high priority for us.

Furthermore, I don't want us to have to write the code to support a near-real-time reader hanging off of IndexWriter a la Lucene. The architectural discussions have made for very interesting reading, but the design seems to be tricky to pull off, and implementation simplicity in core search code is a high priority for Lucy. It's better for Lucy to kill two birds with one stone and concentrate on making all index opens fast.

> Really, this is your safety tradeoff - it means you can commit less frequently, since the NRT reader can search the latest updates. But, your app has complete control over how it wants to trade safety for performance.

So long as fsync is an option, the app always has complete control, regardless of whether the default setting is fsync or no fsync.

If a Lucene app wanted to increase NRT responsiveness and throughput, and if absolute index integrity wasn't a concern because it had been addressed through other means (e.g. multi-node search cluster), would turning off fsync speed things up under any of the proposed designs?

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

I think large scale NRT installations may eventually require a distributed transaction log. The implementation details have yet to be determined; however, it could potentially solve the issue of data loss being discussed. One candidate is a combo of Zookeeper

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think that's a poor default (trades safety for performance), unless Lucy eg uses a transaction log so you can concretely bound what's lost on crash/power loss. Or, if you go back to autocommitting I guess...

Search indexes should not be used for canonical data storage - they should be built on top of canonical data storage.

I agree with that, in theory, but I think in practice it's too idealistic to force/expect apps to meet that ideal.

I expect for many apps it's a major cost to unexpectedly lose the search index on power loss / OS crash.

Users have many options for dealing with the potential for such corruption. You can go back to your canonical data store and rebuild your index from scratch when it happens. In a search cluster environment, you can rsync a known-good copy from another node. Potentially, you might enable fsync-before-commit and keep your own transaction log. However, if the time it takes to rebuild or recover an index from scratch would have caused you unacceptable downtime, you can't possibly be operating in a single-point-of-failure environment where a power failure could take you down anyway - so other recovery options are available to you.

Turning on fsync is only one step towards ensuring index integrity; others steps involve making decisions about hard drives, RAID arrays, failover strategies, network and off-site backups, etc, and are outside of our domain as library authors. We cannot meet the needs of users who need guaranteed index integrity on our own.

Yes, high availability apps will already take their measures to protect the search index / recovery process, going beyond fsync. EG, making a hot backup of a Lucene index is now straightforward.

For everybody else, what turning on fsync by default achieves is to make an exceedingly rare event rarer. That's valuable, but not essential. My argument is that since the search indexes should not be used for canonical storage, and since fsync is not testably reliable and not sufficient on its own, it's a good engineering compromise to prioritize performance.

Losing power to the machine, or OS crash, or the user doing a hard power down because OS isn't responding, I think are not actually that uncommon in an end user setting. Think of a desktop app embedding Lucene/Lucy...

If we did this in Lucene, you can have unbounded corruption. It's not just the last few minutes of updates...

Wasn't that a possibility under autocommit as well? All it takes is for the OS to finish flushing the new snapshot file to persistent storage before it finishes flushing a segment data file needed by that snapshot, and for the power failure to squeeze in between.

Not after #2120... autoCommit simply called commit() at certain opportune times (after finishing big merges), which does the right thing (I hope!). The segments file is not written until all files it references are sync'd.

In practice, locality of reference is going to make the window very very small, since those two pieces of data will usually get written very close to each other on the persistent media.

Not sure about that – it depends on how effectively the OS's write cache "preserves" that locality.

I've seen a lot more messages to our user lists over the years about data corruption caused by bugs and misconfigurations than by power failures.

I would agree, though, I think it may be a sampling problem... ie people whose machines crashed and they lost the search index would often not raise it on the list (vs say a persistent config issue that keeps leading to corruption).

But really, that's as it should be. Ensuring data integrity to the degree required by a database is costly - it requires far more rigorous testing, and far more conservative development practices. If we accept that our indexes must never go corrupt, it will retard innovation.

It's not really that costly, with NRT – you can get a searcher on the index without paying the commit cost. And now you can call commit however frequently you need to. Quickly turning around a new searcher, and how frequently you commit, are now independent.

Also, having the app explicitly decouple these two notions keeps the door open for future improvements. If we force absolutely all sharing to go through the filesystem then that limits the improvements we can make to NRT.

Of course we should work very hard to prevent index corruption. However, I'm much more concerned about stuff like silent omission of search results due to overzealous, overly complex optimizations than I am about problems arising from power failures. When a power failure occurs, you know it - so you get the opportunity to fsck the disk, run checkIndex(), perform data integrity reconciliation tests against canonical storage, and if anything fails, take whatever recovery actions you deem necessary.

Well... I think search performance is important, and we should pursue it even if we risk bugs.

You don't need to turn off sync for NRT - that's the whole point. It gives you a reader without syncing the files.

I suppose this is where Lucy and Lucene differ. Thanks to mmap and the near-instantaneous reader opens it has enabled, we don't need to keep a special reader alive. Since there's no special reader, the only way to get data to a search process is to go through a commit. But if we fsync on every commit, we'll drag down indexing responsiveness. Finishing the commit and returning control to client code as quickly as possible is a high priority for us.

NRT reader isn't that special – the only things that are different are 1) it loaded the segments_N "file" from IW instead of the filesystem, and 2) it uses a reader pool to "share" the underlying SegmentReaders with other places that have loaded them. I guess, if Lucy won't allow this, then, yes, forcing a commit in order to reopen is very costly, and so sacrificing safety is a tradeoff you have to make.

Alternatively, you could keep the notion "flush" (an unsafe commit) alive? You write the segments file, but make no effort to ensure its durability (and also preserve the last "true" commit). Then a normal IR.reopen suffices...

Furthermore, I don't want us to have to write the code to support a near-real-time reader hanging off of IndexWriter a la Lucene. The architectural discussions have made for very interesting reading, but the design seems to be tricky to pull off, and implementation simplicity in core search code is a high priority for Lucy. It's better for Lucy to kill two birds with one stone and concentrate on making all index opens fast.

But shouldn't you at least give an option for index durability? Even if we disagree about the default?

Really, this is your safety tradeoff - it means you can commit less frequently, since the NRT reader can search the latest updates. But, your app has complete control over how it wants to trade safety for performance.

So long as fsync is an option, the app always has complete control, regardless of whether the default setting is fsync or no fsync.

Well it is an "option" in Lucene – "it's just software" ;) I don't want to make it easy to be unsafe. Lucene shouldn't sacrifice safety of the index... and with NRT there's no need to make that tradeoff.

If a Lucene app wanted to increase NRT responsiveness and throughput, and if absolute index integrity wasn't a concern because it had been addressed through other means (e.g. multi-node search cluster), would turning off fsync speed things up under any of the proposed designs?

Yes, turning off fsync would speed things up – you could fall back to simple reopen and get good performance (NRT should still be faster since the readers are pooled). The "use RAMDir on top of Lucene" designs would be helped less since fsync is a noop in RAMDir.

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

>> Wasn't that a possibility under autocommit as well? All it takes is for the OS to finish flushing the new snapshot file to persistent storage before it finishes flushing a segment data file needed by that snapshot, and for the power failure to squeeze in between.

> Not after #2120... autoCommit simply called commit() at certain opportune times (after finishing big merges), which does the right thing (I hope!). The segments file is not written until all files it references are sync'd.

FWIW, autoCommit doesn't really have a place in Lucy's one-segment-per-indexing-session model.

Revisiting the #2120 threads, one passage stood out:

http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

This is why in a db system, the only file that is sync'd is the log file - all other files can be made "in sync" from the log file - and this file is normally striped for optimum write performance. Some systems have special "log file drives" (some even solid state, or battery backed ram) to aid the performance.

The fact that we have to sync all files instead of just one seems sub-optimal.

Yet Lucene is not well set up to maintain a transaction log. The very act of adding a document to Lucene is inherently lossy even if all fields are stored, because doc boost is not preserved.

> Also, having the app explicitly decouple these two notions keeps the door open for future improvements. If we force absolutely all sharing to go through the filesystem then that limits the improvements we can make to NRT.

However, Lucy has much more to gain going through the file system than Lucene does, because we don't necessarily incur JVM startup costs when launching a new process. The Lucene approach to NRT – specialized reader hanging off of writer – is constrained to a single process. The Lucy approach – fast index opens enabled by mmap-friendly index formats – is not.

The two approaches aren't mutually exclusive. It will be possible to augment Lucy with a specialized index reader within a single process. However, A) there seems to be a lot of disagreement about just how to integrate that reader, and B) there seem to be ways to bolt that functionality on top of the existing classes. Under those circumstances, I think it makes more sense to keep that feature external for now.

> Alternatively, you could keep the notion "flush" (an unsafe commit) alive? You write the segments file, but make no effort to ensure its durability (and also preserve the last "true" commit). Then a normal IR.reopen suffices...

That sounds promising. The semantics would differ from those of Lucene's flush(), which doesn't make changes visible.

We could implement this by somehow marking a "committed" snapshot and a "flushed" snapshot differently, either by adding an "fsync" property to the snapshot file that would be false after a flush() but true after a commit(), or by encoding the property within the snapshot filename. The file purger would have to ensure that all index files referenced by either the last committed snapshot or the last flushed snapshot were off limits. A rollback() would zap all changes since the last commit().

Such a scheme allows the top level app to avoid the costs of fsync while maintaining its own transaction log – perhaps with the optimizations suggested above (separate disk, SSD, etc).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

FWIW, autoCommit doesn't really have a place in Lucy's one-segment-per-indexing-session model.

Well, autoCommit just means "periodically call commit". So, if you decide to offer a commit() operation, then autoCommit would just wrap that? But, I don't think autoCommit should be offered... app should decide.

Revisiting the #2120 threads, one passage stood out:

http://www.gossamer-threads.com/lists/lucene/java-dev/54321#54321

This is why in a db system, the only file that is sync'd is the log file - all other files can be made "in sync" from the log file - and this file is normally striped for optimum write performance. Some systems have special "log file drives" (some even solid state, or battery backed ram) to aid the performance.

The fact that we have to sync all files instead of just one seems sub-optimal.

Yes, but, that cost is not on the reopen path, so it's much less important. Ie, the app can freely choose how frequently it wants to commit, completely independent from how often it needs to reopen.

Yet Lucene is not well set up to maintain a transaction log. The very act of adding a document to Lucene is inherently lossy even if all fields are stored, because doc boost is not preserved.

I don't see that those two statements are related.

One can "easily" (meaning, it's easily decoupled from core) make a transaction log on top of lucene – just serialize your docs/analzyer selection/etc to the log & sync it periodically.

But, that's orthogonal to what Lucene does & doesn't preserve in its index (and, yes, Lucene doesn't precisely preserve boosts).
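
As a rough illustration of that "serialize your docs to a log & sync it periodically" idea (nothing Lucene-specific; the class name and log format here are hypothetical):

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical append-only log kept alongside the index: each added doc is serialized by the
// application and fsync'd here, so anything not yet commit()ed to Lucene can be replayed.
class IndexingLog implements Closeable {
  private final FileOutputStream out;

  IndexingLog(File logFile) throws IOException {
    this.out = new FileOutputStream(logFile, true);  // append mode
  }

  void append(String serializedDoc) throws IOException {
    out.write((serializedDoc + "\n").getBytes(StandardCharsets.UTF_8));
    out.flush();
    out.getFD().sync();  // sync only this one file, not the whole index
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}
```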

Also, having the app explicitly decouple these two notions keeps the door open for future improvements. If we force absolutely all sharing to go through the filesystem then that limits the improvements we can make to NRT.

However, Lucy has much more to gain going through the file system than Lucene does, because we don't necessarily incur JVM startup costs when launching a new process. The Lucene approach to NRT - specialized reader hanging off of writer - is constrained to a single process. The Lucy approach - fast index opens enabled by mmap-friendly index formats - is not.

The two approaches aren't mutually exclusive. It will be possible to augment Lucy with a specialized index reader within a single process. However, A) there seems to be a lot of disagreement about just how to integrate that reader, and B) there seem to be ways to bolt that functionality on top of the existing classes. Under those circumstances, I think it makes more sense to keep that feature external for now.

Again: NRT is not a "specialized reader". It's a normal read-only DirectoryReader, just like you'd get from IndexReader.open, with the only difference being that it consulted IW to find which segments to open. Plus, it's pooled, so that if IW already has a given segment reader open (say because deletes were applied or merges are running), it's reused.

We've discussed making it specialized (eg directly searching DW's RAM buffer, caching recently flushed segments in RAM, special incremental-copy-on-write data structures for deleted docs, etc.) but so far these changes don't seem worthwhile.

The current approach to NRT is simple... I haven't yet seen performance gains strong enough to justify moving to "specialized readers".

Yes, Lucene's approach must be in the same JVM. But we get important gains from this – reusing a single reader (the pool), carrying over merged deletions directly in RAM (and eventually field cache & norms too – LUCENE-1785).

Instead, Lucy (by design) must do all sharing & access all index data through the filesystem (a decision, I think, could be dangerous), which will necessarily increase your reopen time. Maybe in practice that cost is small though... the OS write cache should keep everything fresh... but you still must serialize.

Alternatively, you could keep the notion "flush" (an unsafe commit) alive? You write the segments file, but make no effort to ensure its durability (and also preserve the last "true" commit). Then a normal IR.reopen suffices...

That sounds promising. The semantics would differ from those of Lucene's flush(), which doesn't make changes visible.

We could implement this by somehow marking a "committed" snapshot and a "flushed" snapshot differently, either by adding an "fsync" property to the snapshot file that would be false after a flush() but true after a commit(), or by encoding the property within the snapshot filename. The file purger would have to ensure that all index files referenced by either the last committed snapshot or the last flushed snapshot were off limits. A rollback() would zap all changes since the last commit().

Such a scheme allows the top level app to avoid the costs of fsync while maintaining its own transaction log - perhaps with the optimizations suggested above (separate disk, SSD, etc).

In fact, this would make Lucy's approach to NRT nearly identical to Lucene NRT.

The only difference is, instead of getting the current uncommitted segments_N via RAM, Lucy uses the filesystem. And, of course Lucy doesn't pool readers. So this is really a Lucy-ification of Lucene's approach to NRT.

So it has the same benefits as Lucene's NRT, ie, lets Lucy apps decouple decisions about safety (commit) and freshness (reopen turnaround time).

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> Well, autoCommit just means "periodically call commit". So, if you decide to offer a commit() operation, then autoCommit would just wrap that? But, I don't think autoCommit should be offered... app should decide.

Agreed, autoCommit had benefits under legacy Lucene, but wouldn't be important now. If we did add some sort of "automatic commit" feature, it would mean something else: commit every change instantly. But that's easy to implement via a wrapper, so there's no point cluttering the primary index writer class to support such a feature.
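
For completeness, that "commit every change instantly" wrapper really is a one-liner per operation; a hypothetical sketch (AutoCommittingWriter is an illustrative name, not a Lucene class):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Hypothetical wrapper: every change is immediately committed, i.e. durable and visible
// to any newly opened reader. Simple, but pays the full commit cost on every document.
class AutoCommittingWriter {
  private final IndexWriter writer;

  AutoCommittingWriter(IndexWriter writer) {
    this.writer = writer;
  }

  void addDocument(Document doc) throws IOException {
    writer.addDocument(doc);
    writer.commit();
  }
}
```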

> Again: NRT is not a "specialized reader". It's a normal read-only DirectoryReader, just like you'd get from IndexReader.open, with the only difference being that it consulted IW to find which segments to open. Plus, it's pooled, so that if IW already has a given segment reader open (say because deletes were applied or merges are running), it's reused.

Well, it seems to me that those two features make it special – particularly the pooling of SegmentReaders. You can't take advantage of that outside the context of IndexWriter:

> Yes, Lucene's approach must be in the same JVM. But we get important gains from this - reusing a single reader (the pool), carrying over merged deletions directly in RAM (and eventually field cache & norms too - LUCENE-1785).

Exactly. In my view, that's what makes that reader "special": unlike ordinary Lucene IndexReaders, this one springs into being with its caches already primed rather than in need of lazy loading.

But to achieve those benefits, you have to mod the index writing process. Those modifications are not necessary under the Lucy model, because the mere act of writing the index stores our data in the system IO cache.

> Instead, Lucy (by design) must do all sharing & access all index data through the filesystem (a decision, I think, could be dangerous), which will necessarily increase your reopen time.

Dangerous in what sense?

Going through the file system is a tradeoff, sure – but it's pretty nice to design your low-latency search app free from any concern about whether indexing and search need to be coordinated within a single process. Furthermore, if separate processes are your primary concurrency model, going through the file system is actually mandatory to achieve best performance on a multi-core box. Lucy won't always be used with multi-threaded hosts.

I actually think going through the file system is dangerous in a different sense: it puts pressure on the file format spec. The easy way to achieve IPC between writers and readers will be to dump stuff into one of the JSON files to support the killer-feature-du-jour – such as what I'm proposing with this "fsync" key in the snapshot file. But then we wind up with a bunch of crap cluttering up our index metadata files. I'm determined that Lucy will have a more coherent file format than Lucene, but with this IPC requirement we're setting our community up to push us in the wrong direction. If we're not careful, we could end up with a file format that's an unmaintainable jumble.

But you're talking performance, not complexity costs, right?

> Maybe in practice that cost is small though... the OS write cache should keep everything fresh... but you still must serialize.

Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records and 900 MB worth of sort cache data; opening a fresh searcher and loading all sort caches takes circa 21 ms.

There's room to improve that further – we haven't yet implemented IndexReader.reopen() – but that was fast enough to achieve what we wanted to achieve.

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records and 900 MB worth of sort cache data; opening a fresh searcher and loading all sort caches takes circa 21 ms.

Marvin, very cool! Are you using the mmap module you mentioned at ApacheCon?

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

Yes, this is using the sort cache model worked out this spring on lucy-dev. The memory mapping happens within FSFileHandle (LUCY-83). SortWriter and SortReader haven't made it into the Lucy repository yet.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Again: NRT is not a "specialized reader". It's a normal read-only DirectoryReader, just like you'd get from IndexReader.open, with the only difference being that it consulted IW to find which segments to open. Plus, it's pooled, so that if IW already has a given segment reader open (say because deletes were applied or merges are running), it's reused.

Well, it seems to me that those two features make it special - particularly the pooling of SegmentReaders. You can't take advantage of that outside the context of IndexWriter:

OK so maybe a little special ;) But, really that pooling should be factored out of IW. It's not writer specific.

Yes, Lucene's approach must be in the same JVM. But we get important gains from this - reusing a single reader (the pool), carrying over merged deletions directly in RAM (and eventually field cache & norms too - LUCENE-1785).

Exactly. In my view, that's what makes that reader "special": unlike ordinary Lucene IndexReaders, this one springs into being with its caches already primed rather than in need of lazy loading.

But to achieve those benefits, you have to mod the index writing process.

Mod the index writing, and the reader reopen, to use the shared pool. The pool in itself isn't writer specific.

Really the pool is just like what you tap into when you call reopen – that method looks at the current "pool" of already opened segments, sharing what it can.

Those modifications are not necessary under the Lucy model, because the mere act of writing the index stores our data in the system IO cache.

But, that's where Lucy presumably takes a perf hit. Lucene can share these in RAM, not using the filesystem as the intermediary (eg we do that today with deletions; norms/field cache/eventual CSF can do the same.) Lucy must go through the filesystem to share.

Instead, Lucy (by design) must do all sharing & access all index data through the filesystem (a decision, I think, could be dangerous), which will necessarily increase your reopen time.

Dangerous in what sense?

Going through the file system is a tradeoff, sure - but it's pretty nice to design your low-latency search app free from any concern about whether indexing and search need to be coordinated within a single process. Furthermore, if separate processes are your primary concurrency model, going through the file system is actually mandatory to achieve best performance on a multi-core box. Lucy won't always be used with multi-threaded hosts.

I actually think going through the file system is dangerous in a different sense: it puts pressure on the file format spec. The easy way to achieve IPC between writers and readers will be to dump stuff into one of the JSON files to support the killer-feature-du-jour - such as what I'm proposing with this "fsync" key in the snapshot file. But then we wind up with a bunch of crap cluttering up our index metadata files. I'm determined that Lucy will have a more coherent file format than Lucene, but with this IPC requirement we're setting our community up to push us in the wrong direction. If we're not careful, we could end up with a file format that's an unmaintainable jumble.

But you're talking performance, not complexity costs, right?

Mostly I was thinking performance, ie, trusting the OS to make good decisions about what should be RAM resident, when it has limited information...

But, also risky is that all important data structures must be "file-flat", though in practice that doesn't seem like an issue so far? The RAM resident things Lucene has – norms, deleted docs, terms index, field cache – seem to "cast" just fine to file-flat. If we switched to an FST for the terms index I guess that could get tricky...

Wouldn't shared memory be possible for process-only concurrent models? Also, what popular systems/environments have this requirement (only process level concurrency) today?

It's wonderful that Lucy can startup really fast, but, for most apps that's not nearly as important as searching/indexing performance, right? I mean, you start only once, and then you handle many, many searches / index many documents, with that process, usually?

Maybe in practice that cost is small though... the OS write cache should keep everything fresh... but you still must serialize.

Anecdotally, at Eventful one of our indexes is 5 GB with 16 million records and 900 MB worth of sort cache data; opening a fresh searcher and loading all sort caches takes circa 21 ms.

That's fabulously fast!

But you really need to also test search/indexing throughput, reopen time (I think) once that's online for Lucy...

There's room to improve that further - we haven't yet implemented IndexReader.reopen() - but that was fast enough to achieve what we wanted to achieve.

Is reopen even necessary in Lucy?

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> But, that's where Lucy presumably takes a perf hit. Lucene can share these in RAM, not using the filesystem as the intermediary (eg we do that today with deletions; norms/field cache/eventual CSF can do the same.) Lucy must go through the filesystem to share.

For a flush(), I don't think there's a significant penalty. The only extra costs Lucy will pay are the bookkeeping costs to update the file system state and to create the objects that read the index data. Those are real, but since we're skipping the fsync(), they're small. As far as the actual data, I don't see that there's a difference. Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM.

If we have to fsync(), there'll be a cost, but in Lucene you have to pay that same cost, too. Lucene expects to get around it with IndexWriter.getReader(). In Lucy, we'll get around it by having you call flush() and then reopen a reader somewhere, often in another process.

> Mostly I was thinking performance, ie, trusting the OS to make good decisions about what should be RAM resident, when it has limited information...

Right, for instance because we generally can't force the OS to pin term dictionaries in RAM, as discussed a while back. It's not an ideal situation, but Lucene's approach isn't bulletproof either, since Lucene's term dictionaries can get paged out too.

We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that.

> But, also risky is that all important data structures must be "file-flat", though in practice that doesn't seem like an issue so far?

It's a constraint. For instance, to support mmap, string sort caches currently require three "files" each: ords, offsets, and UTF-8 character data.

The compound file system makes the file proliferation bearable, though. And it's actually nice in a way to have data structures as named files, strongly separated from each other and persistent.

If we were willing to ditch portability, we could cast to arrays of structs in Lucy – but so far we've just used primitives. I'd like to keep it that way, since it would be nice if the core Lucy file format was at least theoretically compatible with a pure Java implementation. But Lucy plugins could break that rule and cast to structs if desired.

> The RAM resident things Lucene has - norms, deleted docs, terms index, field cache - seem to "cast" just fine to file-flat.

There are often benefits to keeping stuff "file-flat", particularly when the file-flat form is compressed. If we were to expand those sort caches to string objects, they'd take up more RAM than they do now.

I think the only significant drawback is security: we can't trust memory mapped data the way we can data which has been read into process RAM and checked on the way in. For instance, we need to perform UTF-8 sanity checking each time a string sort cache value escapes the controlled environment of the cache reader. If the sort cache value was instead derived from an existing string in process RAM, we wouldn't need to check it.

> If we switched to an FST for the terms index I guess that could get tricky...

Hmm, I haven't been following that. Too much work to keep up with those giganto patches for flex indexing, even though it's a subject I'm intimately acquainted with and deeply interested in. I plan to look it over when you're done and see if we can simplify it. :)

> Wouldn't shared memory be possible for process-only concurrent models?

IPC is a platform-compatibility nightmare. By restricting ourselves to communicating via the file system, we save ourselves oodles of engineering time. And on really boring, frustrating work, to boot.

> Also, what popular systems/environments have this requirement (only process level concurrency) today?

Perl's threads suck. Actually all threads suck. Perl's are just worse than average – and so many Perl binaries are compiled without them. Java threads suck less, but they still suck – look how much engineering time you folks blow on managing that stuff. Threads are a terrible programming model.

I'm not into the idea of forcing Lucy users to use threads. They should be able to use processes as their primary concurrency model if they want.

> It's wonderful that Lucy can startup really fast, but, for most apps that's not nearly as important as searching/indexing performance, right?

Depends.

Total indexing throughput in both Lucene and KinoSearch has been pretty decent for a long time. However, there's been a large gap between average index update performance and worst case index update performance, especially when you factor in sort cache loading. There are plenty of applications that may not have very high throughput requirements but where it may not be acceptable for an index update to take several seconds or several minutes every once in a while, even if it usually completes faster.

> I mean, you start only once, and then you handle many, many searches / index many documents, with that process, usually?

Sometimes the person who just performed the action that updated the index is the only one you care about. For instance, to use a feature request that came in from Slashdot a while back, if someone leaves a comment on your website, it's nice to have it available in the search index right away.

Consistently fast index update responsiveness makes personalization of the customer experience easier.

> But you really need to also test search/indexing throughput, reopen time (I think) once that's online for Lucy...

Naturally.

> Is reopen even necessary in Lucy?

Probably. If you have a boatload of segments and a boatload of fields, you might start to see file opening and metadata parsing costs come into play. If it turns out that for some indexes reopen() can knock down the time from say, 100 ms to 10 ms or less, I'd consider that sufficient justification.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But, that's where Lucy presumably takes a perf hit. Lucene can share these in RAM, not using the filesystem as the intermediary (eg we do that today with deletions; norms/field cache/eventual CSF can do the same.) Lucy must go through the filesystem to share.

For a flush(), I don't think there's a significant penalty. The only extra costs Lucy will pay are the bookkeeping costs to update the file system state and to create the objects that read the index data. Those are real, but since we're skipping the fsync(), they're small. As far as the actual data, I don't see that there's a difference.

But everything must go through the filesystem with Lucy...

Eg, with Lucene, deletions are not written to disk until you commit. Flush doesn't write the del file, merging doesn't, etc. The deletes are carried in RAM. We could (but haven't yet – NRT turnaround time is already plenty fast) do the same with norms, field cache, terms dict index, etc.

Reading from memory mapped RAM isn't any slower than reading from malloc'd RAM.

Right, for instance because we generally can't force the OS to pin term dictionaries in RAM, as discussed a while back. It's not an ideal situation, but Lucene's approach isn't bulletproof either, since Lucene's term dictionaries can get paged out too.

As long as the page is hot... (in both cases!).

But by using file-backed RAM (not malloc'd RAM), you're telling the OS it's OK if it chooses to swap it out. Sure, malloc'd RAM can be swapped out too... but that should be less frequent (and, we can control this behavior, somewhat, eg swappiness).

It's similar to using a weak v strong reference in java. By using file-backed RAM you tell the OS it's fair game for swapping.

If we have to fsync(), there'll be a cost, but in Lucene you have to pay that same cost, too. Lucene expects to get around it with IndexWriter.getReader(). In Lucy, we'll get around it by having you call flush() and then reopen a reader somewhere, often in another process.

In both cases, the availability of fresh data is decoupled from the fsync. In both cases, the indexing process has to be careful about dropping data on the floor before a commit() succeeds. In both cases, it's possible to protect against unbounded corruption by rolling back to the last commit.

The two approaches are basically the same, so, we get the same features ;)

It's just that Lucy uses the filesystem for sharing, and Lucene shares through RAM.

We're sure not going to throw away all the advantages of mmap and go back to reading data structures into process RAM just because of that.

I guess my confusion is what are all the other benefits of using file-backed RAM? You can efficiently use process only concurrency (though shared memory is technically an option for this too), and you have wicked fast open times (but, you still must warm, just like Lucene). What else? Oh maybe the ability to inform OS not to cache eg the reads done when merging segments. That's one I sure wish Lucene could use...

In exchange you risk the OS making poor choices about what gets swapped out (LRU policy is too simplistic... not all pages are created equal), must down cast all data structures to file-flat, must share everything through the filesystem, (perf hit to NRT).

I do love how pure the file-backed RAM approach is, but I worry that down the road it'll result in erratic search performance in certain app profiles.

But, also risky is that all important data structures must be "file-flat", though in practice that doesn't seem like an issue so far?

It's a constraint. For instance, to support mmap, string sort caches currently require three "files" each: ords, offsets, and UTF-8 character data.

Yeah, that you need 3 files for the string sort cache is a little spooky... that's 3X the chance of a page fault.

The compound file system makes the file proliferation bearable, though. And it's actually nice in a way to have data structures as named files, strongly separated from each other and persistent.

But the CFS construction must also go through the filesystem (like Lucene) right? So you still incur IO load of creating the small files, then 2nd pass to consolidate.

I agree there's a certain design purity to having the files clearly separate out the elements of the data structures, but if it means erratic search performance... function over form?

If we were willing to ditch portability, we could cast to arrays of structs in Lucy - but so far we've just used primitives. I'd like to keep it that way, since it would be nice if the core Lucy file format was at least theoretically compatible with a pure Java implementation. But Lucy plugins could break that rule and cast to structs if desired.

Someday we could make a Lucene codec that interacts with a Lucy index... would be a good exercise to go through to see if the flex API really is "flex" enough...

The RAM resident things Lucene has - norms, deleted docs, terms index, field cache - seem to "cast" just fine to file-flat.

There are often benefits to keeping stuff "file-flat", particularly when the file-flat form is compressed. If we were to expand those sort caches to string objects, they'd take up more RAM than they do now.

We're leaving them as UTF8 by default for Lucene (with the flex changes). Still, the terms index once loaded does have silly RAM overhead... we can cut that back a fair amount though.

I think the only significant drawback is security: we can't trust memory mapped data the way we can data which has been read into process RAM and checked on the way in. For instance, we need to perform UTF-8 sanity checking each time a string sort cache value escapes the controlled environment of the cache reader. If the sort cache value was instead derived from an existing string in process RAM, we wouldn't need to check it.

Sigh, that's a curious downside... so term decode intensive uses (merging, range queries, I guess maybe term dict lookup) take the brunt of that hit?

If we switched to an FST for the terms index I guess that could get tricky...

Hmm, I haven't been following that.

There's not much to follow – it's all just talk at this point. I don't think anyone's built a prototype yet ;)

Too much work to keep up with those giganto patches for flex indexing, even though it's a subject I'm intimately acquainted with and deeply interested in. I plan to look it over when you're done and see if we can simplify it.

And then we'll borrow back your simplifications ;) Lather, rinse, repeat.

Wouldn't shared memory be possible for process-only concurrent models?

IPC is a platform-compatibility nightmare. By restricting ourselves to communicating via the file system, we save ourselves oodles of engineering time. And on really boring, frustrating work, to boot.

I had assumed so too, but I was surprised that Python's multiprocessing module exposes a simple API for sharing objects from parent to forked child. It's at least a counter example (though, in all fairness, I haven't looked at the impl ;) ), ie, there seems to be some hope of containing shared memory under a consistent API.

I'm just pointing out that "going through the filesystem" isn't the only way to have efficient process-only concurrency. Shared memory is another option, but, yes it has tradeoffs.

Also, what popular systems/environments have this requirement (only process level concurrency) today?

Perl's threads suck. Actually all threads suck. Perl's are just worse than average - and so many Perl binaries are compiled without them. Java threads suck less, but they still suck - look how much engineering time you folks blow on managing that stuff. Threads are a terrible programming model.

I'm not into the idea of forcing Lucy users to use threads. They should be able to use processes as their primary concurrency model if they want.

Yes, working with threads is a nightmare (eg have a look at Java's memory model). I think the jury is still out (for our species) just how, long term, we'll make use of concurrency with the machines. I think we may need to largely take "time" out of our programming languages, eg switch to much more declarative code, or something... wanna port Lucy to Erlang?

But I'm not sure process only concurrency, sharing only via file-backed memory, is the answer either ;)

It's wonderful that Lucy can startup really fast, but, for most apps that's not nearly as important as searching/indexing performance, right?

Depends.

Total indexing throughput in both Lucene and KinoSearch has been pretty decent for a long time. However, there's been a large gap between average index update performance and worst case index update performance, especially when you factor in sort cache loading. There are plenty of applications that may not have very high throughput requirements but where it may not be acceptable for an index update to take several seconds or several minutes every once in a while, even if it usually completes faster.

I mean, you start only once, and then you handle many, many searches / index many documents, with that process, usually?

Sometimes the person who just performed the action that updated the index is the only one you care about. For instance, to use a feature request that came in from Slashdot a while back, if someone leaves a comment on your website, it's nice to have it available in the search index right away.

Consistently fast index update responsiveness makes personalization of the customer experience easier.

Turnaround time for Lucene NRT is already very fast, as is. After an immense merge, it'll be the worst, but if you warm the reader first, that won't be an issue.

Using Zoie you can make reopen time insanely fast (much faster than I think necessary for most apps), but at the expense of some expected hit to searching/indexing throughput. I don't think that's the right tradeoff for Lucene.

I suspect Lucy is making a similar tradeoff, ie, that search performance will be erratic due to page faults, at a smallish gain in reopen time.

Do you have any hard numbers on how much time it takes Lucene to load from a hot IO cache, populating its RAM resident data structures? I wonder in practice what extra cost we are really talking about... it's RAM to RAM "translation" of data structures (if the files are hot). FieldCache we just have to fix to stop doing uninversion... (ie we need CSF).

Is reopen even necessary in Lucy?

Probably. If you have a boatload of segments and a boatload of fields, you might start to see file opening and metadata parsing costs come into play. If it turns out that for some indexes reopen() can knock down the time from say, 100 ms to 10 ms or less, I'd consider that sufficient justification.

OK. Then, you are basically pooling your readers ;) Ie, you do allow in-process sharing, but only among readers.

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> I guess my confusion is what are all the other benefits of using file-backed RAM? You can efficiently use process only concurrency (though shared memory is technically an option for this too), and you have wicked fast open times (but, you still must warm, just like Lucene).

Processes are Lucy's primary concurrency model. ("The OS is our JVM.") Making process-only concurrency efficient isn't optional – it's a core concern.

> What else? Oh maybe the ability to inform OS not to cache eg the reads done when merging segments. That's one I sure wish Lucene could use...

Lightweight searchers mean architectural freedom.

Create 2, 10, 100, 1000 Searchers without a second thought – as many as you need for whatever app architecture you just dreamed up – then destroy them just as effortlessly. Add another worker thread to your search server without having to consider the RAM requirements of a heavy searcher object. Create a command-line app to search a documentation index without worrying about daemonizing it. Etc.

If your normal development pattern is a single monolithic Java process, then that freedom might not mean much to you. But with their low per-object RAM requirements and fast opens, lightweight searchers are easy to use within a lot of other development patterns. For example: lightweight searchers work well for maxing out multiple CPU cores under process-only concurrency.

> In exchange you risk the OS making poor choices about what gets swapped out (LRU policy is too simplistic... not all pages are created equal),

The Linux virtual memory system, at least, is not a pure LRU. It utilizes a page aging algo which prioritizes pages that have historically been accessed frequently even when they have not been accessed recently:

http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

The default action when a page is first allocated, is to give it an initial age of 3. Each time it is touched (by the memory management subsystem) its age is increased by 3 to a maximum of 20. Each time the Kernel swap daemon runs it ages pages, decrementing their age by 1.

And while that system may not be ideal from our standpoint, it's still pretty good. In general, the operating system's virtual memory scheme is going to work fine as designed, for us and everyone else, and minimize memory availability wait times.
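Just as a toy illustration of the rule quoted above (hypothetical names, nothing more than the described heuristic):

```java
/** Toy model of the page-aging rule described above; purely illustrative. */
final class Page {
  private static final int INITIAL_AGE = 3, TOUCH_BONUS = 3, MAX_AGE = 20;
  private int age = INITIAL_AGE;

  void touch() { age = Math.min(age + TOUCH_BONUS, MAX_AGE); }  // page accessed again
  void decay() { age = Math.max(age - 1, 0); }                  // swap daemon pass
  boolean swapCandidate() { return age == 0; }                  // assumption: aged-out pages go first
}
```

Frequently touched pages sit near the cap, so they survive many swap-daemon passes even after access stops.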

When will swapping out the term dictionary be a problem?

The only situation we're talking about is infrequent queries against large indexes on busy boxes where RAM isn't abundant. Under those circumstances, it might be noticeable that Lucy's term dictionary gets paged out somewhat sooner than Lucene's.

But in general, if the term dictionary gets paged out, so what? Nobody was using it. Maybe nobody will make another query against that index until next week. Maybe the OS made the right decision.

OK, so there's a vulnerable bubble where the query rate against a large index is neither too fast nor too slow, on busy machines where RAM isn't abundant. I don't think that bubble ought to drive major architectural decisions.

Let me turn your question on its head. What does Lucene gain in return for the slow index opens and large process memory footprint of its heavy searchers?

> I do love how pure the file-backed RAM approach is, but I worry that down the road it'll result in erratic search performance in certain app profiles.

If necessary, there's a straightforward remedy: slurp the relevant files into RAM at object construction rather than mmap them. The rest of the code won't know the difference between malloc'd RAM and mmap'd RAM. The slurped files won't take up any more space than the analogous Lucene data structures; more likely, they'll take up less.

That's the kind of setting we'd hide away in the IndexManager class rather than expose as prominent API, and it would be a hint to index components rather than an edict.
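A rough Java analogue of that hint (a hypothetical helper, not an actual IndexManager API): memory-map by default, or slurp into process RAM when the hint asks for it, with callers seeing a plain ByteBuffer either way:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class IndexFileOpener {
  /** Hypothetical hint: mmap (file-backed RAM) by default, slurp (process RAM) on request. */
  static ByteBuffer open(Path file, boolean slurp) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      if (!slurp) {
        return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());  // file-backed RAM
      }
      ByteBuffer buf = ByteBuffer.allocate((int) ch.size());  // assumes file < 2 GB for brevity
      while (buf.hasRemaining() && ch.read(buf) >= 0) { }     // read fully into heap RAM
      buf.flip();
      return buf;                                             // readers can't tell the difference
    }
  }
}
```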

> Yeah, that you need 3 files for the string sort cache is a little spooky... that's 3X the chance of a page fault.

Not when using the compound format.

> But the CFS construction must also go through the filesystem (like Lucene) right? So you still incur IO load of creating the small files, then 2nd pass to consolidate.

Yes.

> I think we may need to largely take "time" out of our programming languages, eg switch to much more declarative code, or something... wanna port Lucy to Erlang?
>
> But I'm not sure process only concurrency, sharing only via file-backed memory, is the answer either

I think relying heavily on file-backed memory is particularly appropriate for Lucy because the write-once file format works well with MAP_SHARED memory segments. If files were being modified and had to be protected with semaphores, it wouldn't be as sweet a match.

Focusing on process-only concurrency also works well for Lucy because host threading models differ substantially and so will only be accessible via a generalized interface from the Lucy C core. It will be difficult to tune threading performance through that layer of indirection – I'm guessing beyond the ability of most developers since few will be experts in multiple host threading models. In contrast, expertise in process level concurrency will be easier to come by and to nourish.

> Using Zoie you can make reopen time insanely fast (much faster than I think necessary for most apps), but at the expense of some expected hit to searching/indexing throughput. I don't think that's the right tradeoff for Lucene.

But as Jake pointed out early in the thread, Zoie achieves those insanely fast reopens without tight coupling to IndexWriter and its components. The auxiliary RAM index approach is well proven.

> Do you have any hard numbers on how much time it takes Lucene to load from a hot IO cache, populating its RAM resident data structures?

Hmm, I don't spend a lot of time working with Lucene directly, so I might not be the person most likely to have data like that at my fingertips. Maybe that McCandless dude can help you out, he runs a lot of benchmarks. ;)

Or maybe ask the Solr folks? I see them on solr-user all the time talking about "MaxWarmingSearchers". ;)

> OK. Then, you are basically pooling your readers ;) Ie, you do allow in-process sharing, but only among readers.

Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for each new segment, but they would be private to each parent PolyReader. So if you reopened two IndexReaders at the same time after e.g. segment "seg_12" had been added, each would create a new, private SegReader for "seg_12".

Edit: updated to correct assertions about virtual memory performance with small indexes.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Processes are Lucy's primary concurrency model. ("The OS is our JVM.") Making process-only concurrency efficient isn't optional - it's a core concern.

OK

Lightweight searchers mean architectural freedom.

Create 2, 10, 100, 1000 Searchers without a second thought - as many as you need for whatever app architecture you just dreamed up - then destroy them just as effortlessly. Add another worker thread to your search server without having to consider the RAM requirements of a heavy searcher object. Create a command-line app to search a documentation index without worrying about daemonizing it. Etc.

This is definitely neat.

The Linux virtual memory system, at least, is not a pure LRU. It utilizes a page aging algo which prioritizes pages that have historically been accessed frequently even when they have not been accessed recently:

http://sunsite.nus.edu.sg/LDP/LDP/tlk/node40.html

Very interesting – thanks. So it also factors in how much the page was used in the past, not just how long it's been since the page was last used.

When will swapping out the term dictionary be a problem?

For indexes where queries are made frequently, no problem. For systems with plenty of RAM, no problem. For systems that aren't very busy, no problem. For small indexes, no problem. The only situation we're talking about is infrequent queries against large indexes on busy boxes where RAM isn't abundant. Under those circumstances, it might be noticeable that Lucy's term dictionary gets paged out somewhat sooner than Lucene's.

Even smallish indexes can see the pages swapped out? I'd think at low-to-moderate search traffic, any index could be at risk, depending on whether other stuff running on the machine wants RAM or IO cache.

But in general, if the term dictionary gets paged out, so what? Nobody was using it. Maybe nobody will make another query against that index until next week. Maybe the OS made the right decision.

You can't afford many page faults until the latency becomes very apparent (until we're all on SSDs... at which point this may all be moot).

Right – the metric that the swapper optimizes is overall efficient use of the machine's resources.

But I think that's often a poor metric for search apps... I think consistency on the search latency is more important, though I agree it depends very much on the app.

I don't like the same behavior in my desktop – when I switch to my mail client, I don't want to wait 10 seconds for it to swap the pages back in.

Let me turn your question on its head. What does Lucene gain in return for the slow index opens and large process memory footprint of its heavy searchers?

Consistency in the search time. Assuming the OS doesn't swap our pages out...

And of course Java pretty much forces threads-as-concurrency (JVM startup time, hotspot compilation, are costly).

If necessary, there's a straightforward remedy: slurp the relevant files into RAM at object construction rather than mmap them. The rest of the code won't know the difference between malloc'd RAM and mmap'd RAM. The slurped files won't take up any more space than the analogous Lucene data structures; more likely, they'll take up less.

That's the kind of setting we'd hide away in the IndexManager class rather than expose as prominent API, and it would be a hint to index components rather than an edict.

Right, this is how Lucy would force warming.

Yeah, that you need 3 files for the string sort cache is a little spooky... that's 3X the chance of a page fault.

Not when using the compound format.

But, even within that CFS file, these three sub-files will not be local? Ie you'll still have to hit three pages per "lookup" right?

I think relying heavily on file-backed memory is particularly appropriate for Lucy because the write-once file format works well with MAP_SHARED memory segments. If files were being modified and had to be protected with semaphores, it wouldn't be as sweet a match.

Write-once is good for Lucene too.

Focusing on process-only concurrency also works well for Lucy because host threading models differ substantially and so will only be accessible via a generalized interface from the Lucy C core. It will be difficult to tune threading performance through that layer of indirection - I'm guessing beyond the ability of most developers since few will be experts in multiple host threading models. In contrast, expertise in process level concurrency will be easier to come by and to nourish.

I'm confused by this – eg Python does a great job presenting a simple threads interface and implementing it on major OSs. And it seems like Lucy would not need anything crazy-os-specific wrt threads?

Do you have any hard numbers on how much time it takes Lucene to load from a hot IO cache, populating its RAM resident data structures?

Hmm, I don't spend a lot of time working with Lucene directly, so I might not be the person most likely to have data like that at my fingertips. Maybe that McCandless dude can help you out, he runs a lot of benchmarks.

Hmm ;) I'd guess that field cache is slowish; deleted docs & norms are very fast; terms index is somewhere in between.

Or maybe ask the Solr folks? I see them on solr-user all the time talking about "MaxWarmingSearchers".

Hmm – not sure what's up with that. Looks like maybe it's the auto-warming that might happen after a commit.

OK. Then, you are basically pooling your readers ;) Ie, you do allow in-process sharing, but only among readers.

Not sure about that. Lucy's IndexReader.reopen() would open new SegReaders for each new segment, but they would be private to each parent PolyReader. So if you reopened two IndexReaders at the same time after e.g. segment "seg_12" had been added, each would create a new, private SegReader for "seg_12".

You're right, you'd get two readers for seg_12 in that case. By "pool" I meant you're tapping into all the sub-readers that the existing reader has opened – the reader is your pool of sub-readers.

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> Very interesting - thanks. So it also factors in how much the page was used in the past, not just how long it's been since the page was last used.

In theory, I think that means the term dictionary will tend to be favored over the posting lists. In practice... hard to say, it would be difficult to test. :(

> Even smallish indexes can see the pages swapped out?

Yes, you're right – the wait time to get at a small term dictionary isn't necessarily small. I've amended my previous post, thanks.

> And of course Java pretty much forces threads-as-concurrency (JVM startup time, hotspot compilation, are costly).

Yes. Java does a lot of stuff that most operating systems can also do, but of course provides a coherent platform-independent interface. In Lucy we're going to try to go back to the OS for some of the stuff that Java likes to take over – provided that we can develop a sane genericized interface using configuration probing and #ifdefs.

It's nice that as long as the box is up our OS-as-JVM is always running, so we don't have to worry about its (quite lengthy) startup time.

> Right, this is how Lucy would force warming.

I think slurp-instead-of-mmap is orthogonal to warming, because we can warm file-backed RAM structures by forcing them into the IO cache, using either the cat-to-dev-null trick or something more sophisticated. The slurp-instead-of-mmap setting would cause warming as a side effect, but the main point would be to attempt to persuade the virtual memory system that certain data structures should have a higher status and not be paged out as quickly.
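A small sketch of that kind of warming in Java, under the assumption that touching the mapped pages is enough to pull a file into the IO cache (MappedByteBuffer.load() is a best-effort request, not a guarantee):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class Warmer {
  /** Touch every page of a file so later lookups don't pay the fault cost. */
  static void warm(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // Assumes the file fits in one mapping (< 2 GB) to keep the sketch short.
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      buf.load();  // best-effort request to bring the mapped region into physical memory
    }
  }
}
```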

> But, even within that CFS file, these three sub-files will not be local? Ie you'll still have to hit three pages per "lookup" right?

They'll be next to each other in the compound file because CompoundFileWriter orders them alphabetically. For big segments, though, you're right that they won't be right next to each other, and you could possibly incur as many as three page faults when retrieving a sort cache value.

But what are the alternatives for variable width data like strings? You need the ords array anyway for efficient comparisons, so what's left are the offsets array and the character data.

An array of String objects isn't going to have better locality than one solid block of memory dedicated to offsets and another solid block of memory dedicated to file data, and it's no fewer derefs even if the string object stores its character data inline – more if it points to a separate allocation (like Lucy's CharBuf does, since it's mutable).

For each sort cache value lookup, you're going to need to access two blocks of memory.

I think the locality costs should be approximately the same... have I missed anything?
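To make the memory touches concrete, here is a minimal Java sketch of the file-flat layout being described (ords, offsets, and one packed block of UTF-8 character data), with hypothetical names rather than either project's actual classes; comparisons touch only the ords block, while fetching a value touches the offsets block and the character-data block:

```java
import java.nio.ByteBuffer;

/** Hypothetical file-flat string sort cache: ords, offsets, packed UTF-8 data. */
final class StringSortCache {
  private final int[] ords;       // per-document ordinal (rank of the doc's value in sort order)
  private final int[] offsets;    // per-ordinal start offset into utf8, plus an end sentinel
  private final ByteBuffer utf8;  // packed character data (heap or mmap'd, same code either way)

  StringSortCache(int[] ords, int[] offsets, ByteBuffer utf8) {
    this.ords = ords; this.offsets = offsets; this.utf8 = utf8;
  }

  /** Comparisons only need the ords block. */
  int compare(int docA, int docB) { return Integer.compare(ords[docA], ords[docB]); }

  /** Fetching a value touches the offsets block and the character-data block. */
  byte[] value(int docId) {
    int ord = ords[docId];
    int start = offsets[ord], end = offsets[ord + 1];
    byte[] out = new byte[end - start];
    ByteBuffer slice = utf8.duplicate();  // don't disturb shared position/limit
    slice.position(start);
    slice.limit(end);
    slice.get(out);
    return out;
  }
}
```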

> Write-once is good for Lucene too.

Hellyeah.

> And it seems like Lucy would not need anything crazy-os-specific wrt threads?

It depends on how many classes we want to make thread-safe, and it's not just the OS, it's the host.

The bare minimum is simply to make Lucy thread-safe as a library. That's pretty close, because Lucy studiously avoided global variables whenever possible. The only problems that have to be addressed are the VTable_registry Hash, race conditions when creating new subclasses via dynamic VTable singletons, and refcounts on the VTable objects themselves.

Once those issues are taken care of, you'll be able to use Lucy objects in separate threads with no problem, e.g. one Searcher per thread.

However, if you want to share Lucy objects (other than VTables) across threads, all of a sudden we have to start thinking about "synchronized", "volatile", etc. Such constructs may not be efficient or even possible under some threading models.

> Hmm I'd guess that field cache is slowish; deleted docs & norms are very fast; terms index is somewhere in between.

That jibes with my own experience. So maybe consider file-backed sort caches in Lucene, while keeping the status quo for everything else?

> You're right, you'd get two readers for seg_12 in that case. By "pool" I meant you're tapping into all the sub-readers that the existing reader has opened - the reader is your pool of sub-readers.

Each unique SegReader will also have dedicated "sub-reader" objects: two "seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12" PostingsReaders, etc. However, all those sub-readers will share the same file-backed RAM data, so in that sense they're pooled.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Very interesting - thanks. So it also factors in how much the page was used in the past, not just how long it's been since the page was last used.

In theory, I think that means the term dictionary will tend to be favored over the posting lists. In practice... hard to say, it would be difficult to test.

Right... though, I think the top "trunks" frequently used by the binary search, will stay hot. But as you get deeper into the terms index, it's not as clear.

And of course Java pretty much forces threads-as-concurrency (JVM startup time, hotspot compilation, are costly).

Yes. Java does a lot of stuff that most operating systems can also do, but of course provides a coherent platform-independent interface. In Lucy we're going to try to go back to the OS for some of the stuff that Java likes to take over - provided that we can develop a sane genericized interface using configuration probing and #ifdefs.

It's nice that as long as the box is up our OS-as-JVM is always running, so we don't have to worry about its (quite lengthy) startup time.

OS as JVM is a nice analogy. Java of course gets in the way, too, like we cannot properly set IO priorities, we can't give hints to the OS to tell it not to cache certain reads/writes (ie segment merging), can't pin pages ;), etc.

Right, this is how Lucy would force warming.

I think slurp-instead-of-mmap is orthogonal to warming, because we can warm file-backed RAM structures by forcing them into the IO cache, using either the cat-to-dev-null trick or something more sophisticated. The slurp-instead-of-mmap setting would cause warming as a side effect, but the main point would be to attempt to persuade the virtual memory system that certain data structures should have a higher status and not be paged out as quickly.

Woops, sorry, I misread – now I understand. You can easily make certain files ram resident, and then be like Lucene (except the data structures are more compact). Nice.

But, even within that CFS file, these three sub-files will not be local? Ie you'll still have to hit three pages per "lookup" right?

They'll be next to each other in the compound file because CompoundFileWriter orders them alphabetically. For big segments, though, you're right that they won't be right next to each other, and you could possibly incur as many as three page faults when retrieving a sort cache value.

But what are the alternatives for variable width data like strings? You need the ords array anyway for efficient comparisons, so what's left are the offsets array and the character data.

An array of String objects isn't going to have better locality than one solid block of memory dedicated to offsets and another solid block of memory dedicated to file data, and it's no fewer derefs even if the string object stores its character data inline - more if it points to a separate allocation (like Lucy's CharBuf does, since it's mutable).

For each sort cache value lookup, you're going to need to access two blocks of memory.

With the array of String objects, the first is the memory block dedicated to the array, and the second is the memory block dedicated to the String object itself, which contains the character data. With the file-backed block sort cache, the first memory block is the offsets array, and the second is the character data array. I think the locality costs should be approximately the same... have I missed anything?

You're right, Lucene risks 3 (ord array, String array, String object) page faults on each lookup as well.

Actually why can't ord & offset be one, for the string sort cache? Ie, if you write your string data in sort order, then the offsets are also in sort order? (I think we may have discussed this already?)
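One way to read that suggestion, purely as a sketch with hypothetical names: if the character data is written in sort order, the per-document array can store byte offsets directly, and comparing offsets orders documents the same way comparing ords would (value retrieval would still need lengths or a sentinel, which this sketch omits):

```java
/**
 * Sketch of the "ord & offset could be one" idea: with character data written in
 * sort order, per-doc offsets sort the same way as per-doc ords, so one array fewer.
 * Hypothetical layout, not Lucene's or Lucy's actual format.
 */
final class OffsetSortedCache {
  private final int[] docOffsets;  // per-document start offset into the sorted UTF-8 block
  private final byte[] utf8;       // character data, written in sort order

  OffsetSortedCache(int[] docOffsets, byte[] utf8) {
    this.docOffsets = docOffsets;
    this.utf8 = utf8;
  }

  /** Offsets are monotone in sort order, so this matches an ord comparison. */
  int compare(int docA, int docB) {
    return Integer.compare(docOffsets[docA], docOffsets[docB]);
  }
}
```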

And it seems like Lucy would not need anything crazy-os-specific wrt threads?

It depends on how many classes we want to make thread-safe, and it's not just the OS, it's the host.

The bare minimum is simply to make Lucy thread-safe as a library. That's pretty close, because Lucy studiously avoided global variables whenever possible. The only problems that have to be addressed are the VTable_registry Hash, race conditions when creating new subclasses via dynamic VTable singletons, and refcounts on the VTable objects themselves.

Once those issues are taken care of, you'll be able to use Lucy objects in separate threads with no problem, e.g. one Searcher per thread.

However, if you want to share Lucy objects (other than VTables) across threads, all of a sudden we have to start thinking about "synchronized", "volatile", etc. Such constructs may not be efficient or even possible under some threading models.

OK it is indeed hairy. You don't want to have to create Lucy's equivalent of the JMM...

Hmm I'd guess that field cache is slowish; deleted docs & norms are very fast; terms index is somewhere in between.

That jibes with my own experience. So maybe consider file-backed sort caches in Lucene, while keeping the status quo for everything else?

Perhaps, but it'd still make me nervous ;) When we get CSF (#2308) online we should make it pluggable enough so that one could create an mmap impl.

You're right, you'd get two readers for seg_12 in that case. By "pool" I meant you're tapping into all the sub-readers that the existing reader have opened - the reader is your pool of sub-readers.

Each unique SegReader will also have dedicated "sub-reader" objects: two "seg_12" SegReaders means two "seg_12" DocReaders, two "seg_12" PostingsReaders, etc. However, all those sub-readers will share the same file-backed RAM data, so in that sense they're pooled.

OK

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> we can't give hints to the OS to tell it not to cache certain reads/writes (ie segment merging),

For what it's worth, we haven't really solved that problem in Lucy either. The sliding window abstraction we wrapped around mmap/MapViewOfFile largely solved the problem of running out of address space on 32-bit operating systems. However, there's currently no way to invoke madvise through Lucy's IO abstraction layer – it's a little tricky with compound files.

Linux, at least, requires that the buffer supplied to madvise be page-aligned. So, say we're starting off on a posting list, and we want to communicate to the OS that it should treat the region we're about to read as MADV_SEQUENTIAL. If the start of the postings file is in the middle of a 4k page and the file right before it is a term dictionary, we don't want to indicate that that region should be treated as sequential.

I'm not sure how to solve that problem without violating the encapsulation of the compound file model. Hmm, maybe we could store metadata about the virtual files indicating usage patterns (sequential, random, etc.)? Since files are generally part of dedicated data structures whose usage patterns are known at index time.

Or maybe we just punt on that use case and worry only about segment merging.

Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell the OS that it's free to recycle any memory pages associated with it?
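On the page-alignment constraint a couple of paragraphs up, the arithmetic itself is simple; here is a sketch (in Java for readability, though the actual madvise call would be issued from C) that shrinks an advice region inward to page boundaries so it never covers a neighboring virtual file:

```java
/** Shrink [start, end) inward to page boundaries before advising on it. */
final class PageAlign {
  static final long PAGE = 4096;  // assumption: 4 KB pages; must be a power of two

  static long alignedStart(long start) { return (start + PAGE - 1) & ~(PAGE - 1); }  // round up
  static long alignedEnd(long end)     { return end & ~(PAGE - 1); }                 // round down

  /** True if anything is left to advise on after shrinking inward. */
  static boolean worthAdvising(long start, long end) {
    return alignedEnd(end) > alignedStart(start);
  }
}
```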

> Actually why can't ord & offset be one, for the string sort cache? Ie, if you write your string data in sort order, then the offsets are also in sort order? (I think we may have discussed this already?)

Right, we discussed this on lucy-dev last spring:

http://markmail.org/message/epc56okapbgit5lw

Incidentally, some of this thread replays our exchange at the top of #2532 from a year ago. It was fun to go back and reread that: in the interim, we've implemented segment-centric search and memory mapped field caches and term dictionaries, both of which were first discussed back then. :)

Ords are great for low cardinality fields of all kinds, but become less efficient for high cardinality primitive numeric fields. For simplicity's sake, the prototype implementation of mmap'd field caches in KS always uses ords.

> You don't want to have to create Lucy's equivalent of the JMM...

The more I think about making Lucy classes thread safe, the harder it seems. :( I'd like to make it possible to share a Schema across threads, for instance, but that means all its Analyzers, etc have to be thread-safe as well, which isn't practical when you start getting into contributed subclasses.

Even if we succeed in getting Folders and FileHandles thread safe, it will be hard for the user to keep track of what they can and can't do across threads. "Don't share anything" is a lot easier to understand.

We reap a big benefit by making Lucy's metaclass infrastructure thread-safe. Beyond that, seems like there's a lot of pain for little gain.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

For what it's worth, we haven't really solved that problem in Lucy either. The sliding window abstraction we wrapped around mmap/MapViewOfFile largely solved the problem of running out of address space on 32-bit operating systems. However, there's currently no way to invoke madvise through Lucy's IO abstraction layer - it's a little tricky with compound files.

Linux, at least, requires that the buffer supplied to madvise be page-aligned. So, say we're starting off on a posting list, and we want to communicate to the OS that it should treat the region we're about to read as MADV_SEQUENTIAL. If the start of the postings file is in the middle of a 4k page and the file right before it is a term dictionary, we don't want to indicate that that region should be treated as sequential.

I'm not sure how to solve that problem without violating the encapsulation of the compound file model. Hmm, maybe we could store metadata about the virtual files indicating usage patterns (sequential, random, etc.)? Since files are generally part of dedicated data structures whose usage patterns are known at index time.

Or maybe we just punt on that use case and worry only about segment merging.

Storing metadata seems OK. It'd be optional for codecs to declare that...

Hmm, wouldn't the act of deleting a file (and releasing all file descriptors) tell the OS that it's free to recycle any memory pages associated with it?

It better!

Actually why can't ord & offset be one, for the string sort cache? Ie, if you write your string data in sort order, then the offsets are also in sort order? (I think we may have discussed this already?)

Right, we discussed this on lucy-dev last spring:

http://markmail.org/message/epc56okapbgit5lw

OK I'll go try to catch up... but I'm about to drop [sort of] offline for a week and a half! There's a lot of reading there! Should be a prereq that we first go back and re-read what we said "the last time"... ;)

Incidentally, some of this thread replays our exchange at the top of #2532 from a year ago. It was fun to go back and reread that: in the interim, we've implemented segment-centric search and memory mapped field caches and term dictionaries, both of which were first discussed back then.

Nice!

Ords are great for low cardinality fields of all kinds, but become less efficient for high cardinality primitive numeric fields. For simplicity's sake, the prototype implementation of mmap'd field caches in KS always uses ords.

Right...

You don't want to have to create Lucy's equivalent of the JMM...

The more I think about making Lucy classes thread safe, the harder it seems. I'd like to make it possible to share a Schema across threads, for instance, but that means all its Analyzers, etc have to be thread-safe as well, which isn't practical when you start getting into contributed subclasses.

Even if we succeed in getting Folders and FileHandles thread safe, it will be hard for the user to keep track of what they can and can't do across threads. "Don't share anything" is a lot easier to understand.

We reap a big benefit by making Lucy's metaclass infrastructure thread-safe. Beyond that, seems like there's a lot of pain for little gain.

Yeah. Threads are not easy :(

asfimport commented 12 years ago

Tim A. (migrated from JIRA)

Hi,

I am a Computer Science student from Germany. I would like to contribute to this project under GSoC 2012. I have very good experience with Java. I have some questions about this project; can someone help me? IRC or instant messenger?

Thank You Tim

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Is there anyone who can volunteer to be a mentor for this issue...?

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I would but I am so overloaded with other work right now. I can be the primary mentor if you could help when I am totally blocked.

Hi Tim, since we are part of the Apache Foundation and an open source project, we make everything public. So if you have questions, please start a thread on the dev@l.a.o mailing list and I will be happy to help you. For GSoC-internal or private issues while GSoC is running, we can communicate privately.

simon

asfimport commented 12 years ago

Tim A. (migrated from JIRA)

Hello Michael, hello Simon,

thanks for the fast response.

So if you have questions please go and start a thread on the dev@l.a.o [...]

Okay, I will do this and start a thread. I have some specific questions about the task (Refactoring IndexWriter).

For example:

  1. Do unit tests exist for the code (IndexWriter.java)?
  2. Where can I find the code/software or the component? (svn, git, etc.)
  3. Which IDE should I use for this project? Do you suggest Eclipse?
  4. What about coding style guides?
  5. [...]

asfimport commented 11 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Been a long time since this has seen action - pushing out of 4.1.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport commented 8 years ago

Furkan Kamaci (migrated from JIRA)

I would like to apply for this issue as a GSoC project if someone volunteers to be a mentor.