Open asfimport opened 14 years ago
Jason Rutherglen (migrated from JIRA)
In regards to the terms dictionary, keeping it sorted or not, I think it's best to sort it on demand because otherwise there will be yet another parameter to pass into IW (i.e. sortRAMBufTerms or something like that).
Jason Rutherglen (migrated from JIRA)
I set out implementing a simple method DocumentsWriter.getTerms which should return a sorted array of terms over the current RAM buffer. While I think this can be implemented, there's a lot of code in the index package to handle multiple threads, which is fine, except I'm concerned the interleaving of postings won't perform well. So I think we'd want to implement what's been discussed in #3369, per-thread RAM buffers. With that change, it seems implementing this issue could be straightforward.
Jason Rutherglen (migrated from JIRA)
From #3369: > (b-tree, or, simply sort-on-demand the first time a query needs it, though that cost increases the larger your RAM segments get, ie, not incremental to the # docs you just added)
For the terms dictionary, perhaps a terms array (this could be a RawPostingList[], or an array of objects with pointers to a RawPostingList with some helper methods like getTerm and compareTo), is kept in sorted order, we then binary search and insert new RawPostingLists/terms into the array. We could implement a 2 dimensional array, allowing us to make a per reader copy of the 1st dimension of array. This would maintain transactional consistency (ie, a reader's array isn't changing as a term enum is traversing in another thread).
Also, we have to solve what happens to a reader using a RAM segment that's been flushed. Perhaps we don't reuse RAM at that point, ie, rely on GC to reclaim once all readers using that RAM segment have closed.
I don't think we have a choice here?
Michael McCandless (@mikemccand) (migrated from JIRA)
For the terms dictionary, perhaps a terms array (this could be a RawPostingList[], or an array of objects with pointers to a RawPostingList with some helper methods like getTerm and compareTo), is kept in sorted order, we then binary search and insert new RawPostingLists/terms into the array. We could implement a 2 dimensional array, allowing us to make a per reader copy of the 1st dimension of array. This would maintain transactional consistency (ie, a reader's array isn't changing as a term enum is traversing in another thread).
I don't think we can do term insertion into an array – that's O(N^2) insertion cost – we should use a btree instead.
Also, we could store the first docID stored into the term, too – this way we could have a ordered collection of terms, that's shared across several open readers even as changes are still being made, but each reader skips a given term if its first docID is greater than the maxDoc it's searching. That'd give us point in time searching even while we add terms with time...
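As a rough illustration of this idea (hypothetical classes, not actual Lucene APIs): each in-RAM term records the first docID that contained it, and a reader opened at a given maxDoc simply skips terms added after its snapshot.

```java
import java.util.Iterator;

// Sketch only: a shared, append-only sorted term collection with point-in-time
// filtering per reader. RAMTerm and RAMTermsEnum are illustrative names.
class RAMTerm {
  final String text;
  final int firstDocID;       // docID of the first document that contained this term

  RAMTerm(String text, int firstDocID) {
    this.text = text;
    this.firstDocID = firstDocID;
  }
}

class RAMTermsEnum {
  private final Iterator<RAMTerm> it;   // terms in sorted order
  private final int maxDoc;             // snapshot taken when the reader was opened

  RAMTermsEnum(Iterable<RAMTerm> sortedTerms, int maxDoc) {
    this.it = sortedTerms.iterator();
    this.maxDoc = maxDoc;
  }

  /** Returns the next term visible to this reader, or null when exhausted. */
  RAMTerm next() {
    while (it.hasNext()) {
      RAMTerm t = it.next();
      if (t.firstDocID < maxDoc) {
        return t;               // existed before this reader's snapshot
      }
      // added after the reader opened: invisible, keep scanning
    }
    return null;
  }
}
```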
Also, we have to solve what happens to a reader using a RAM segment that's been flushed. Perhaps we don't reuse RAM at that point, ie, rely on GC to reclaim once all readers using that RAM segment have closed.
I don't think we have a choice here?
I think we do have a choice.
EG we could force the reader to cutover to the newly flushed segment (which should be identical to the RAM segment), eg by making [say] a DelegatingSegmentReader.
Still... we'd probably have to not re-use in that case, since there can be queries in-flight stepping through the RAM postings, and, we have no way to accurately detect they are done. But at least with this approach we wouldn't tie up RAM indefinitely...
Or maybe we simply state that the APP must aggressively close NRT readers with time else memory use grows and grows... but I don't really like that. We don't have such a restriction today...
Jason Rutherglen (migrated from JIRA)
Mike, Why does DocFieldConsumers have DocFieldConsumer one and two? How is this class used? Thanks.
Michael Busch (migrated from JIRA)
Also, we could store the first docID stored into the term, too - this way we could have a ordered collection of terms, that's shared across several open readers even as changes are still being made, but each reader skips a given term if its first docID is greater than the maxDoc it's searching. That'd give us point in time searching even while we add terms with time...
Exactly. This is what I meant in my comment: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
But I mistakenly said lastDocID; of course firstDocID is correct.
Michael Busch (migrated from JIRA)
I'll try to tackle this one!
Jason Rutherglen (migrated from JIRA)
A few notes so far:
IW flush could become thread dependent (eg, it'll only flush for the current doc writer) or maybe it should flush all doc writers? Close will shut down and flush all doc writers.
A new term will first check the hash table for existence (as currently); only if it's not in the term hash table will it be added to the btree (btw, binary search is O(log N) on average?). This way we're avoiding the somewhat costlier btree existence check per token.
The algorithm for flushing doc writers based on RAM consumption can simply be, on exceed, flush the doc writer consuming the most RAM?
I gutted the PerThread classes, then realized, it's all too intertwined. I'd rather get something working, than spend an excessive amount of time rearranging code that already works.
Jason Rutherglen (migrated from JIRA)
IW commitMerge calls docWriter's remapDeletes, a synchronized method to prevent concurrent updates. I'm not sure how we should efficiently block calls to the different DW's.
_mergeInit calls docWriter getDocStoreSegment - unsure what to change
Some of the config settings (such as maxBufferedDocs) can simply be removed from DW, and instead accessed via WriterConfig
Michael McCandless (@mikemccand) (migrated from JIRA)
Michael are you also going to [first] tackle truly separating the RAM segments? I think we need this first ...
Mike, Why does DocFieldConsumers have DocFieldConsumer one and two? How is this class used? Thanks.
This is so we can make a "tee" in the indexing chain. Here's the default chain (copied out of comment in DW):
DocConsumer / DocConsumerPerThread
--> code: DocFieldProcessor / DocFieldProcessorPerThread
--> DocFieldConsumer / DocFieldConsumerPerThread / DocFieldConsumerPerField
--> code: DocFieldConsumers / DocFieldConsumersPerThread / DocFieldConsumersPerField
--> code: DocInverter / DocInverterPerThread / DocInverterPerField
--> InvertedDocConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
--> code: TermsHash / TermsHashPerThread / TermsHashPerField
--> TermsHashConsumer / TermsHashConsumerPerThread / TermsHashConsumerPerField
--> code: FreqProxTermsWriter / FreqProxTermsWriterPerThread / FreqProxTermsWriterPerField
--> code: TermVectorsTermsWriter / TermVectorsTermsWriterPerThread / TermVectorsTermsWriterPerField
--> InvertedDocEndConsumer / InvertedDocConsumerPerThread / InvertedDocConsumerPerField
--> code: NormsWriter / NormsWriterPerThread / NormsWriterPerField
--> code: StoredFieldsWriter / StoredFieldsWriterPerThread / StoredFieldsWriterPerField
The tee is so the doc fields can go to both DocInvert (for creating postings & term vectors) and to stored fields writer.
Michael McCandless (@mikemccand) (migrated from JIRA)
IW flush could become thread dependent
Right, we want this – different RAM segments should be flushed at different times. This gives us better concurrency since IO/CPU resource consumption will now be more interleaved. While one RAM segment is flushing, the others are still indexing.
A new term will first check the hash table for existence (as currently), if it's not in the term hash table only then will it be added to the btree (btw, binary search is O(log N) on average?) This way we're avoiding the somewhat costlier btree existence check per token.
Yes, we could have btree on-the-side but still use hash for mapping (vs using btree alone). Hash will be faster lookups... btree could be created/updated on demand first time something needs to .next() through the TermsEnum.
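A minimal sketch of that hash-plus-lazy-sorted-view arrangement (illustrative names; single writer thread assumed, so concurrency is ignored here):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: hash for the per-token lookup on the hot path, sorted view built
// lazily the first time a TermsEnum needs ordered iteration.
class BufferedTermsDict {
  private final Map<String, Object> byText = new HashMap<String, Object>(); // term -> posting list handle
  private List<String> sortedTerms;     // null means "needs rebuild"

  void add(String term, Object postingList) {
    if (byText.put(term, postingList) == null) {
      sortedTerms = null;               // a brand-new term invalidates the sorted view
    }
  }

  Object lookup(String term) {          // hot path: one hash probe per token
    return byText.get(term);
  }

  List<String> sortedView() {           // cold path: first .next() on a TermsEnum
    if (sortedTerms == null) {
      List<String> view = new ArrayList<String>(byText.keySet());
      Collections.sort(view);
      sortedTerms = view;
    }
    return sortedTerms;
  }
}
```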
The algorithm for flushing doc writers based on RAM consumption can simply be, on exceed, flush the doc writer consuming the most RAM?
Sounds good :) The challenge will be balancing things... eg if during the time 1 RAM segment is flushed, the others are able to consume more RAM than was freed up by flushing this one RAM segment, you've got a problem... or maybe at that point you go and flush the next one now using the most RAM, so it'd self-balance with time.
This will mean the RAM usage is able to flare up above the high water mark...
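For example, the balancing loop could look roughly like this (ramBytesUsed() and flushSegment() are assumed methods for the sketch, not the existing DocumentsWriter API):

```java
import java.io.IOException;
import java.util.List;

// Sketch of the "flush the biggest consumer" policy under a shared RAM budget.
class FlushPolicySketch {

  static void balanceRAM(List<DocumentsWriter> writers, long ramBudgetBytes) throws IOException {
    long total = 0;
    for (DocumentsWriter dw : writers) {
      total += dw.ramBytesUsed();
    }
    // Keep flushing the largest consumer until we're back under budget; the
    // other writers may keep indexing (and growing) while each flush runs, so
    // in the worst case this loops again, self-balancing over time.
    while (total > ramBudgetBytes) {
      DocumentsWriter biggest = writers.get(0);
      for (DocumentsWriter dw : writers) {
        if (dw.ramBytesUsed() > biggest.ramBytesUsed()) {
          biggest = dw;
        }
      }
      total -= biggest.ramBytesUsed();
      biggest.flushSegment();
    }
  }
}
```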
I gutted the PerThread classes, then realized, it's all too intertwined. I'd rather get something working, than spend an excessive amount of time rearranging code that already works.
For starters I would keep the *PerThread, but create multiple DWs? Ie, removing the PerThread layer doesn't have to happen at first.
Or we could do the nuclear option – make a new indexing chain.
Michael McCandless (@mikemccand) (migrated from JIRA)
IW commitMerge calls docWriter's remapDeletes, a synchronized method to prevent concurrent updates. I'm not sure how we should efficiently block calls to the different DW's.
Yeah this is because when we buffer a delete Term/Query, the docID we store against it is absolute. It seems like it could/should be relative (ie, within the RAM segment), then remapping wouldn't be needed when a merge commits. I think?
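A sketch of what segment-relative buffered deletes could look like (the class here is illustrative, not the existing buffered-deletes code): each delete records a docID local to its private RAM segment, and the IDs are rebased onto an absolute base only when that segment is flushed, so commitMerge never needs to remap them.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.Term;

// Sketch: buffer deletes against docIDs that are local to this private RAM
// segment. Because the IDs are relative, merging other (already flushed)
// segments never requires remapping them.
class SegmentLocalDeletes {
  private final Map<Term, Integer> deleteTerms = new HashMap<Term, Integer>();

  /** Record that 'term' deletes all docs up to the given segment-local docID. */
  void addDeleteTerm(Term term, int localDocIDUpto) {
    deleteTerms.put(term, localDocIDUpto);
  }

  /** On flush, rebase the local docIDs onto the absolute base of the new segment. */
  Map<Term, Integer> rebase(int segmentDocBase) {
    Map<Term, Integer> absolute = new HashMap<Term, Integer>();
    for (Map.Entry<Term, Integer> e : deleteTerms.entrySet()) {
      absolute.put(e.getKey(), segmentDocBase + e.getValue());
    }
    return absolute;
  }
}
```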
_mergeInit calls docWriter getDocStoreSegment - unsure what to change
It wouldn't anymore once we have private RAM segments: we would no longer share doc stores across segments, meaning merging will always merge doc stores and there's no need to call that method nor have all the logic in SegmentMerger to determine whether doc store merging is required.
This will necessarily be a perf hit when building a large index from scratch in a single IW session. Today that index creates one large set of doc stores and never has to merge it while building. This is the biggest perf downside to this change, I think.
But maybe the perf loss will not be so bad, because of bulk merging, in the case when all docs always add the same fields in the same order. Or... if we could fix Lucene to always bind the same field name to the same field number (#2811), then we'd always bulk-merge regardless of which fields, and in which order, the app adds them to docs.
Some of the config settings (such as maxBufferedDocs) can simply be removed from DW, and instead accessed via WriterConfig
Ahh, you mean push IWC down to DW? That sounds great.
Jason Rutherglen (migrated from JIRA)
Mike, rollback is pausing all threads and calling doc writer abort. This should probably happen across all (per thread) doc writers?
Michael Busch (migrated from JIRA)
Well, we need to keep our transactional semantics. So I assume while a flush will happen per doc writer independently, a commit will trigger all (per thread) doc writers to flush. Then a rollback also has to abort all per thread doc writers.
Michael Busch (migrated from JIRA)
Michael are you also going to [first] tackle truly separating the RAM segments? I think we need this first ...
Yeah I agree. I started working on a patch for separating the doc writers already.
I also have a separate indexing chain prototype working with searchable RAM buffer (single-threaded), but slightly different postinglist format (some docs nowadays only have 140 characters ;) ). It seems really fast. I spent a long time thinking about lock-free algorithms and data structures, so indexing performance should be completely independent of the search load (in theory). I need to think a bit more about how to make it work with "normal" documents and Lucene's current in-memory format.
Jason Rutherglen (migrated from JIRA)
I got the basics of the term enum working; it can be completed fairly easily. So I moved on to term docs... there we've got some work to do, because we're not storing the skip lists in the RAM buffer currently. I guess we'll need a new FreqProxTermsWriterPerField that stores the skip lists as they're being written? How will that work? Doesn't the multi-level skip list assume a set number of docs?
Michael McCandless (@mikemccand) (migrated from JIRA)
Yes, commit should flush & sync all doc writers, and rollback must abort all of them.
I also have a separate indexing chain prototype working with searchable RAM buffer (single-threaded)
Yay!
but slightly different postinglist format (some docs nowadays only have 140 characters ).
New sponsor, eh? ;)
But, yes, I suspect an indexer chain optimized to tiny docs can get sizable gains.
What change to the postings format? Is the change only in the RAM buffer or also in the index? If it's in the index... we should probably do this under flex.
It seems really fast. I spent a long time thinking about lock-free algorithms and data structures, so indexing performance should be completely independent of the search load (in theory). I need to think a bit more about how to make it work with "normal" documents and Lucene's current in-memory format.
Sounds like awesome progress!! Want some details over here :)
Michael McCandless (@mikemccand) (migrated from JIRA)
I got the basics of the term enum working; it can be completed fairly easily. So I moved on to term docs... there we've got some work to do, because we're not storing the skip lists in the RAM buffer currently. I guess we'll need a new FreqProxTermsWriterPerField that stores the skip lists as they're being written? How will that work? Doesn't the multi-level skip list assume a set number of docs?
Sounds like you & Michael should sync up!
Good question on skipping – for first cut we can have no skipping (and just scan)? Skipping may not be that important in practice, unless RAM buffer becomes truly immense. Of course, the tinier the docs the more important skipping will be...
Jason Rutherglen (migrated from JIRA)
Good question on skipping - for first cut we can have no skipping (and just scan)?
True.
One immediate thought is to have a set skip interval (what was it before, when we had a single level?), and for now at least have a single-level skip list. That way we can grow the posting list with docs, and the skip list at the same time. If the interval is constant there won't be a need to rebuild the skip list.
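Something like the following could work as a first cut (an illustrative class, not existing Lucene code): because the skip interval is fixed, new skip points are simply appended as docs are added, and nothing ever needs rebuilding.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a single-level skip list with a fixed interval that grows alongside
// the in-RAM posting list.
class GrowingSkipList {
  private final int skipInterval;          // e.g. 16 docs between skip points
  private final List<Integer> skipDocIDs = new ArrayList<Integer>();
  private final List<Integer> skipByteOffsets = new ArrayList<Integer>();
  private int docCount;

  GrowingSkipList(int skipInterval) {
    this.skipInterval = skipInterval;
  }

  /** Called once per doc appended to the posting list. */
  void onDoc(int docID, int byteOffsetInPostings) {
    if (++docCount % skipInterval == 0) {
      skipDocIDs.add(docID);
      skipByteOffsets.add(byteOffsetInPostings);
    }
  }

  /** Returns the byte offset of the nearest skip point at or before 'target', or 0. */
  int seek(int target) {
    int lo = 0, hi = skipDocIDs.size() - 1, best = -1;
    while (lo <= hi) {                     // binary search over the skip points
      int mid = (lo + hi) >>> 1;
      if (skipDocIDs.get(mid) <= target) { best = mid; lo = mid + 1; }
      else { hi = mid - 1; }
    }
    return best < 0 ? 0 : skipByteOffsets.get(best);
  }
}
```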
Jason Rutherglen (migrated from JIRA)
Apologies in advance for permanently damaging (well, I guess it can be deleted) the look and feel of this issue with a thwack of code; however, I don't want to post the messy patch, and I'm guessing there's something small as to why the postings iteration on the freq byte slice reader isn't happening correctly (ie, it's returning 0).
public class DWTermDocs implements TermDocs {
  final FreqProxTermsWriterPerField field;
  final int numPostings;
  final CharBlockPool charPool;
  FreqProxTermsWriter.PostingList posting;
  char[] text;
  int textOffset;
  private int postingUpto = -1;
  final ByteSliceReader freq = new ByteSliceReader();
  final ByteSliceReader prox = new ByteSliceReader();
  int docID;
  int termFreq;

  DWTermDocs(FreqProxTermsWriterPerField field, FreqProxTermsWriter.PostingList posting) throws IOException {
    this.field = field;
    this.charPool = field.perThread.termsHashPerThread.charPool;
    //this.numPostings = field.termsHashPerField.numPostings;
    this.numPostings = 1;
    this.posting = posting;
    // nextTerm is called only once to
    // set the term docs pointer at the
    // correct position
    nextTerm();
  }

  boolean nextTerm() throws IOException {
    postingUpto++;
    if (postingUpto == numPostings)
      return false;
    docID = 0;
    text = charPool.buffers[posting.textStart >> DocumentsWriter.CHAR_BLOCK_SHIFT];
    textOffset = posting.textStart & DocumentsWriter.CHAR_BLOCK_MASK;
    field.termsHashPerField.initReader(freq, posting, 0);
    if (!field.fieldInfo.omitTermFreqAndPositions)
      field.termsHashPerField.initReader(prox, posting, 1);
    // Should always be true
    boolean result = nextDoc();
    assert result;
    return true;
  }

  public boolean nextDoc() throws IOException {
    if (freq.eof()) {
      if (posting.lastDocCode != -1) {
        // Return last doc
        docID = posting.lastDocID;
        if (!field.omitTermFreqAndPositions)
          termFreq = posting.docFreq;
        posting.lastDocCode = -1;
        return true;
      } else
        // EOF
        return false;
    }

    final int code = freq.readVInt();
    if (field.omitTermFreqAndPositions)
      docID += code;
    else {
      docID += code >>> 1;
      if ((code & 1) != 0)
        termFreq = 1;
      else
        termFreq = freq.readVInt();
    }
    assert docID != posting.lastDocID;
    return true;
  }
}
Michael McCandless (@mikemccand) (migrated from JIRA)
I don't see anything obviously wrong – you excised this code from the same code that's used when merging the postings during flush?
Jason Rutherglen (migrated from JIRA)
The code is from FreqProxFieldMergeState, which accepts FreqProxTermsWriterPerField in its constructor. One difference is that instead of operating on an array of posting lists, the code above assumes one posting list.
The numPostings was always 0 when testing
this.numPostings = field.termsHashPerField.numPostings;
In the code above it's hard coded to 1.
Maybe there's some initialization that's not happening correctly?
Michael McCandless (@mikemccand) (migrated from JIRA)
Ahh, I think it's because you're not calling compactPostings/sortPostings in the THPF, right?
Those methods collapse the hash table in-place (ie move all the nulls out), and sort.
So you have to re-work the code to not do that and instead use whatever structure you have for visiting terms in sorted order. Then stepping through the docs should just work, but, you gotta stop at the max docID, right?
Hmm... what does JMM say about byte arrays? If one thread is writing to the byte array, can any other thread see those changes?
Jason Rutherglen (migrated from JIRA)
Ahh, I think it's because you're not calling compactPostings/sortPostings in the THPF, right?
Those methods collapse the hash table in-place (ie move all the nulls out), and sort.
Yep, got that part.
So you have to re-work the code to not do that and instead use whatever structure you have for visiting terms in sorted order. Then stepping through the docs should just work, but, you gotta stop at the max docID, right?
Right, the terms in sorted order is working... The freq ByteSliceReader is reading nothing however (zeroes). Either it's init'ed to the wrong position, or there's nothing in there? Or something else.
Jason Rutherglen (migrated from JIRA)
Also wanted to add that the PostingList lastDocID is correct.
Jason Rutherglen (migrated from JIRA)
I have a test case showing the term docs working... I'm going to try to add the term positions methods.
Jason Rutherglen (migrated from JIRA)
Basic term positions working, need to figure out how to do lazy loading payloads...
Jason Rutherglen (migrated from JIRA)
In thinking about the terms dictionary, we're going to run into concurrency issues, right, if we just use TreeMap? Can't we simply use the lock-free ConcurrentSkipListMap? Yeah, it's part of Java 6, but why reinvent the wheel?
Jason Rutherglen (migrated from JIRA)
Payloads work (non-lazy loading); however, ByteSliceReader doesn't implement a seek method, so I think we simply need to load each payload as we increment nextPosition? The cost shouldn't be too much because we're simply copying small byte arrays (in the heap).
Michael Busch (migrated from JIRA)
Sounds like awesome progress!! Want some details over here :)
Sorry for not being very specific. The prototype I'm experimenting with has a fixed length postings format for the in-memory representation (in TermsHash). Basically every posting has 4 bytes, so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are used for an absolute docID (not delta-encoded). This limits the max in-memory segment size to 2^24 docs. The 1 remaining byte is used for the position. With a max doc length of 140 characters you can fit every possible position in a byte - what a luxury! :) If a term occurs multiple times in the same doc, then the TermDocs just skips multiple occurrences with the same docID and increments the freq. Again, the same term doesn't occur often in super short docs.
The int[] slices also don't have forward pointers, like in Lucene's TermsHash, but backwards pointers. In real-time search you often want a strongly time-biased ranking. A PostingList object has a pointer that points to the last posting (this statement is not 100% correct for visibility reasons across threads, but we can imagine it this way for now). A TermDocs can now traverse the postinglists in opposite order. Skipping can be done by following pointers to previous slices directly, or by binary search within a slice.
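To make the packed layout concrete, here's a sketch of the encoding Michael describes (an illustration of the 24-bit docID / 8-bit position packing only, not his prototype code):

```java
// Sketch: one posting = 24 bits of absolute docID plus 8 bits of position,
// packed into a single int.
final class PackedPosting {
  static final int MAX_DOC = (1 << 24) - 1;   // 2^24 docs per in-memory segment
  static final int MAX_POS = (1 << 8) - 1;    // positions fit in one byte for ~140-char docs

  static int encode(int docID, int position) {
    assert docID <= MAX_DOC && position <= MAX_POS;
    return (docID << 8) | position;
  }

  static int docID(int posting) {
    return posting >>> 8;
  }

  static int position(int posting) {
    return posting & 0xFF;
  }
}
```

A TermDocs walking the int[] slices backwards would then collapse consecutive postings sharing a docID into a single doc while incrementing the freq, as described above.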
Michael Busch (migrated from JIRA)
Hmm... what does JMM say about byte arrays? If one thread is writing to the byte array, can any other thread see those changes?
This is the very right question to ask here. Thread-safety is really the by far most complicated aspect of this feature. Jason, I'm not sure if you already figured out how to ensure visibility of changes made by the writer thread to the reader threads?
Thread-safety in our case boils down to safe publication. We don't need locking to coordinate writing of multiple threads, because of #3400. But we need to make sure that the reader threads see all changes they need to see at the right time, in the right order. This is IMO very hard, but we all like challenges :)
The JMM gives no guarantee whatsoever about what changes a thread will see that another thread made - or if it will ever see the changes, unless proper publication is ensured by either synchronization or volatile/atomic variables.
So e.g. if a writer thread executes the following statements:
public static int a, b;
...
a = 1; b = 2;
a = 5; b = 6;
and a reader thread does:
System.out.println(a + "," + b);
The thing to remember is that the output might be: 1,6! Another reader thread with the following code:
while (b != 6) {
.. do something
}
might further NEVER terminate without synchronization/volatile/atomic.
The reason is that the JVM is allowed to perform any reorderings to utilize modern CPUs, memory, caches, etc. if not forced otherwise.
To ensure safe publication of data written by a thread we could use synchronization, but my goal here is to implement a non-blocking and lock-free algorithm. So my idea was to make use of a very subtle behavior of volatile variables. I will take a simple explanation of the JMM from Brian Goetz's awesome book "Java Concurrency in Practice", in which he describes the JMM in simple happens-before rules. I will mention only three of those rules, because they are enough to describe the volatile behavior I'd like to mention here (p. 341):
Program order rule: Each action in a thread happens-before every action in that thread that comes later in the program order.
Volatile variable rule: A write to a volatile field happens-before every subsequent read of that same field.
Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.
Based on these three rules you can see that writing to a volatile variable v by one thread t1, and subsequent reading of the same volatile variable v by another thread t2, publishes ALL changes of t1 that happened-before the write to v, and the change of v itself. So this write/read of v means crossing a memory barrier and forcing everything that t1 might have written to caches to be flushed to RAM. That's why a volatile write can actually be pretty expensive.
Note that this behavior is actually only working like I just described since Java 1.5. Behavior of volatile variables was a very very subtle change from 1.4->1.5!
The way I'm trying to make use of this behavior is actually similar to how we lazily sync Lucene's files with the filesystem: I want to delay the cache->RAM write-through as much as possible, which increases the probability of getting the sync for free! Still fleshing out the details, but I wanted to share this info with you guys already, because it might invalidate a lot of assumptions you might have when developing the code. Some of this stuff was actually new to me; maybe you all know it already. And if anything that I wrote here is incorrect, please let me know!
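A minimal sketch of that publication pattern (single writer assumed; the class and field names are made up): everything the writer stores into the plain array before the volatile write becomes visible to any reader that subsequently performs the volatile read.

```java
// Sketch of safe publication via a volatile "high water mark" (single writer).
// Per the happens-before rules above, all plain writes made before the volatile
// write to publishedUpto become visible to a reader after it reads publishedUpto.
class PublishedBuffer {
  private final int[] postings = new int[1 << 20];
  private volatile int publishedUpto;            // the memory barrier

  // writer thread
  void append(int posting) {
    int upto = publishedUpto;
    postings[upto] = posting;                    // plain write (program order rule)...
    publishedUpto = upto + 1;                    // ...published by this volatile write
    // (the volatile write could be batched, e.g. every N appends, to delay the
    // cache write-through as described above)
  }

  // reader thread
  int visibleCount() {
    return publishedUpto;                        // volatile read: crosses the barrier
  }

  int get(int index) {
    return postings[index];                      // safe for index < visibleCount()
  }
}
```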
Btw: IMO, if there's only one java book you can ever read, then read Goetz' book! It's great. He also says in the book somewhere about lock-free algorithms: "Don't try this at home!" - so, let's do it! :)
Jason Rutherglen (migrated from JIRA)
Just to clarify, I think Mike's referring to ParallelArray?
http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/ParallelArray.html
There's AtomicIntegerArray: http://www.melclub.net/java/_atomic_integer_array_8java_source.html which underneath uses the sun.Unsafe class for volatile array access. Could this be reused for an AtomicByteArray class (why isn't there one of these already?).
A quick and easy way to solve this is to use a read-write lock on the byte pool? Remember when we'd sync on each read-bytes call to the underlying random access file in FSDirectory (eg, now we're using NIOFSDir, which can be a good concurrent throughput improvement). Let's try the RW lock and examine the results? I guess the issue is we're not writing in blocks of bytes, we're actually writing byte by byte and need to read byte by byte concurrently? This sounds like a fairly typical thing to do?
Michael Busch (migrated from JIRA)
A quick and easy way to solve this is to use a read write lock on the byte pool?
If you use a RW lock then the writer thread will block all reader threads while it's making changes. The writer thread will be making changes all the time in a real-time search environment. The contention will kill performance I'm sure. RW lock is only faster than mutual exclusion lock if writes are infrequent, as mentioned in the javadocs of ReadWriteLock.java
Jason Rutherglen (migrated from JIRA)
but my goal here is to implement a non-blocking and lock-free algorithm. So my idea was to make use of a very subtle behavior of volatile variables.
You're talking about having a per thread write buffer byte array, that on search gets copied into a read only array, or gets transformed magically into a volatile byte array? (Do volatile byte arrays work? I couldn't find a clear answer on the net, maybe it's stated in the Goetz book). If volatile byte arrays do work, an option to test would be a byte buffer pool that uses volatile byte arrays?
Michael Busch (migrated from JIRA)
Do volatile byte arrays work
I'm not sure what you mean by volatile byte arrays?
Do you mean this?
volatile byte[] array;
This makes the reference to the array volatile, not the slots in the array.
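In code, the distinction looks like this; per-element volatile semantics require something like AtomicIntegerArray (there is no AtomicByteArray in the JDK, which is presumably why the question comes up):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// The distinction in code: 'volatile' on an array field only covers the
// reference, while AtomicIntegerArray gives volatile semantics per element.
class VolatileArrayExample {
  volatile byte[] bytes = new byte[16];   // writes to bytes[i] get no visibility guarantee

  final AtomicIntegerArray ints = new AtomicIntegerArray(16);

  void write(int i, int value) {
    ints.set(i, value);                   // volatile-write semantics for this slot
  }

  int read(int i) {
    return ints.get(i);                   // volatile-read semantics for this slot
  }
}
```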
Jason Rutherglen (migrated from JIRA)
This makes the reference to the array volatile, not the slots in the array
That's no good! :)
If you use a RW lock then the writer thread will block all reader threads while it's making changes
We probably need to implement more fine-grained locking, perhaps using volatile booleans instead of RW locks. Fine-grained meaning on the byte array/block level. I think this would imply that changes are not visible until a given byte block is more or less "flushed"? This is different from the design that's been implied, that we'd read from byte arrays as they're being written to. We probably don't need to read from and write to the same byte array concurrently (that might not be feasible?).
The performance win here is probably going to be the fact that we avoid segment merges.
Michael Busch (migrated from JIRA)
The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks).
I think I have a lock-free solution working, which only syncs (i.e. does volatile writes) in certain intervals to not prevent JVM optimizations - but I need more time for thinking about all the combinations and corner cases.
It's getting late now - need to sleep!
Michael McCandless (@mikemccand) (migrated from JIRA)
The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks).
Right, I'm just not familiar specifically with what JMM says about one thread writing to a byte[] and another thread reading it.
In general, for our usage, the reader threads will never read into an area that has not yet been written to. So that works in our favor (they can't cache those bytes if they didn't read them). EXCEPT the CPU will have loaded the bytes on a word boundary and so if our reader thread reads only 1 byte, and no more (because this is now the end of the posting), the CPU may very well have pulled in the following 7 bytes (for example) and then illegally (according to our needs) cache them.
We better make some serious tests for this... including reader threads that just enum the postings for a single rarish term over and over while writer threads are indexing docs that occasionally have that term. I think that's the worst case for JMM violation since the #bytes cached is small.
It's too bad there isn't higher level control on the CPU caching via java. EG, in our usage, if we could call a System.flushCPUCache whenever a thread enters a newly reopened reader.... because, when accessing postings via a given Reader we want point-in-time searching anyway and so any bytes cached by the CPU are perfectly fine. We only need CPU cache flush when a reader is reopened....
Michael McCandless (@mikemccand) (migrated from JIRA)
The prototype I'm experimenting with has a fixed length postings format for the in-memory representation (in TermsHash). Basically every posting has 4 bytes, so I can use int[] arrays (instead of the byte[] pools). The first 3 bytes are used for an absolute docID (not delta-encoded). This limits the max in-memory segment size to 2^24 docs. The 1 remaining byte is used for the position. With a max doc length of 140 characters you can fit every possible position in a byte - what a luxury! If a term occurs multiple times in the same doc, then the TermDocs just skips multiple occurrences with the same docID and increments the freq. Again, the same term doesn't occur often in super short docs.
The int[] slices also don't have forward pointers, like in Lucene's TermsHash, but backwards pointers. In real-time search you often want a strongly time-biased ranking. A PostingList object has a pointer that points to the last posting (this statement is not 100% correct for visibility reasons across threads, but we can imagine it this way for now). A TermDocs can now traverse the postinglists in opposite order. Skipping can be done by following pointers to previous slices directly, or by binary search within a slice.
This sounds nice!
This would be a custom indexing chain for docs guaranteed not to be over 255 positions in length right?
Michael McCandless (@mikemccand) (migrated from JIRA)
In thinking about the terms dictionary, we're going to run into concurrency issues right if we just use TreeMap?
Right, we need a concurrent data structure here. It's OK if there've been changes to this shared data structure since a reader was opened – that reader knows its max doc id and so it can skip a term if the first doc id in that term is > that max.
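Combining that with Jason's ConcurrentSkipListMap suggestion might look roughly like this (an illustrative sketch, not actual Lucene code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch: a concurrent, sorted terms dictionary shared by the writer and all
// open readers. Each reader filters on the firstDocID recorded per term, so
// terms added after it opened stay invisible (point-in-time semantics).
class ConcurrentTermsDict {

  static class Entry {
    final int firstDocID;
    final Object postings;              // handle to the in-RAM posting list
    Entry(int firstDocID, Object postings) {
      this.firstDocID = firstDocID;
      this.postings = postings;
    }
  }

  interface TermVisitor {
    void visit(String term, Object postings);
  }

  private final ConcurrentSkipListMap<String, Entry> terms =
      new ConcurrentSkipListMap<String, Entry>();

  void addIfAbsent(String term, int docID, Object postings) {
    terms.putIfAbsent(term, new Entry(docID, postings));
  }

  /** Visit terms visible to a reader opened at 'maxDoc', in sorted order. */
  void visitVisibleTerms(int maxDoc, TermVisitor visitor) {
    // entrySet() iteration is weakly consistent: it never throws
    // ConcurrentModificationException while the writer keeps adding terms.
    for (Map.Entry<String, Entry> e : terms.entrySet()) {
      if (e.getValue().firstDocID < maxDoc) {
        visitor.visit(e.getKey(), e.getValue().postings);
      }
    }
  }
}
```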
Jason Rutherglen (migrated from JIRA)
I thought we're moving away from byte block pooling and we're going to try relying on garbage collection? Does a volatile object[] publish changes to all threads? Probably not, again it'd just be the pointer.
In the case of posting/termdocs iteration, I'm more concerned that the lastDocID be volatile than with the byte array containing extra data. Extra docs are OK in the byte array because we'll simply stop iterating when we've reached the last doc. Though with our system we shouldn't even run into this; meaning, a byte array is copied and published, then perhaps the master byte array is still being written to and the same byte array (by id or something) is published again? Then we'd have multiple versions of byte arrays. That could be bad.
Because there is one DW per thread, there's only one document being indexed at a time. There's no writer concurrency. This leaves reader concurrency. However after each doc, we could simply flush all bytes related to the doc. Any new docs must simply start writing to new byte arrays? The problem with this is, unless the byte arrays are really small, we'll have a lot of extra data around, well, unless the byte arrays are trimmed before publication. Or we can simply RW lock (or some other analogous thing) individual byte arrays, not publish them after each doc, then only publish them when get reader is called. To clarify, the RW lock (or flag) would only be per byte array, in fact, all writing to the byte array could necessarily cease on flush, and new byte arrays allocated. The published byte array could point to the next byte array.
I think we simply need a way to publish byte arrays to all threads? Michael B. can you post something of what you have so we can get an idea of how your system will work (ie, mainly what the assumptions are)?
We do need to strive for correctness of data, and perhaps performance will be slightly impacted (though compared with our current NRT we'll have an overall win).
Jason Rutherglen (migrated from JIRA)
The tricky part is to make sure that a reader always sees a consistent snapshot of the index. At the same time a reader must not follow pointers to non-published locations (e.g. array blocks).
Right. In what case in the term enum, term docs chain of doc scoring would a reader potentially try to follow a pointer to a byte array that doesn't exist? I think we're strictly preventing it via last doc ids? Also, when we flush, I think we need to block further doc writing (via an RW lock?) and wait for any currently writing docs to complete, then forcibly publish the byte arrays, then release the write lock? This way we always have published data that's consistent for readers (eg, the inverted index can be read completely, and there won't be any wild writes still occurring to a byte array that's been published).
Michael Busch (migrated from JIRA)
I thought we're moving away from byte block pooling and we're going to try relying on garbage collection? Does a volatile object[] publish changes to all threads? Probably not, again it'd just be the pointer.
We were so far only considering moving away from pooling of (Raw)PostingList objects. Pooling byte blocks might have more performance impact - they're more heavy-weight.
Jason Rutherglen (migrated from JIRA)
To clarify the above comment, DW's update doc method would acquire a mutex. The flush bytes method would also acquire that mutex when it copies existing writeable bytes over to the readable bytes thing (pool?).
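A rough sketch of that mutex arrangement (all names here are made up): updateDocument and the flush-bytes step share one lock, so readers are only ever handed blocks that were completely written at publication time.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the scheme described above: updateDocument and the flush-bytes
// step share one mutex, so a block handed to readers is never still being written.
class PerThreadWriterSketch {
  private final Object mutex = new Object();
  private final List<byte[]> writableBlocks = new ArrayList<byte[]>();
  private volatile List<byte[]> publishedBlocks = new ArrayList<byte[]>();

  void updateDocument(byte[] encodedDoc) {
    synchronized (mutex) {               // one doc at a time per (per-thread) DW
      writableBlocks.add(encodedDoc);
    }
  }

  /** Called by getReader(): publish a stable snapshot of everything written so far. */
  void flushBytes() {
    synchronized (mutex) {               // waits for any in-flight updateDocument
      publishedBlocks = new ArrayList<byte[]>(writableBlocks);   // volatile publish
    }
  }

  List<byte[]> readerBlocks() {
    return publishedBlocks;              // readers see only fully written blocks
  }
}
```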
Michael Busch (migrated from JIRA)
I think we simply need a way to publish byte arrays to all threads? Michael B. can you post something of what you have so we can get an idea of how your system will work (ie, mainly what the assumptions are)?
It's kinda complicated to explain and currently differs from Lucene's TermHash classes a lot. I'd prefer to wait a little bit until I have verified that my solution works.
I think here we should really tackle #3400 first - it's a prereq. Wanna help with that, Jason?
Jason Rutherglen (migrated from JIRA)
I think the easiest way to test out the concurrency is to add a flush method to ByteBlockPool. Then allocate a read only version of the buffers array (not copying the byte arrays, just the 1st dimension pointers). The only issue is to rework the code to read from the read only array, and write to the write only array...
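For instance, a hypothetical flush() on a ByteBlockPool-like class could copy just the first-dimension pointers into a snapshot for readers (this method does not exist in Lucene; it's a sketch of the idea):

```java
// Hypothetical flush() for a ByteBlockPool-like class: readers get a copy of
// the first dimension (the block pointers) only; the byte[] blocks themselves
// are shared but are no longer written to once a new writable block is allocated.
class ByteBlockPoolSketch {
  byte[][] buffers = new byte[10][];
  int bufferUpto = -1;                       // index of the block currently being written

  private volatile byte[][] readOnlyBuffers = new byte[0][];

  void nextBuffer(int blockSize) {
    if (++bufferUpto == buffers.length) {
      byte[][] grown = new byte[buffers.length * 2][];
      System.arraycopy(buffers, 0, grown, 0, buffers.length);
      buffers = grown;
    }
    buffers[bufferUpto] = new byte[blockSize];
  }

  /** Snapshot the block pointers for readers; completed blocks are effectively read-only. */
  void flush() {
    byte[][] snapshot = new byte[bufferUpto + 1][];
    System.arraycopy(buffers, 0, snapshot, 0, bufferUpto + 1);
    readOnlyBuffers = snapshot;              // volatile publish
  }

  byte[][] readerBuffers() {
    return readOnlyBuffers;
  }
}
```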
Jason Rutherglen (migrated from JIRA)
Mike, can you clarify why intUptos and intUptoStart are member variables in TermsHashPerField? Can't the accessors simply refer to IntBlockPool for these? I'm asking because in IntBlockPool flush, for now I'm simply calling nextBuffer to shuffle the current writable array into a read only state (ie, all of the arrays being written to prior to flush will now be readonly).
Jason Rutherglen (migrated from JIRA)
I think the DW index reader needs to create a new fields reader on demand if the field infos have changed since the last field reader instantiation.
Michael McCandless (@mikemccand) (migrated from JIRA)
intUptoStart is used in THPF.writeByte which is very much a hotspot when indexing, so I added it as a direct member in THPF to avoid an extra deref through the intPool. Could be this is harmless in practice though...
Jason Rutherglen (migrated from JIRA)
Previously there was a discussion about DW index readers that stay open, but could refer to byte arrays that are recycled? Can't we simply throw away the doc writer after a successful segment flush (the IRs would refer to it, however once they're closed, the DW would close as well)? Then start with a new DW for the next batch of indexing for that thread?
In order to offer users near-realtime search without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable.
Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing.
Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
Migrated from LUCENE-2312 by Jason Rutherglen, updated Sep 09 2011. Attachments: LUCENE-2312.patch (versions: 3), LUCENE-2312-FC.patch. Linked issues: #3400, #3649, #4272, #4472, #3422