BytesHash [LUCENE-2662]

asfimport commented 14 years ago

This issue will have the BytesHash separated out from #3262

Migrated from LUCENE-2662 by Jason Rutherglen, resolved Nov 25 2010 Attachments: LUCENE-2662.patch (versions: 6) Linked issues:

3262
- 3776
- 3793

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

We need unit tests and a base implementation as BytesHash is abstract...

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

The current hash implementation needs to be separated out of TermsHashPerField.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Jason: I am confused... there is no hash impl in TermsHashPerField.

the hashing, and term encoding and other things, is the responsibility of the analysis chain (TermToBytesRefAttribute):

    // Get the text & hash of this term.
    int code = termAtt.toBytesRef(utf8);

this way, implementations can 'hash-as-they-go' like we do when encoding unicode char[] -> byte[], or they can simply return BytesRef.hashCode() if they don't have an optimized implementation.

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

The THPF is hashing tokens for use in the indexing RAM buffer and the creation of postings, ie, the lookup of term byte[]s to term ids. The hash component is currently interwoven into THPF.

Here's some of the variables being used in THPF.

private int postingsHashSize = 4;
private int postingsHashHalfSize = postingsHashSize/2;
private int postingsHashMask = postingsHashSize-1;
private int[] postingsHash;

Also there's the methods rehashPostings, shrinkHash, postingEquals, and add(int textStart) has the lookup.

We'll probably also need to separate out the quick sort implementation in THPF, I'll add that to this issue.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Jason: what I am saying is if i look at the method in your patch:

public T add(BytesRef bytes)

the first thing it does is compute the hash, but this is already computed in the analysis chain.

why not have

public T add(BytesRef bytes, int hashCode)

and also:

public T add(BytesRef bytes) {
  return add(bytes, bytes.hashCode());
}

then we can avoid computing this twice, and keep the optimization in UnicodeUtil

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

Ah, ok, I didn't write this code, I extracted it from #3262, and nice, you reviewed it can be improved. I'll make changes to it shortly, hopefully.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

jason, can you please hold off with this since I have newer / different versions of this class already with tests etc. I understand that you need that class but creating all these issues and rushing ahead is rather counter productive.

@Robert: this class is standalone in this patch and doesn't know about the analysis chain. But thanks for the comments I will incorporate them.

simon

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

Simon, when do you think you'll be posting?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Simon, when do you think you'll be posting?

maybe within the next week I have a busy schedule but does this patch keep you from doing any work? You shouldn't just pull out stuff from 1 month old patches especially as you don't even give me time to reply on the orig. discussion.

Any rush on this?

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

It'd be nice to get deletes working, ie, #3729 and move forward in a way that's useful long term. What changes have you made?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

This patch contains a slightly different version of BytesHash (renamed it to BytesRefHash but that is to be discussed - while writing this I actually think BytesHash is the better name). BytesRefHash is now final and does not create Entry objects anymore. Internally it maintains two integer arrays one acting as the hash buckets and the other one contain the bytes-start offset in the ByteBlockPool. Each added entry is assigned to an increasing ordinal since this is what Entry is used in almost all use-cases (in CSF though). For TermsHashPerField this is also "native" since is uses the same kind of referencing system.

These changes keep this class as efficient as possible, keeping GC costs low and allows JIT to do better optimizations. IMO this class is super performance critical and since we recently refactored indexing towards parallel arrays adding another "object" array might not be the way to go anyway.

I also incorporated robers comments - thanks for the review anyway. I guess that is the first step towards factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

comments / review welcome!!

One more thing, I did not move ByteBlockPool to o.a.l.utils but I thing it belongs there, thoughts?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I guess that is the first step towards factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

I think it would be better if this class were used in the patch... i wouldn't commit it by itself unused. Its difficult for people to review its behavior, since its just a standalone unused thing (for instance, the hashCode thing i brought up)

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

> BytesRefHash is now final and does not create Entry objects anymore

That's good.

> move ByteBlockPool to o.a.l.utils

Sure why not.

> factoring it out of TermsHashPerField, the next question is are we gonna do that in a different issue and get this committed first?

We need to factor it out of THPF otherwise this patch isn't really useful for committing. Also, it'll get tested through the entirety of the unit tests, ie, it'll get put through the laundry.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

We need to factor it out of THPF otherwise this patch isn't really useful for committing. Also, it'll get tested through the entirety of the unit tests, ie, it'll get put through the laundry.

Yeah, lets see this as the first baby step towards it. I will move ByteBockPool to o.a.l.utils and start cutting THPF over to it. We need to do benchmarking in any case just to make sure JIT doesn't play nasty tricks with us again.

simon

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

make sure JIT doesn't play nasty tricks with us again.

What would we do if this happens?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Simon – some ideas:

In the class jdocs, I think state that this is basically a Map<BytesRef,int>?
Maybe we also move ByteBlockPool --> oal.util?
Maybe move out the ByteBlockAllocator to its own class (in util)? RecyclingByteBlockAllocator?
Can we have DocumentsWriter share the ByteBlockAllocator? (Right now it's dup'd code since DW also implements this).
Maybe rename ords -> keys? And hash -> values? (The key isn't really an "ord" (I think?) because it increases by more than 1 each time... it's more like an address since it references an address in the byte-pool space).
We should advertise the limits in the jdocs – limited to <= 2GB total byte storage, each key must be <= BLOCK SIZE-2 in length.
Can we have sortedEntries() not allocate a new iterator object? Ie, just return the sorted bytesStart int[]? (This is what's done today, and, for term vectors on small docs, this method is pretty hot). And the javadocs for this should be stronger – it's not that the behaviour is undefined after, it's that you must .clear() after you're done consume the sorted entries.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

make sure JIT doesn't play nasty tricks with us again.

What would we do if this happens?

Cry?

Or... install Harmony and see if it has the same problem and if so submit a patch to them to fix it :)

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

In the class jdocs, I think state that this is basically a Map<BytesRef,int>?

yeah that simplifies it - will do.

Maybe we also move ByteBlockPool --> oal.util?

yeah I did that already - that makes totally sense

Maybe move out the ByteBlockAllocator to its own class (in util)? RecyclingByteBlockAllocator?

+1 yeah I like that - I also think we should allow to pass the blockpool to the byteshash instead of the allocator. From what I can tell now I think this is necessary for the refactoring anyway since we share pools with secondary TermsHash instances in the termvector case.

Maybe rename ords -> keys? And hash -> values? (The key isn't really an "ord" (I think?) because it increases by more than 1 each time... it's more like an address since it references an address in the byte-pool space).

yeah that depends how you see it - the array index really is the ord though. but I like those names. I will change.

We should advertise the limits in the jdocs - limited to <= 2GB total byte storage, each key must be <= BLOCK SIZE-2 in length.

I think I have done the latter already but I will add the other too.

Can we have sortedEntries() not allocate a new iterator object? Ie, just return the sorted bytesStart int[]? (This is what's done today, and, for term vectors on small docs, this method is pretty hot). And the javadocs for this should be stronger - it's not that the behaviour is undefined after, it's that you must .clear() after you're done consume the sorted entries.

Ah I see - good point. I think what you refer to is public int[] sort(Comparator<BytesRef> comp) - the iterator one is just more convenient one. I will change though.

thanks mike!

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Attaching my current state for feedback and iteration.

factored out ByteBlockAllocator from DocumentsWriter
moved ByteBlockPool to o.a.l.util
added RecyclingByteBlockAllocator which can be used with or without synchronization. IMO the DummyConcurrentLock will be optimized away so that his might be super low cost. - feedback for that would more than welcome.
addressed all the comments from mike - thanks again
added more tests
cut over constants from DocumentsWriter to ByteBlockPool

TermsHashPerField is next.... feedback welcome.

simon

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

An API change to BBP that would be useful is instead of passing in the "size in bytes" to newSlice, it'd be more useful if the level and/or the size were passed in. In fact, throughout the codebase, a level, specifically the first, is all that is passed into the newSlice method. The utility of this change is, I'm recording the level of the last slice for the skip list in #3388.

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

Simon, the patch looks like it's ready for the next stage, ie, TermsHashPerField deparchment.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

We are almost there. I factored out ByteRefHash out of TermsHashPerField just having two "nocommit" parts left in the code I need to find a solution for.

there needs to be a way to communicate the byte usage up to DocumentsWriter which I haven't explored yet
textStarts in ParallelPostingsArray needs to be replaced since it is already maintained in ByteRefHash. I will need to look closer into that but suggestions are welcome. One way to do it would be to attach a reference to BRH instead of the textStart - but that is a naive suggestion since I haven't looked into that in more detail.

All tests are passing so far and TermsHashPerField looks somewhat cleaner. I will work on fixing those nocommits and run some indexing perf test against the patch.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe rename ords -> keys? And hash -> values? (The key isn't really an "ord" (I think?) because it increases by more than 1 each time... it's more like an address since it references an address in the byte-pool space).

yeah that depends how you see it - the array index really is the ord though. but I like those names. I will change.

Duh, I agree – the new names are confusing!! Sorry. I was confused... you are right that what's now called "keys" are in fact really ords! They are always incr'd by one, on adding a new one.

How about renaming key back to ord? And then maybe rename values to bytesStart? And in their decls add comments saying they are indexed by hash code? And maybe rename addByOffset -> addByBytesStart?

On the nocommit in ByteBlockPool – I think that's fine? It's an internal class....
The nocommit in BytesRefHash seems wrong? (Ie, compact is used internally)... though maybe we make it private if it's not used externally?
On the "nocommit factor this out!" in THPF.java... I agree, the postingsArray.textStarts should go away right? Ie, it's a [wasteful] copy of what the BytesRefHash is already storing?
Can we impl BytesRefHash.bytesUsed as an AtomicLong (hmm maybe AtomicInt – none of these classes can address > 2GB)? Then the pool would add in blockSize every time it binds a new block. That method (DW.bytesUsed) is called alot – at least once on every addDoc.
I'm confused again – when do we use RecyclingByteBlockAllocator from a single thread...? Ie, why did the sync need to be conditional for this class, again....? It seems like we always need it sync'd (both the main pool & per-doc pool need this)? If so we can simplify and make these methods sync'd?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

How about renaming key back to ord? And then maybe rename values to bytesStart? And in their decls add comments saying they are indexed by hash code? And maybe rename addByOffset -> addByBytesStart?

I don't like addByBytesStart I would like to keep offset since it really is an offset into the pool. addByPoolOffset? The names ord and bytesStart are a good compromise :) lets shoot for that.

On the nocommit in ByteBlockPool - I think that's fine? It's an internal class....

you refer to this: // nocommit - public arrays are not nice! ? yeah that more of an style thing but if somebody changes them its their fault for being stupid I guess.

The nocommit in BytesRefHash seems wrong? (Ie, compact is used internally)... though maybe we make it private if it's not used externally?

Ah yeah thats bogus - its from a previous iteration which was wrong as well, I will remove.

On the "nocommit factor this out!" in THPF.java... I agree, the postingsArray.textStarts should go away right? Ie, it's a [wasteful] copy of what the BytesRefHash is already storing?

Yeah that is the reason for that nocommit. Yet, I though about this a little and I have two options for this.

we could factor out a super class from ParallelPostingArray which only has the textStart int array, the grow and copy method and let ParallelPostingArray subclass it. BytesRefHash would accept this class, don't have a good name for it but lets call it TextStartArray for now, and use it internally. It would call grow() once needed inside BytesRefHash and all the other code would be unchanged since PPA is a subclass.
the other way would be to bind the ByteRefHash to the postings array which seems odd to me though.

More ideas?

Can we impl BytesRefHash.bytesUsed as an AtomicLong (hmm maybe AtomicInt - none of these classes can address > 2GB)? Then the pool would add in blockSize every time it binds a new block. That method (DW.bytesUsed) is called alot - at least once on every addDoc.

I did exactly that in the not yet uploaded patch. But I figured that it would maybe make more sense to use that AtomicInt in the allocator as well as in THPF or is that what you mean?

I'm confused again - when do we use RecyclingByteBlockAllocator from a single thread...? Ie, why did the sync need to be conditional for this class, again....? It seems like we always need it sync'd (both the main pool & per-doc pool need this)? If so we can simplify and make these methods sync'd?

man, I am sorry - I thought I will use this in #3262 in a single threaded env but if so I should change it there if needed. I was one step ahead though. I will change and maybe have a second one if needed. Agree?

simon

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I don't like addByBytesStart I would like to keep offset since it really is an offset into the pool. addByPoolOffset? The names ord and bytesStart are a good compromise lets shoot for that.

OK!

we could factor out a super class from ParallelPostingArray which only has the textStart int array, the grow and copy method and let ParallelPostingArray subclass it.

This seems good? So, this would be the "store" that BRH manages... and by subclassing it you can have other parallel arrays storing anything, indexed by ord.

I did exactly that in the not yet uploaded patch. But I figured that it would maybe make more sense to use that AtomicInt in the allocator as well as in THPF or is that what you mean?

I think we should use it everywhere to track bytes used ;)

man, I am sorry - I thought I will use this in #3262 in a single threaded env but if so I should change it there if needed. I was one step ahead though.

I will change and maybe have a second one if needed. Agree?

Ahh that's right I forgot the whole driver for this refactoring heh ;) Yeah I think leave it sync'd for now and we can test if this hurts perf in #3262? "Supposedly" uncontended locks are low-cost (but I'm not sure...).

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

we could factor out a super class from ParallelPostingArray which only has the textStart int array, the grow and copy method and let ParallelPostingArray subclass it.

This option, makes the most sense. ParallelByteStartsArray?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Next iteration - seems to be very close!

I have applied the following changes:

introduces a AtomicLong to track bytesUsed in DocumetnsWriter, TermsHashPerField, ByteRefHash and RecyclingByteBlockAllocator
Factored out a BytesStartArray class from BytesRefHash that manages the int[] holding the bytesStart offsets. TermsHashPerField subclasses and manages the ParallelPostingsArray through it.
remove remaining no-commits
made RecyclingbyteBlockAllocator synced by default (we use synchronized methods for it now)

I run a quick Wikipedia 100k docs benchmark against trunk vs. LUCENE-2662 and the results are promising.

version	rec/sec	elapsed sec	avgUsedMem
LUCENE-2662	717.30	139.41	536,682,592
trunk	682.66	146.49	546,065,344

I will run the 10M benchmark once I get back to this.

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

Simon, looks good.

Are we using:

public int add(BytesRef bytes, int code)

yet?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Are we using:...

yeah, look at TermsHashPerFields add() method

       termID = bytesHash.add(termBytesRef, termAtt.toBytesRef(termBytesRef));

simon

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I indexed 10M 1KB wikipedia docs, single threaded, and also see things a bit faster w/ the patch (10,353 docs/sec vs 10,182 docs/sec). Nice to have a refactor improve performance for a change, heh.

The avgUsedMem was quite a bit higher (1.5GB vs 1.0GB), but, I'm not sure this stat is trustworthy.... I'll re-run w/ infoStream enabled to see if anything looks suspicious (eg, we are somehow not tracking bytes used correctly).

Still, the resulting indices had identical structure (ie we seem to flush at exactly the same points), so I think bytes used is properly tracked.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Still, the resulting indices had identical structure (ie we seem to flush at exactly the same points), so I think bytes used is properly tracked.

Sorry, scratch that – I was inadvertently flushing by doc count, not by RAM usage. I'm re-running w/ flush-by-RAM to verify we flush at exactly the same points as trunk.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In RecyclingByteBlockAllocator.recycleByteBlocks, if we cannot recycle all of the blocks (ie because it exceeds maxBufferedBlocks), we are failing to null out the entries in the incoming array?

Also maybe rename pos -> freeCount? (pos is a little too generic?)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Simon, thank you for renaming the 'utf8' variables here.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Simon, thank you for renaming the 'utf8' variables here.

YW :)

In RecyclingByteBlockAllocator.recycleByteBlocks, if we cannot recycle all of the blocks (ie because it exceeds maxBufferedBlocks), we are failing to null out the entries in the incoming array?

Ahh you are right - I will fix.

Also maybe rename pos -> freeCount? (pos is a little too generic?)

I mean its internal though but I see your point.

thanks for reviewing it closely.

The avgUsedMem was quite a bit higher (1.5GB vs 1.0GB), but, I'm not sure this stat is trustworthy.... I'll re-run w/ infoStream enabled to see if anything looks suspicious (eg, we are somehow not tracking bytes used correctly).

hmm I will dig once I get back to my workstation.

simon

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK my 2nd indexing test (10M wikipedia docs, flush @ 256 MB ram used) finished and trunk/patch are essentially the same throughput, and, all flushes happened at identical points. So I think we are good to go...

Nice work Simon!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I also ran a test w/ 5 threads – they are close (22,402 docs/sec for patch, 22,868 docs/sec for trunk), and this time avgUsedMem is closer (811 MB for trunk, 965 MB for patch).

I don't think the avgUsedMem is that meaningful – it takes the max of Runtime.totalMemory() - Runtime.freeMemory() (which includes garbage I think), after each completed task, and then averages across all tasks. In my case I think it's averaging 1 measure per thread, so it's really sort of measuring how much garbage there happened to be at the time.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I instrumented trunk & the patch to see how many times we do new byte[bufferSize] while building 5M index, and they both alloc the same number of byte[] from the BBA. So I don't think we have a memory issue...

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

This patch fixes nulling out the recycled but not reused byte blocks in RecyclingByteBlockAllocator.

I thing we are ready to go I will commit to trunk soon. I don't think we need a CHANGES.TXT here - at least I can not find any section this refactoring would fit to.

simon

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Committed to trunk in rev. 1003790

@Jason: do you need that merged into Realtime-Branch or is buschmi going to do that? Otherwise I can help too

I will keep it open until this is merged into Realtime Branch

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

Simon, I'm going to get deletes working, tests passing using maps in the RT branch, then we can integrate. This'll probably be best.

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Simon, I'm going to get deletes working, tests passing using maps in the RT branch, then we can integrate. This'll probably be best.

Jason, I suggest you create a separate issue something like "Integrate BytesRefHash in Realtime Branch" and I will take care of it. I think this issue had a clear target to factor out the hash table out of TermsHashPerField and we should close it. lets use a new one to track the integration.

Thoughts?

Simon

asfimport commented 14 years ago

Jason Rutherglen (migrated from JIRA)

Lets commit this to trunk. We need to merge in all of trunk to the RT branch, or vice versa at some point anyways. This patch could be a part of that bulk merge-in, or we can simply do it now.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This was already committed to trunk...

asfimport commented 14 years ago

Mathias Walter (migrated from JIRA)

Why is this issue still open, if the patch was already committed to trunk?

asfimport commented 14 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Why is this issue still open, if the patch was already committed to trunk?

see my comment above:

I will keep it open until this is merged into Realtime Branch

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I will keep it open until this is merged into Realtime Branch

I think we should really close this since RT branch is not very active right now....

asfimport commented 13 years ago

Michael Busch (migrated from JIRA)

I think we should really close this since RT branch is not very active right now....

Sorry about that. I need to merge trunk into RT, then I'll get this change too. It's a big merge though with tons of conflicts...

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Sorry about that. I need to merge trunk into RT, then I'll get this change too. It's a big merge though with tons of conflicts...

HA! good to see you here! :) have fun with the merge

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

HA! good to see you here! have fun with the merge

He is working hard, it's 4:45 am in California :-)

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

He is working hard, it's 4:45 am in California

true but he is in germany :D

apache / lucene

BytesHash [LUCENE-2662] #3736

3262

3776

3793