chriseldredge / Lucene.Net.Linq

LINQ provider to run native queries on a Lucene.Net index
Other
151 stars 66 forks source link

Session.Add overload for new documents #72

Closed chriseldredge closed 9 years ago

chriseldredge commented 9 years ago

Committing a session deletes any documents by key whether they exist or not.

This uses as much or more CPU time as adding new documents, so when a client knows that a document is not already in the index, an optimization would be to skip deleting documents.

pdoran commented 9 years ago

Chris, presumably this enhancement would allow for better performing reindexing. The behavior I saw was as you are indexing new data in to the system, in a reindex operation, if you use the session to insert segments of the data, commit and reopen when a new segment is ready, it appears that the index operation begins to slow down. Looking at the code I believe this is because we are searching over a growing index (to delete existing items) as we try and add new items. I think adding the overload described in this issue would really help.

In your experience, what is the best way to reindex with Lucene.Net.Linq? Are you accepting pull requests for this issue?

chriseldredge commented 9 years ago

In the case of reindexing, the client may be replacing existing documents, and such an overload would be inappropriate and dangerous. This overload is proposed for initial indexing, or advanced cases where the client has some way of knowing that a document with a given ID is definitely not already present in the index.

Your question is a valid one though. I've noticed that deleted documents tend to stick around in my applications and my indexes explode up to 15GB. After optimizing the index it shrinks back down to 400MB. The default settings for merge factor do not seem to be working well on Lucene.Net 3.0.3. I haven't spent much time trying to figure out why or what should be done.

The best option I'm aware of at the moment, is to periodically call LuceneDataProvider.IndexWriter.Optimize(), perhaps after completing a reindexing operation.

pdoran commented 9 years ago

Yes, good point on the expected behavior from the client. Interestingly enough, when we first started using the library we didn't realize it was deleting existing documents from the index. So in our code when we are doing updates we were calling query and delete. That meant we incurred the cost of searching for those documents twice.

I took a stab at implementing the enhancement by placing a new operation on the ISession.AddWithoutDelete(params T[] items); An overload on Add did not feel quite right as you would have some boolean flag as the first param and then a params of T. After letting it set in for a while I felt that perhaps setting a session Add Operation Intent when you open the session may be a little more clear. It would help those of us new to the library to know under the hood an Add is really an AddOrUpdate. If you selected an intent on the session of AddWithoutDelete it would essentially skip the step of marking the added documents for delete.

From what I've read, Optimize is bad, and that ideally you would let Lucene manage it's merges when it feels it can (based on the MergePolicy/MergeFactor). A MergeFactor of 10 (the default) and a RamBufferSize of 16mb (default from IndexWriter.DEFAULT_RAM_BUFFER_SIZE_MB) means that your indexes would go 160mb -> merge, 1,600 mb -> merge, 16,000 mb -> merge. So your example of 15gb -> 400mb might be explained by the defaults (it had not quite hit the 16,000mb size and therefore a merge had not been triggered).

Curious on your thoughts.

chriseldredge commented 9 years ago

The intent of Lucene.Net.Linq is that each Document/Object be uniquely identifiable. Lucene.Net imposes no such constraint, but Lucene.Net.Linq does additional book keeping to ensure queries behave intuitively, as in the case of provider.AsQueryable<Foo>().SingleOrDefault(i => i.Key == "something") being guaranteed to find at most one record.

That's why this feature is an advanced performance idea and I still have reservations about adding it. I can imagine clients not understanding the implications and trying to use it because the comments say it's faster.

Using an AddOperationIntent enum may be a good choice, though I wonder which method names communicate the most explicitly that you'd better know what you're doing to use the non-default behavior.

pdoran commented 9 years ago

Isn't naming things one of the hardest problems in computer science?

All jokes aside, as I thought about it a little yesterday, would providing an OpenSession overload that required a user to provide some sort of implementation of a new interface: ICommitBehavior provide a little more safety and the flexibility desired by this issue? It always seems to me that when you require someone to implement an interface, find that particular overload and then test against their assumptions with the interface, you typically weed out the naive user. Plus, that follows the open closed principal nicely.

I tried to wrap my head around what the interface would look like, but I couldn't really come to a cohesive solution. I suppose that might be because the algorithm to handle the internal commit is fairly specific and doesn't really have any logical points at which exposing the behavior externally, through this interface, would really be desired.

Adding an overload to the OpenSession gets fairly cumbersome due to the fact that there are already a fair amount of overloads. I think either adding an additional method to ISession (as I did in my commit here: https://github.com/pdoran/Lucene.Net.Linq/commit/45afaa1bb7a25a5139fed4498c03fd27093e35ea) or perhaps providing an overload with an Enum are the two better options.

Alternatively, you could also add a new type of session both implementing ISession, but with different behavior. Similar to nHibernate having the StatelessSession and Session. I don't know what a good name would be though (naming problem again). You could move most of the LuceneSession in to an abstract base class and then implement two inherited types DeleteOnAddLuceneSession NoDeleteOnAddLuceneSession, implementing one or two abstract methods.

chriseldredge commented 9 years ago

I agree that boolean parameters to methods are bad for readability, but I'm finding that I prefer an overload of Add that takes an OperationIntent enum as the first parameter. The existing Add method would simply delegate to the more explicit method with the existing default behavior.

I'm still toying with the name OperationIntent. What do you think about:

public enum KeyConstraint
{
  None,
  Unique
}

The above example would include xml comments to explain where the enum is used, what the default is and when it would be appropriate to use KeyContraint.None.

The problem with subclassing Session or creating overloads is that a client may want to mix behaviors in the a single session. My primary consumer, NuGet.Lucene would definitely need both behaviors available on a single session.

pdoran commented 9 years ago

KeyConstraint sounds really nice. It gives across the idea of what you really intend for the session to handle (by default we make sure keys are unique), however you can override that behavior by ignoring the unique constraint.

Yes, having mixed behavior in a single session is much nicer. I know in our use case we would actually prefer that.

chriseldredge commented 9 years ago

Cool. If could apply those changes to your commit, squash it and submit a PR I would be happy to merge.

pdoran commented 9 years ago

I'll get that in today!

chriseldredge commented 9 years ago

Commit b925f4e26ffc870221892d81b7c7c2e76d2ba481 brings this to master. I'll keep this issue open until a release is published.

chriseldredge commented 9 years ago

Released in https://github.com/themotleyfool/Lucene.Net.Linq/releases/tag/v3.5.0