apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Some house cleaning in addIndexes* [LUCENE-2455] #3529

asfimport closed this issue 14 years ago

asfimport commented 14 years ago

Today, the use of addIndexes and addIndexesNoOptimize is confusing - especially about when to invoke each. Also, addIndexes calls optimize() at the beginning, but only on the target index. It also includes the following jdoc statement, which, as I understand the code, is wrong: "After this completes, the index is optimized." – optimize() is called at the beginning, not at the end.

On the other hand, addIndexesNoOptimize does not call optimize(), and relies on the MergeScheduler and MergePolicy to handle the merges.

After a short discussion about that on the list (thanks Mike for the clarifications!) I understand that there are really two core differences between the two: addIndexes(IndexReader...) can accept arbitrary IndexReader extensions, while addIndexesNoOptimize(Directory...) relies on the MergeScheduler and MergePolicy, so its merges can run concurrently.

This issue proposes the following:

  1. Clear up the documentation of each, clearly spelling out in the javadocs the pros/cons of calling them.
  2. Rename addIndexesNoOptimize to addIndexes
  3. Remove optimize() call from addIndexes(IndexReader...)
  4. Document that clearly in both, w/ a recommendation to call optimize() beforehand on any of the Directories/Indexes if it's a concern.

That way, we maintain all the flexibility in the API - addIndexes(IndexReader...) allows for using IR extensions, while addIndexes(Directory...) is considered more efficient: it allows the merges to happen concurrently (depending on MS) and also factors in the MP. So unless you have an IR extension, addDirectories is really the one you should be using. And you have the freedom to call optimize() before each if you care about it, or not if you don't. Either way, incurring the cost of optimize() is entirely in the user's hands.
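To make the proposed contract concrete, here's a minimal usage sketch (assuming the rename goes in as proposed; the setup uses the new IndexWriterConfig ctor and is illustrative, not the final committed API):

// 'target', 'src1', 'src2' are org.apache.lucene.store.Directory instances;
// 'version' and 'analyzer' as usual.
IndexWriter writer = new IndexWriter(target, new IndexWriterConfig(version, analyzer));

// Optional: optimize() the source indexes beforehand, if fully merged
// input matters to you - addIndexes itself will no longer do it.
writer.addIndexes(src1, src2);   // renamed from addIndexesNoOptimize; merges
                                 // are governed by the MergePolicy/Scheduler

writer.optimize();               // optional, entirely the caller's choice
writer.close();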

BTW, addIndexes(IndexReader...) uses neither the MergeScheduler nor the MergePolicy, but rather calls SegmentMerger directly. This might be another place for improvement. I'll look into it, and if it's not too complicated, I may cover it by this issue as well. If you have any hints that can give me a good head start on that, please don't be shy :).


Migrated from LUCENE-2455 by Shai Erera (@shaie), resolved May 27 2010. Attachments: index.31.cfs.zip, index.31.nocfs.zip, LUCENE-2455_3x.patch (versions: 5), LUCENE-2455_trunk.patch

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Remove optimize() call from addIndexes(IndexReader...)

This still makes me nervous. Yeah, it's bad that this method does optimize() now. But if we remove it, it's bad that this method can attempt a ridiculously immense merge, since it [naively] just stuffs everything in and does one merge. Ie, both are bad.

Maybe... we could do this: only merge the incoming IndexReaders, appending a new segment to the end of the index? Ie, do no merging whatsoever of the current segments in the index.

Yes, this can result in "unbalanced" segments (ie, a huge segment appears after the long tail of level 0 segments), but the merge policy can handle this – it'll work out whatever merges are then necessary to get this segment onto the level that roughly matches its size.

So unless you have an IR extension, addDirectories is really the one you should be using.

You mean addIndexes(Directory..)?

BTW, addIndexes(IndexReader...) uses neither the MergeScheduler nor the MergePolicy, but rather calls SegmentMerger directly. This might be another place for improvement. I'll look into it, and if it's not too complicated, I may cover it by this issue as well. If you have any hints that can give me a good head start on that, please don't be shy.

This would be best of all :) But it's tricky, because our MP/MS assume they are working w/ a SegmentInfo. But maybe it could somehow be made to work – eg, an IR does give us maxDoc and numDocs (so we can know the deleted doc count). But, eg, LogByteSizeMergePolicy goes and computes the total byte size of the segment (via SegmentInfo), which we cannot do from an IR.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

You mean addIndexes(Directory..)?

Yes, copy-paste error.

Maybe... we could do this: only merge the incoming IndexReaders, appending a new segment to the end of the index?

I like it. IMO, that's what the method should do anyway, for better performance and service to the users. If I'm adding indexes, that doesn't mean I want a whole merge process to kick off. If I want that, I can call maybeMerge or optimize afterwards.

Basically, what I would like to add (and I'm not sure it belongs in this issue) is a "super fast" addIndexes method, something like registerIndexes, which doesn't even traverse the posting lists, remove deleted docs etc. - it simply registers the new segments in the Directory. If needed - do a bulk copy of the files and update segments*. Simple as that. Maybe it does fit in this issue, as part of the general "house cleaning"?
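Roughly, what I have in mind (a sketch only - registerIndexes, copySegmentFiles and renamed are hypothetical names, and this glosses over deletions, doc stores etc.):

// Hypothetical "super fast" registration: no postings traversal, no
// deleted-doc removal - just bulk-copy the foreign segment files and
// record them in the local SegmentInfos.
public synchronized void registerIndexes(Directory... dirs) throws IOException {
  for (Directory dir : dirs) {
    SegmentInfos sis = new SegmentInfos();
    sis.read(dir);                              // read the foreign segments_N
    for (SegmentInfo info : sis) {
      String newName = newSegmentName();        // rename into our namespace
      copySegmentFiles(dir, info, newName);     // hypothetical: bulk file copy
      segmentInfos.add(renamed(info, newName)); // hypothetical: info w/ new name
    }
  }
  checkpoint(); // publish a new segments_N referencing the copied segments
}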

I will look more closely into supporting MP + MS w/ addIndexes(readers). Can't promise anything as I learn the code as I go :).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I agree, addIndexes should be minimal in the work it does...

But bulk copy of the files isn't really possible for addIndexes(IR...) in general, since the readers can be arbitrary (eg FilterIndexReader).

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok. But since addIndexes(IR) is for IR extensions only, I think the number of people that will be limited by it is very low.

But, why wouldn't they be able to use the Directory... version of the method? Since it's a bulk copy, we don't need IR methods. Maybe just call dir.copyTo or something of that sort? The method will only be asked to copy files (in case they exist elsewhere). I was thinking of introducing just a Directory version of such a method.

Basically, if you use NoMP and call addIndexesNoOptimize today, you get half of what I want, as only resolveExternals will be called. What I want is for resolveExternals to be even faster - a plain and shallow "resolution".

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But, why wouldn't they be able to use the Directory... version of the method?

Adding indexes using FilterIndexReader is useful – eg look @ how the multi-pass index splitter tool works.

What I want is for resolveExternals to be even faster - a plain and shallow "resolution".

For addIndexes(Directory), assuming the codecs are identical (the "write" codec equals the codec used to write the external segment), and assuming the doc stores of the external segment are private to it, I think we should be able to do a straight file-level copy, but renaming the segment in the process?
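In code, the gate would be roughly this (a sketch; codecOf, sharesDocStore and copyAndRename are hypothetical helpers, not real IW methods):

// Bulk file copy is only safe when the codec matches and the doc store
// is private to the segment; otherwise fall back to the slow path.
boolean canBulkCopy = codecOf(info).equals(writeCodec)   // same "write" codec
                   && !sharesDocStore(info);             // private doc stores
if (canBulkCopy) {
  copyAndRename(srcDir, info, newSegmentName());  // straight file-level copy
} else {
  // fall back to the existing resolveExternalSegments path for this segment
}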

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Adding indexes using FilterIndexReader is useful

I'm not against that, Mike. addIndexes should allow for both IndexReader and Directory. It's registerIndexes (or whatever name we come up with) which should work with Directory only, and then, even if the app calls addIndexes with its own custom IR, it can still call registerIndexes w/ the Directory only, to do that fast copy/registration, since no IR method will be involved in the process.

So let's not confuse the two - the addIndexes methods will exist and work as they do today. registerIndexes will be a new one.

assuming the codecs are identical (the "write" codec equals the codec used to write the external segment), and assuming the doc stores of the external segment are private to it

Right. Thanks for pointing that out, as it will become an important NOTE in the documentation. This method (registerIndexes) is definitely for advanced users, who have to know exactly what's in the foreign indexes. For example, I need this because I'm building several indexes on several nodes and then I want to add them to a central/master one. I know they don't have deletions, and each is already optimized. Therefore traversing the posting lists (as fast as it would be) is completely unnecessary.

but renaming the segment in the process?

Sure! I think we should really 'register' them in the Directory, as if they were newly flushed segments. I'm sure you have a general idea of how this can be done? Through SegmentInfos or something?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

While changing addIndexes(reader), I've noticed it first obtains the read lock and then calls startTransaction(true). In between, it calls flush + optimize, which I've removed (as we no longer want to do that). When I ran the tests, TestIndexWriter.testAddIndexesWithThreads failed on the assert in startTransaction about numDocsInRam != 0. That's expected, as I no longer call flush. The failure does not always occur.

In addIndexes(Dir), flush is called before startTransaction. But it makes sense to do it there, as the local segments are also merged. In the new addIndexes(reader) they won't be, and so I wonder whether the flush is needed there at all.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch handles the changes for 3x:

You should "svn mv lucene/src/test/org/apache/lucene/index/TestAddIndexesNoOptimize.java lucene/src/test/org/apache/lucene/index/TestAddIndexes.java" before you apply the patch.

The patch is much smaller than it looks - it's the rename of TestAddIndexesNoOptimize that takes up most of the space.

As I've mentioned before, I'm not sure about addIndexes calling startTransaction w/o flushing. Even though the tests pass, this seems wrong. So I'd appreciate a review.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think you should still call flush, and still start/commitTransaction – addIndexes is "all or nothing", which is why we have those transaction methods. Ie, on exception, the rollbackTransaction puts the original segments back.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Shai! Only a small typo in CHANGES (unles -> unless).

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I see. I understand why it's called in addIndexes(Dir), because the local segments are also touched. But now in the Reader version, they aren't. So it looked odd to me that we flush whatever is in RAM. I think you said once that addIndexes should have done the merge outside, adding the new segment when it's done?

But if you think that flushing the RAM buffer is OK even though its content isn't touched, then I'll change it.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I understand why it's called in addIndexes(Dir), because the local segments are also touched. But now in the Reader version, they aren't. So it looked odd to me that we flush whatever is in RAM.

Yeah maybe we should no longer flush (but still call start/commitTransaction). I think there may've been a reason to flush first (besides that we were also merging local segments)... but I can't remember it. If you comment out that assert (and the corresponding assert for deletions) do any tests fail?

I think you said once that addIndexes should have done the merge outside, adding the new segment when it's done?

Yes, I would love to fix this – it'd mean we would not need the start/commit/rollbackTransaction code.

Ie, we play a dangerous game now, where addIndexes is allowed to muck with the in-memory SegmentInfos before it's complete. It'd be better if all merging happened outside of its SegmentInfos, and only when addIndexes finished, it'd atomically commit to SegmentInfos.

This would then allow commit() to run immediately, not having to wait for any running addIndexes to finish first. And we would not need to block add/updateDocument nor deleteDocuments while addIndexes is running.

So, actually, I think in addIndexes(IR...) you should not use the transaction logic at all? Just do the merge externally & commit in the end? (And try not flushing as well.).

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ha :), I knew this would get more complicated and interesting... So basically, it feels to me that if we had registerIndexes, we could, in addIndexes, merge outside IW and then call register?

So far, tests pass w/ startTransaction. But that test is multi-threaded, so it may be a concurrency issue. I'll try to do the addIndexes outside IW and then commit the new segment. If that is straightforward, then I think I'll understand better how to develop registerIndexes.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ha :), I knew this would get more complicated and interesting...

It always does!

So basically, it feels to me that if we had registerIndexes, we could, in addIndexes, merge outside IW and then call register?

Hmm, but register will be an external API for copying over segments in a foreign directory, right? (And the segments must be renamed.)

Vs these segments which will be in our directory already, with the right segment name, and just need to be committed to the segmentInfos?

So far, tests pass w/ startTransaction.

You mean w/o the flush?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

You mean w/o the flush?

Yes. The start/commit transaction code looks like this:

// declarations added here for readability; in IndexWriter they appear
// earlier in the method
boolean success = false;
String mergedName;
SegmentMerger merger;
SegmentInfo info;

startTransaction(false);

try {
  mergedName = newSegmentName();
  merger = new SegmentMerger(this, mergedName, null);

  for (IndexReader reader : readers)      // add new indexes
    merger.add(reader);

  int docCount = merger.merge();                // merge 'em

  synchronized(this) {
    info = new SegmentInfo(mergedName, docCount, directory, false, true,
                  -1, null, false, merger.hasProx());
    setDiagnostics(info, "addIndexes(IndexReader...)");
    segmentInfos.add(info);
  }

  // Notify DocumentsWriter that the flushed count just increased
  docWriter.updateFlushedDocCount(docCount);

  success = true;
} finally {
  if (!success) {
    rollbackTransaction();
  } else {
    commitTransaction();
  }
}

So this looks like it already does the merge "on the side", and when it's done, the new segment is registered?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

In fact, I've created newAddIndexes (just for the review), which works like this:

  public void newAddIndexes(IndexReader... readers) throws CorruptIndexException, IOException {

    ensureOpen();

    try {
      String mergedName = newSegmentName();
      SegmentMerger merger = new SegmentMerger(this, mergedName, null);

      for (IndexReader reader : readers)      // add new indexes
        merger.add(reader);

      int docCount = merger.merge();                // merge 'em

      SegmentInfo info = null;
      synchronized(this) {
        info = new SegmentInfo(mergedName, docCount, directory, false, true,
            -1, null, false, merger.hasProx());
        setDiagnostics(info, "addIndexes(IndexReader...)");
        segmentInfos.add(info);
      }

      // Notify DocumentsWriter that the flushed count just increased
      docWriter.updateFlushedDocCount(docCount);

      // Now create the compound file if needed
      if (mergePolicy instanceof LogMergePolicy && getUseCompoundFile()) {

        List<String> files = null;

        synchronized(this) {
          // Must incRef our files so that if another thread
          // is running merge/optimize, it doesn't delete our
          // segment's files before we have a chance to
          // finish making the compound file.
          if (segmentInfos.contains(info)) {
            files = info.files();
            deleter.incRef(files);
          }
        }

        if (files != null) {
          try {
            merger.createCompoundFile(mergedName + ".cfs");
            synchronized(this) {
              info.setUseCompoundFile(true);
            }
          } finally {
            deleter.decRef(files);
          }
        }
      }
    } catch (OutOfMemoryError oom) {
      handleOOM(oom, "addIndexes(IndexReader...)");
    } finally {
      if (docWriter != null) {
        docWriter.resumeAllThreads();
      }
    }
  }

Question: if we've just added the new SI to segmentInfos, why do we sync on this and check if it exists (when we create the compound file)? Is it because there could be a running merge which will merge it into a new segment before we reach that point?

What do you think? Is that what you had in mind about merging on the side and committing in the end?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Question: if we've just added the new SI to segmentInfos, why do we sync on this and check if it exists (when we create the compound file)? Is it because there could be a running merge which will merge it into a new segment before we reach that point?

Yes, exactly.

What do you think? Is that what you had in mind about merging on the side and committing in the end?

Yup! This looks great.... though I think you should move the docWriter.updateFlushedDocCount call into the sync block above it? We didn't have to do this before because we blocked all add/updateDocument calls.
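Concretely, against your snippet that would be (sketch):

synchronized(this) {
  info = new SegmentInfo(mergedName, docCount, directory, false, true,
      -1, null, false, merger.hasProx());
  setDiagnostics(info, "addIndexes(IndexReader...)");
  segmentInfos.add(info);
  // moved inside the sync: add/updateDocument are no longer blocked while
  // addIndexes runs, so the flushed doc count must be published atomically
  // with the new SegmentInfo
  docWriter.updateFlushedDocCount(docCount);
}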

Also, you shouldn't call docWriter.resumeAllThreads (you didn't pause them).

So this change is a great step forward in concurrency of addIndexes(IndexReader...)!

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Also, you shouldn't call docWriter.resumeAllThreads (you didn't pause them).

Oops, missed that :). Thanks!

I'll replace addIndexes w/ this code and run tests to check how it flies.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I've looked into implementing registerIndexes, and this is the approach I'd like to take:

A few things:

  1. Does that sound reasonable? Am I missing something?
  2. Directory exposes a copyTo(Dir, Collection) which I thought to use. But the files are copied to the target Dir w/ their current names - while I need to copy them over w/ their new names.
    • Adding rename to Dir feels wrong and dangerous to me
    • Adding copyFile(Dir, String old, String new) seems ok
    • Adding a variant of copyTo which accepts a Collection of the new names - the src and new should align. This also seems ok to me.

I'd like to use Directory for the copy, since impls of Dir may do the copy very efficiently (e.g. FSDir vs. RAMDir), and I don't want to use IndexInput/Output for that.
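For illustration, the generic fallback copyFile could look like this (a sketch; the point of putting it on Directory is that impls like FSDir could override it w/ something much faster than this IndexInput/Output loop):

// Generic fallback: stream oldName from src into newName in dest, using
// org.apache.lucene.store.IndexInput/IndexOutput.
public static void copyFile(Directory src, Directory dest,
                            String oldName, String newName) throws IOException {
  IndexInput in = src.openInput(oldName);
  IndexOutput out = dest.createOutput(newName);
  try {
    byte[] buf = new byte[4096];
    long left = in.length();
    while (left > 0) {
      int chunk = (int) Math.min(buf.length, left);
      in.readBytes(buf, 0, chunk);
      out.writeBytes(buf, 0, chunk);
      left -= chunk;
    }
  } finally {
    in.close();
    out.close();
  }
}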

Do you know of another way I can achieve that? I only want to copy the actual segment files, w/o .gen and segments_N, so calling SI.files() seems ok?

Another question that popped into my head was about consistency of the incoming Dirs vs. the local one, w.r.t. CFS files - should I worry about that? I think not, because today one can create an index w/ CFS, then turn it off, and some segments will be compound and others not?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I've looked into implementing registerIndexes, and this is the approach I'd like to take:

This looks good.

Though if the src segments share doc stores, you can't do a simple copy (I think you have to fall back to the resolveExternalSegments approach for such segments).

Does that sound reasonable? Am I missing something?

I think this should work!

If the src segments are an older index rev, I think you are still OK. They will just remain "old" on copy, and merge will eventually migrate them forward.

For trunk... you should note in the jdocs that no codec conversion takes place. So the CodecProvider used in IW (and later used to read this index) must know how to provide the codec used by the src segments.

Directory exposes a copyTo(Dir, Collection) which I thought to use. But the files are copied to the target Dir w/ their current names - while I need to copy them over w/ their new names.

  • Adding rename to Dir feels wrong and dangerous to me
  • Adding copyFile(Dir, String old, String new) seems ok
  • Adding a variant of copyTo which accepts a Collection of the new names - the src and new should align. This also seems ok to me.

I'd like to use Directory for the copy, since impls of Dir may do the copy very efficiently (e.g. FSDir vs. RAMDir), and I don't want to use IndexInput/Output for that.

Do you know of another way I can achieve that? I only want to copy the actual segment files, w/o .gen and segments_N, so calling SI.files() seems ok?

SI.files() should be fine.

I think falling back to copyFile is best? Then copyTo could use it.

Another question that popped into my head was about consistency of the incoming Dirs vs. the local one, w.r.t. CFS files - should I worry about that? I think not, because today one can create an index w/ CFS, then turn it off, and some segments will be compound and others not?

I think that's fine, but we should advertise it in the jdocs.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

So it sounds like addIndexes should really be that registerIndexes. Specifically it should do a quick and dirty copy of all segments that don't share doc stores, and then resolveExternals those that do? Maybe we can get rid of those transactions and not block add/update/delete/commit/addIndexes attempts anymore?

Usually, I expect this to be a win-win. In cases where you add Directories w/ plenty of segments that share doc stores, it will be slower, because we won't utilize MP and MS. But this can be improved in the future as well, by e.g. just taking care of the shared doc stores and not removing deleted document entries, etc.

But .. it will prevent (in some cases) the use of PayloadProcessorProvider ... hmm. So it seems we do need a separate registerIndexes for the really quick & dirty addIndexes operation.

BTW, Directory.copyTo* should be replaced w/ Directory.copy(Dir, File, File), for a couple of reasons:

Instead, the user should do this logic outside - call dir.listAll() (w/ and w/o a FilenameFilter), copy the files of interest, and be allowed to rename them in the process.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

So ... after I slept on it, I don't think I can easily let go of the so-near victory – having addIndexes not block anything, run on the side, and be as efficient as possible – nor give up a chance to do some serious code cleanup :). So I'd like to propose the following; we can discuss them one by one, but they all aim at the same thing:

  1. addIndexes(Directory ...) will be a quick & dirty copy of segments. No deleted docs removal in the process, no merges.
    • It's clear how this can be done for segments that don't share doc stores.
    • What isn't clear to me is why this can't work for segments that do share doc stores - can't we copy all of them at once, and add to segmentInfos once the segments + their doc store have been copied? So, if I have 5 segments w/ the following doc stores: 1-2, 3 and 4-5 (3 stores for 5 segments), can't I copy 1+2+store, 3, 4+5+store? Wouldn't that work? I'll give it a shot (see the sketch after this list).
  2. PayloadProcessorProvider won't be used in the process. If you want it, you can set it on the source writer and call optimize(). Performance-wise it will be nearly identical - instead of processing the postings + payloads during addIndexes, you'll do that during optimize() (all segments will be processed anyway), and addIndexes will do a bulk IO copy, which on most modern machines is really cheap.
    • Furthermore, you will end up w/ just one segment, which means it can be copied at once for sure.
    • It will also simplify PPP – no need for DirPP anymore. PPP would get a Term and return a PP for that term, as it will always work on just one Directory.
    • People can still use it with the target IW if they want, but not through addIndexes.
  3. Apps can call maybeMerge or optimize() following addIndexes, if they want to.
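A rough sketch of the doc-store grouping from point 1 (groupByDocStore, copyFilesOf and renameAll are hypothetical helpers):

// Segments sharing a doc store are copied as one unit; each unit is only
// published in segmentInfos after all of its files (segments + store)
// have been fully copied.
Map<String, List<SegmentInfo>> units = groupByDocStore(srcInfos);
for (List<SegmentInfo> unit : units.values()) {
  copyFilesOf(unit);                       // bulk copy, no postings walk
  synchronized (this) {
    segmentInfos.addAll(renameAll(unit));  // register only once complete
  }
}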

What remains is addIndexes(IndexReader...), and I'm not sure why this cannot be removed. In the back of my head I remember a discussion about it once, so I guess there is a good reason. But at least from what I see now, armed w/ my use cases only, it seems like even if you use an extension of IndexReader you should still be able to do a bulk copy? Hmm ... unless your extension assumes a different postings structure or something like that, which the regular SegmentReader won't know about – then during addIndexes those postings are converted.

But how common is this? And wouldn't it be better if such indexes were migrated beforehand? I mean, after addIndexes those postings won't retain the custom IndexReader-assuming format anyway? Or is there another reason?

If we go with that, then SegmentMerger can be simplified as well, assuming only SegmentReader?

What do you think?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

What isn't clear to me is why this can't work for segments that do share doc stores

You are right!

If we copy over the doc stores, also renaming them, and fix up the incoming SegmentInfo to reference the newly named one, this should work fine!

PayloadProcessorProvider won't be used in the process.

This (and not needing DirPP anymore) is a great simplification.

What remains is addIndexes(IndexReader...) and I'm not sure why this cannot be removed.

I think we still need it... look at how multi-pass index splitter (contrib/misc) works.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

look at how multi-pass index splitter (contrib/misc) works.

I see ...

I think this can be achieved by also deleting docs that fall outside the split range and calling optimize() / expungeDeletes(). So, you copy the index once (using one of the copying addIndexes methods), delete the docs that you don't care about, and optimize/expunge. Then copy the index again and repeat the process w/ a different range of ids. In fact, that's more or less what the method does - only it calls addIndexes to copy into an existing/empty Directory.
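In rough code, per split range (splitDir, sourceDir, conf and range are assumed to be set up elsewhere; deleteDocsOutside is hypothetical - e.g. deleteDocuments on a stored id field):

// For each wanted range: bulk-copy the whole source index, then delete
// everything outside the range and expunge the deletions.
IndexWriter w = new IndexWriter(splitDir, conf);
w.addIndexes(sourceDir);       // the quick file-copying addIndexes(Directory...)
deleteDocsOutside(w, range);   // hypothetical: e.g. w.deleteDocuments(rangeQuery)
w.expungeDeletes();            // physically drop the unwanted docs
w.close();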

So I think it can be changed to not call addIndexes? The only problem is that now the method adds the docs into the target Directory directly, while in the other solution it would need to create a Directory on the side w/ the range of requested IDs and then copy that one into the target Dir?

But I'm not sure it's worth it to keep addIndexes(IndexReader...) and the relevant code in SM which handles the non-SegmentReader readers. Of course, this is just one scenario, but if that's our justifying case, then I'm not sure how justifying it is.

asfimport commented 14 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

FYI, I'm working on a different version of IndexSplitter that uses the logic in SegmentMerger directly, without going through IW.addIndexes(FilterIndexReader).

However, there are other applications for which this API is crucial, e.g. #2887 or IndexSorter (in Nutch) - in short, any client app that wants to merge in index data that does not correspond 1:1 to a Directory. For this reason I think the pair of IndexWriter.addIndexes(IndexReader...) and the FilterIndexReader abstraction is extremely useful, and that IndexWriter.addIndexes(Directory...) is not a sufficient replacement.

(edit: unless there is a better user-level API based on the flex producers/consumers...)

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

any client app that wants to merge in index data that does not correspond 1:1 to a Directory

I understand that. But when you call addIndexes w/ such IndexReaders, all they do is read the postings. Those are written down using the logic of the target IndexWriter. So I wonder how important it is for addIndexes to be in place, rather than, say, rewriting those indexes before they are added? I mean, all that addIndexes will do is call SegmentMerger and iterate over the readers and segments, merging the posting lists ...

I don't object to that API. But SM is used extensively and is more of a main-path code, while addIndexes(IndexReader) is something only a few out there use. Yet it affects everyone else who reads SM code, as well as those of us who are confused about which method to call (Reader or Directory) ... It almost feels like such an operation - the relevant code from SM which handles non-SegmentReaders - should be extracted to a utility or something. But if I'm the only one that's bothered by it, then so be it. I can take care of the rest now and resolve that one later.

asfimport commented 14 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

I understand - see the edited section in my comment: I think that extracting this non-SR code would be great. In fact, I would be glad if there were an easier-to-control API that allows us to directly stream-process postings / stored fields / tvf-s / etc. in a way that results in a functioning index. Take for example #2887 - the only reason it uses addIndexes(IndexReader) is that there was no easy way to modify postings in a way that would still result in a valid index, and there was no other API to add artificially created postings (i.e. not coming from a Directory) to a target index.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

So can't the PrunningReader run on the side, converting the postings to whatever they're supposed to look like in the index they are about to be added to, and then call addIndexes w/ the Directory to do the bulk copy? I mean, instead of looking for a standalone tool, perhaps this can be solved on a case-by-case basis? Of course, if this can be made generic enough, then we can add it as a core utility, or an IW method.

asfimport commented 14 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

So can't the PrunningReader run on the side, converting the postings to whatever they're supposed to look like

Erhm ... Currently the only way in the user API to write out existing postings (no matter how they were created) is to use IndexWriter.addIndexes(IndexReader). We can read postings just fine, using the various *Enum classes that we can obtain from IndexReader, but there are no comparable high-level output methods - Codecs and other flex classes are IMHO too low-level.

Also, with large indexes the amount of IO/CPU for writing out a Directory and reopening it is non-trivial - it's much more efficient to do this via streaming from the original, already open index.

Also, if we remove this method, then FilterIndexReader may as well go too, because it loses its utility.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok, let's keep addIndexes(IndexReader) around. This means, though, that we cannot simplify the PPP API. We'll still need DirPP.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I've started to implement addIndexes(Directory...) as agreed - copy files from the incoming ones into the local directory, while renaming them on the fly. This works really well with non-CFS segments: a new segment name is generated, the incoming files are renamed, and it all flies smoothly (didn't test w/ deletions yet) - even shared doc stores work great.

But with CFS it doesn't work well, because CFS writes the file names in the CFS file itself, and so even if the segment is renamed to _5 (for example), the names that are written in the file are _2.* (for example), and openInput fails to locate them. To overcome this, I propose we stop storing the full file names inside the CFS, and instead derive them from the compound file's own name plus each entry's extension.

That will ensure that files are named following a certain convention, which we can rely on in CFR. I don't think it's too much to ask. The CFS itself already knows the segment name - it's named after it. So there's no value in storing the names of the files it holds.

For 3x it should work well, b/c we don't allow for custom index files. For trunk we'll ask codecs to go through IFN to name files - so one can create mycustom.file through IFN, and it will be called _x_mycustom.file.
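To illustrate the convention (illustrative code only - the exact stored form is the part we'd need to agree on):

// If the CFS stores only each entry's suffix/extension, CFR can rebuild
// the full name from the compound file's own name:
String segName = cfsName.substring(0, cfsName.indexOf('.'));  // "_5.cfs" -> "_5"
String entryName = segName + "." + storedExtension;           // e.g. "_5.frq"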

What do you think?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ahh, sneaky that CFS still embeds the old segment's name (you're right). The only other option would be to rewrite the CFS header, but then it's not easy to do a bulk copy. So I like your approach!

We should document in Codec.java that this (you must generate your filenames via IFN's APIs) is a requirement for any custom files your codec wants to store in the index.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch includes the following:

All tests pass except for TestIndexWriter.testAddIndexesWithThreads. I've debugged it, but cannot find the reason. addIndexes copies all segments before it adds them to the writer's segmentInfos. Maybe I need to use start/commit transaction on that part only, to lock all ops? I don't see why, but maybe?

Also, TestAddIndexes.testWithPendingDeletes2() (and some others) failed before I added a call to flush in addIndexes. It seems that w/o it, existing buffered deleted docs are ignored after addIndexes returns (even when no multi-threading is involved). Can someone please confirm that?

Also, I cannot simplify PPP (to remove DirPP) because we kept addIndexes(Reader...). It's an annoyance if you don't call this method (you always need to return a DirPP for the target Dir - if you want to use it), but maybe not so bad ...

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Attached patch includes:

The beauty of all this is that IndexWriter no longer needs those transactions, and is now 500 lines of code + jdoc down!

After we've iterated through this patch, I'll do the same changes on trunk. Backwards support should be much easier there, because we will provide an index migration tool anyway, and so CFW/CFR can always assume they're reading the latest version (at least in 4.0). CFW should probably use CodecUtils in trunk - it cannot be used in 3x because of how CFW works today (writing a VInt first, while CodecUtils assumes an Int), and I don't think it's healthy to make so many changes on 3x.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks great! So awesome seeing all the -'s in IW.java!! Keep it up :)

And it's great that you added 3.0 back compat case to TestBackwardsCompatibility...

Some feedback:

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Backwards support should be much easier there, because we will provide an index migration tool anyway, and so CFW/CFR can always assume they're reading the latest version (at least in 4.0).

Hmm, I think we should do live migration for this (ie, not require a migration tool to fix your index)? This is trivial to do on the fly, right (ie, as you've done in 3.x)?

CFW should probably use CodecUtils in trunk - it cannot be used in 3x because of how CFW works today - writing a VInt first, while CodecUtils assumes an Int. And I don't think it's healthy to do so much changes on 3x.

Hmm yeah because of the live migration I think CodecUtils is not actually a fit here (trunk or 3x).

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I'm not sure about the live migration, Mike. First, because all the problems I've mentioned about CodecUtils in 3x will apply to live migration of 3.x indexes in 4.0 code. Second, if everyone who upgrades to 4.0 will need to run the migration tool, then why do any work to support online migration? What's the benefit? Can you think of a case where someone upgrades to 4.0 w/o migrating his indexes (unless he reindexes, of course, in which case there is no problem)?

I just think it's weird that we support online migration together w/ a migration tool. If we migrate the indexes w/ the tool to include the new format of CFS, then the online migration code won't ever run, right? And not doing this in the tool seems just a waste? I mean the user already migrates his indexes, so why incur the cost of an additional online migration?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sorry – for each major release, I think it'll be either live migration or offline migration, but not both.

So far for 4.0 we haven't had a major enough structural change to the index format, that'd make live migration too hard/risky, so, so far I think we can offer live migration for 4.0.

The biggest change was flex, but it has the preflex codec to read (not write) the pre-4.0 format... so, so far I think we can still offer live migration for 4.0?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ahh, I knew we must be talking past each other :). I assumed that the flex changes would go under the migration tool. If we have live migration for them, then I agree we should do live migration here.

With that behind us, did someone start an API migration guide? Since I removed addIndexesNoOptimize in favor of the new addIndexes, I wanted to document it somewhere. It's a tiny change, so perhaps it can go under the API Changes section in CHANGES?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

With that behind us, did someone start an API migration guide?

Not yet, I think? Go for it!

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

I will document it in CHANGES under the API section. I think the migration guide format will need its own discussion, and I don't want to block this issue on it. When we've agreed on the format (people have made a few suggestions), I don't mind helping w/ porting everything relevant from CHANGES to that guide.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

CFW's comment should be "make it 1 lower"

Right! I copied it from FieldsWriter, where the versions are kept as positive ints. Will post a patch shortly.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch applies Mike's comments. I think this is ready to go in. I'd like to commit to 3x before trunk, because there are lots of changes here.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Could you fix "firstInt" to have a very short life?

Meaning, you read firstInt and very quickly use it to assign version & count, and then no longer use it. Ie, all subsequent checks when loading should be against version, not firstInt...

Also, can you maybe rename CFW.PRE_VERSION -> CFW.FORMAT_PRE_VERSION? (to match the other FORMAT_X).

Otherwise looks great!

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

The only place I see firstInt used perhaps unnecessarily is in the for-loop. So I've changed the code to look like this:

int count, version;
if (firstInt < CompoundFileWriter.FORMAT_PRE_VERSION) {
  // negative value: a versioned (newer) file; the entry count follows
  count = stream.readVInt();
  version = firstInt;
} else {
  // pre-versioning file: the first int is the entry count itself
  count = firstInt;
  version = CompoundFileWriter.FORMAT_PRE_VERSION;
}

And then I query for version == CompoundFileWriter.FORMAT_PRE_VERSION inside the for-loop. Is that what you meant?

There is a check before all that, ensuring that the firstInt we read does not indicate index corruption – that should remain as-is, right?
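(For reference, I'd expect that guard to have roughly this shape – illustrative, not the exact committed code:)

// Anything more negative than the newest known format must come from a
// corrupt (or too-new) compound file.
if (firstInt < CompoundFileWriter.FORMAT_CURRENT) {
  throw new CorruptIndexException("Incompatible format version: " + firstInt
      + ", expected " + CompoundFileWriter.FORMAT_CURRENT + " or higher");
}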

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Updated patch w/ comments addressed. I plan to commit this either later today or tomorrow (and then port it to trunk). So if you haven't done so and want a last-chance review - now's your chance.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Shai! Thanks.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed revision 948394 (3x).

Will now port everything to trunk

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Shai,

I have only seen this lately. You added a 3.0 index ZIP to the tests. This conflicts a little bit with trunk, where a 3.0 index ZIP is already available. I would prefer to keep the "older version" ZIPs identical across releases, so it would be good if the trunk-added numerics backwards test could also be in the 3.x branch. Would this be possible? You just have to merge the code.

Also, it looks strange that the 3.0 backwards tests now also contain 3.0 index ZIPs, but there is no code for that. Why have you added this to backwards? The 3.0 backwards tests should only modify this one addIndexes test, not add the ZIPs. Maybe simply delete them; they are not used.

By the way, the 3.0 index ZIP file generation code is in the 3.0 branch - have you edited it there? You should commit the code there, so one is able to regenerate the 3.0 ZIPs from the stable 3.0.x branch.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I looked at the code; it simply tests that old indexes can be added. Maybe you should just copy the trunk ZIPs for 3.0 to the 3x branch to keep them consistent. The files don't seem to be equal.