apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

Supporting bulk copies in Directory [LUCENE-2471] #3545

Open asfimport opened 14 years ago

asfimport commented 14 years ago

A method can be added to IndexOutput that accepts IndexInput, and writes bytes using it as a source. This should be used for bulk-merge cases (offhand - norms, docstores?). Some Directories can then override default impl and skip intermediate buffers (NIO, MMap, RAM?).


Migrated from LUCENE-2471 by Earwin Burrfoot, updated May 09 2016

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

On #2659 I'm already introducing a copy(Dir, File, File) which is overridden in FSDirectory to implement using ByteBuffers (like you did on copyTo(Dir). So which directories would benefit from that? RAM only (because NIO and MMap already use FSDir's impl)?

I'm generally +1 for adding such API, just wandering who's the immediate consumer of it.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Bad link? The issue is closed already and no mentions of Directory in the patches.

Immediate consumer, just as I said - is all bulk-merging code. I.e. - instead of loading norms to a byte array and then writing them out, you do, roughly:

IndexInput normFile = ...;
IndexOutput newNormFile = ...;
newNormFile.write(normFile, offset, length);

I looked at FSDir and refreshed my memory. copyTo is implemented with channels and transferTo, I think new method will look quite similar.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Sorry, too many issue these days. I meant #3529. I've removed FSDir.copyTo mehods and instead created Dir.copy(Dir, File, File). Still need to upload the patch, with those changes.

But aside from that, I think the API you're talking about is good.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Ahem. Why did you remove them? :) The point was to have default impl on Directory and transferTo-optimized one on FSDirectory.

Ok, let's wait for your patch.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

The default impl still exist, only in the form of a single file copy instead of an entire directory. There were a couple of reasons to replace them:

  1. They didn't take a target-name API. when I'm copying the segments in addIndexes over on #3529, I need to rename then in the process (to reflect their new segment name), and the API did not exist.
  2. The API was very dangerous as it overwrote thr target files, no questions asked. So you could very easily overwrite one of the segments.

You can still accomplish that by iterating on the dir yourself and copy the files that you want, only you can do that selectively, leas risky and rename them in the process.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Ah. Actually there was two methods, one that copies entire directory, and another - selected files. The former is a legacy :) Only there for back-compat-loving folk to accept the patch.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Yes, I'm aware of the two methods. The one which accepts a Collection of files is better, but it still didn't allow you to rename them in the process. And adding another Collection argument, and require that the two will align seemed unnecessary. So src.copy(dest, from, to) seemed to be enough.

Copying an entire Directory is used by Lucene code only in RAMDir when it's init'ed w/ a Directory. Besides that, the scenario of copying an entire Dir is not really clear when it's useful. So the single file copy gives you as much flexibility as you need, and less chances of making crucial mistakes.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I agree: we should only expose the per-file copyTo, ie, the Directory shouldn't "own" the iteration through a collection of files; the caller can do that.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

The only reason for keeping that iteration within Directory was to reuse the buffer. The savings are neglectable, I think.

Copying an entire Directory is used by Lucene code only in RAMDir when it's init'ed w/ a Directory.

So what? :) I used copying an entire Directory for backup purporses, then switched to copyTo(collection), to cherry-pick a single commit.

Still I agree with switching to single file copy+rename. Back-compat luckily went out of the window, so we can design better APIs :) Can we do this in a separate issue from #3529 ?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

3529 depends on this, and the changes are very minor. If we do this in a separate issue, it will block my progress on #3529. Let me post a patch there today or tomorrow, and if we won't have consensus on the change, I'll open a separate issue, or reopen #3415?

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

I actually suggested separating so this minor patch goes in without being blocked by your progress on #3529 :)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think this issue makes sense, separate from #3529? Ie this issue is for bulk copying when you have IndexInput/Output already open (I don't think #3529 covers this?). Whereas #3529 is operating on file names...

asfimport commented 13 years ago

Earwin Burrfoot (migrated from JIRA)

Hmmm. Are we going to do this?

Optimized bulk copies IndexInput -> IndexOutput for merges.

I currently see II.copyBytes(IndexOutput out, long numBytes) method in trunk, but it's a little bit of a mess (II.copyBytes calls IO.copyBytes, strange overrides doing the same thing in various ways), no optimizations for FSDirectory (or at least NIOFSdirectory) case, no offset parameter? not used when bulk-merging? (well, DataOutput.copyBytes is used, but there's a single inefficient version of it)

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

At some point IndexInput/Output.copyBytes did use FileChannel optimization in FSDirectory, but that caused troubles I think when the copying thread was interrupted. So it was removed and we were left w/ the default impl.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the problem actually wasn't interrupting but some sort of race condition?

Either way, I think its sad we had to disable the optimization, and it would be nice if we put it back (safely). But some notes:

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think the problem actually wasn't interrupting but some sort of race condition?

Could be, I don't remember the exact details.

I totally agree with you, though it's like a "hen and egg" situation - we cannot develop anything safe until we have good threaded unit tests, and we can never know we have those until we have any implementation that might break. So I personally don't mind if we pursue implementation of FileChannel copying, in NIOFSDirectory only, and then investigate the current threaded indexing/search tests and add some if we think something's missing. But currently we're in sort of a limbo :).

Anyway, I don't think it's related to that issue and can be handled in a separate issue. If you agree, and assuming nothing more should be done here, we can close this one.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK I opened #3878, this will improve our tests by ensuring they use different implementations and are completely reproducable across different operating systems. Hopefully there aren't many tests left to fix.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.