Open asfimport opened 14 years ago
Shai Erera (@shaie) (migrated from JIRA)
On #2659 I'm already introducing a copy(Dir, File, File) which is overridden in FSDirectory to implement using ByteBuffers (like you did on copyTo(Dir). So which directories would benefit from that? RAM only (because NIO and MMap already use FSDir's impl)?
I'm generally +1 for adding such API, just wandering who's the immediate consumer of it.
Earwin Burrfoot (migrated from JIRA)
Bad link? The issue is closed already and no mentions of Directory in the patches.
Immediate consumer, just as I said - is all bulk-merging code. I.e. - instead of loading norms to a byte array and then writing them out, you do, roughly:
IndexInput normFile = ...;
IndexOutput newNormFile = ...;
newNormFile.write(normFile, offset, length);
I looked at FSDir and refreshed my memory. copyTo is implemented with channels and transferTo, I think new method will look quite similar.
Shai Erera (@shaie) (migrated from JIRA)
Sorry, too many issue these days. I meant #3529. I've removed FSDir.copyTo mehods and instead created Dir.copy(Dir, File, File). Still need to upload the patch, with those changes.
But aside from that, I think the API you're talking about is good.
Earwin Burrfoot (migrated from JIRA)
Ahem. Why did you remove them? :) The point was to have default impl on Directory and transferTo-optimized one on FSDirectory.
Ok, let's wait for your patch.
Shai Erera (@shaie) (migrated from JIRA)
The default impl still exist, only in the form of a single file copy instead of an entire directory. There were a couple of reasons to replace them:
You can still accomplish that by iterating on the dir yourself and copy the files that you want, only you can do that selectively, leas risky and rename them in the process.
Earwin Burrfoot (migrated from JIRA)
Ah. Actually there was two methods, one that copies entire directory, and another - selected files. The former is a legacy :) Only there for back-compat-loving folk to accept the patch.
Shai Erera (@shaie) (migrated from JIRA)
Yes, I'm aware of the two methods. The one which accepts a Collection of files is better, but it still didn't allow you to rename them in the process. And adding another Collection argument, and require that the two will align seemed unnecessary. So src.copy(dest, from, to) seemed to be enough.
Copying an entire Directory is used by Lucene code only in RAMDir when it's init'ed w/ a Directory. Besides that, the scenario of copying an entire Dir is not really clear when it's useful. So the single file copy gives you as much flexibility as you need, and less chances of making crucial mistakes.
Michael McCandless (@mikemccand) (migrated from JIRA)
I agree: we should only expose the per-file copyTo, ie, the Directory shouldn't "own" the iteration through a collection of files; the caller can do that.
Earwin Burrfoot (migrated from JIRA)
The only reason for keeping that iteration within Directory was to reuse the buffer. The savings are neglectable, I think.
Copying an entire Directory is used by Lucene code only in RAMDir when it's init'ed w/ a Directory.
So what? :) I used copying an entire Directory for backup purporses, then switched to copyTo(collection), to cherry-pick a single commit.
Still I agree with switching to single file copy+rename. Back-compat luckily went out of the window, so we can design better APIs :) Can we do this in a separate issue from #3529 ?
Shai Erera (@shaie) (migrated from JIRA)
Earwin Burrfoot (migrated from JIRA)
I actually suggested separating so this minor patch goes in without being blocked by your progress on #3529 :)
Michael McCandless (@mikemccand) (migrated from JIRA)
I think this issue makes sense, separate from #3529? Ie this issue is for bulk copying when you have IndexInput/Output already open (I don't think #3529 covers this?). Whereas #3529 is operating on file names...
Earwin Burrfoot (migrated from JIRA)
Hmmm. Are we going to do this?
Optimized bulk copies IndexInput -> IndexOutput for merges.
I currently see II.copyBytes(IndexOutput out, long numBytes) method in trunk, but it's a little bit of a mess (II.copyBytes calls IO.copyBytes, strange overrides doing the same thing in various ways), no optimizations for FSDirectory (or at least NIOFSdirectory) case, no offset parameter? not used when bulk-merging? (well, DataOutput.copyBytes is used, but there's a single inefficient version of it)
Shai Erera (@shaie) (migrated from JIRA)
At some point IndexInput/Output.copyBytes did use FileChannel optimization in FSDirectory, but that caused troubles I think when the copying thread was interrupted. So it was removed and we were left w/ the default impl.
Robert Muir (@rmuir) (migrated from JIRA)
I think the problem actually wasn't interrupting but some sort of race condition?
Either way, I think its sad we had to disable the optimization, and it would be nice if we put it back (safely). But some notes:
Shai Erera (@shaie) (migrated from JIRA)
I think the problem actually wasn't interrupting but some sort of race condition?
Could be, I don't remember the exact details.
I totally agree with you, though it's like a "hen and egg" situation - we cannot develop anything safe until we have good threaded unit tests, and we can never know we have those until we have any implementation that might break. So I personally don't mind if we pursue implementation of FileChannel copying, in NIOFSDirectory only, and then investigate the current threaded indexing/search tests and add some if we think something's missing. But currently we're in sort of a limbo :).
Anyway, I don't think it's related to that issue and can be handled in a separate issue. If you agree, and assuming nothing more should be done here, we can close this one.
Robert Muir (@rmuir) (migrated from JIRA)
OK I opened #3878, this will improve our tests by ensuring they use different implementations and are completely reproducable across different operating systems. Hopefully there aren't many tests left to fix.
Steven Rowe (@sarowe) (migrated from JIRA)
Bulk move 4.4 issues to 4.5 and 5.0
Uwe Schindler (@uschindler) (migrated from JIRA)
Move issue to Lucene 4.9.
A method can be added to IndexOutput that accepts IndexInput, and writes bytes using it as a source. This should be used for bulk-merge cases (offhand - norms, docstores?). Some Directories can then override default impl and skip intermediate buffers (NIO, MMap, RAM?).
Migrated from LUCENE-2471 by Earwin Burrfoot, updated May 09 2016