Make CFS appendable [LUCENE-3218]

asfimport commented 13 years ago

Currently CFS is created once all files are written during a flush / merge. Once on disk the files are copied into the CFS format which is basically a unnecessary for some of the files. We can at any time write at least one file directly into the CFS which can save a reasonable amount of IO. For instance stored fields could be written directly during indexing and during a Codec Flush one of the written files can be appended directly. This optimization is a nice sideeffect for lucene indexing itself but more important for DocValues and #4289 we could transparently pack per field files into a single file only for docvalues without changing any code once #4289 is resolved.

Migrated from LUCENE-3218 by Simon Willnauer (@s1monw), resolved Nov 04 2011 Attachments: LUCENE-3218_3x.patch, LUCENE-3218_test_fix.patch, LUCENE-3218_tests.patch, LUCENE-3218.patch (versions: 7) Linked issues:

4274
- 4289

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

first sketch still some nocommits - this patch includes the latest patch from #4274 which made the CFS part of directory. This patch adds write support to the CompoundFileDirectory. The CFWriter tries to write files directly to the CFS if possible like when no other file is currently open for writing it opens a stream directly on the CFS. Yet, this change also adds a new file to the CFS (.cfe) which only holds the entry table which makes all seeks unneeded (plays better with AppendingCodec).

I currently don't use it during indexing since we decided after flush if we use CFS or not. Yet this might change with this optimization but I will leave this to another issue.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks cool!

So the CFW will take the first output opened against it and let it write directly into the "actual" CFS file, and then if another file is opened while that first one is still open, the 2nd file will write to separate file and then will copy in on close. We may want to delegate the separate files too? So that on close they copy themselves into the CFS and remove the original? This way IW won't have to separately create CFS in the end.

Somehow we need IW to add the biggest sub-file first...

s/compund/compound

CFW.close should assert currentOutput != null (and, if we delegate sep entries, that they are also all closed)?

You might need to sync the CompoundFileWriter.this.currentOutput test / setting to null? Though... Lucene is always single threaded in writing files for the same segment, today anyway.

Can we make a separate createCompoundOutput? (Ie, instaed of passing OpenMode to openCompoundInput). And: I'm assuming a given compound output can only be opened once, appended to / separate files copied into, closed and then never opened again for writing? (Ie, still "write once" at the file level).

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

next iteration - seems close.

moved CFW to o.a.l.store and made package private.
added createCompoundOutput to Directory instead of passing OpenMode
added write support to CompundFileDirectory
Separately written file are appended during close if possible (no other file is currently written directly to the CF). If files is locked append happens once that file is closed.
IW uses Directory methods only, addFile has been converted to Directory#copy

once thing which still bugs me is the setAbortCheck on CFDirectory.. I wonder if we can solve that differently, ideas?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

updated patch NOW containing all files :)

sorry for the missing files in the last patch

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks great!

Can we name it createCompoundOutput? Emphasizes that we are write-once (this file shouldn't exist), and matches createOutput.

On checkAbort... we could not send that to the CFW and instead call checkAbort in the outer loops? (Ie, where we .copy the files in). The existing CFW already only checks once-per-file anyway...

Maybe instead of asserts for the mis-use of the CFD API (eg no entries, something is still open), we should make these real exceptions (ie, thrown even when assertions are off)?

This comment looks stale (in CFW.java)?:

      // Close the output stream. Set the os to null before trying to
      // close so that if an exception occurs during the close, the
      // finally clause below will not attempt to close the stream
      // the second time.

openCompoundOutput needs javadoc.

CFD.createOutput's jdoc says Not Implememented but it is.

The new test cases in TestCompoundFile names its file d.csf ;) Column stride fields lives on!! Too many tlas...

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

final patch.

fixed javadocs + several javadoc warnings
renamed openCompoundOutput to createCompoundOutput
fixed file extensions in test CSF LOL!!
copyFileEntry now deletes files that are separately written once copied into the CFS.
converted asserts to exceptions in CFW

I plan to commit this today if nobody objects.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Committed in revision 1138063. I will try to backport this to 3.x if possible

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a patch against 3.x. I had to change one test in lucene/backwards and remove some tests from there which used the CFW / CFR.

A review would be good here!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Simon, currently this attached patch fails... not sure why yet.

But I think we should resolve this tests issue before backporting

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

thank you robert, while this has actually been tested since its in the base class though its now cleaner. The test failure came from RAMDirectory simply overriding existing files. I added an explicit check for it.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Simon, I feel better now that we get our open-files-for-write tracking back.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

backported to 3.x - thanks guys

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

See #4447, #4453.

There are some serious traps here with how the CFE files are created for "delegator"-like Directory impls such as FileSwitchDirectory and NRTCachingDirectory.

With such dirs that might have policies the writer is "backdooring" their decision about where files should go since it only has a reference to the "sub" directory.

I think this is non-intuitive and we should do something to try to prevent surprises.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Can CFS reading/writing not take a parent directory, instead of:

CompoundFileDirectory(Directory parent, ....)

I think it should be CompoundFileDirectory(IndexInput cfs, IndexInput cfe)

And directory.createOutput etc should take both filenames, this would remove this backdooring completely.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Maybe we can avoid making a separate _X.cfe file?

We did this because previously the CFS stored the header in the front of the file (I think)?

Could we, instead, put the header at the end of the file, but place a long pointer at the start of the file saying where the header is located (I'd rather not rely on file.length())? Then we could have a single (_X.cfs) file again and we can not use the Dir impl for delegation?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

We did this because previously the CFS stored the header in the front of the file (I think)?

is this really the problem here? I mean this problem is in FileSwitch / NRT Directory. The CFS uses a directory to write files, I would expect that if we use for instance NRT directory it gets the NRT directory instead of either of of its sub directories. its not really a CFS problem IMO and we should rather fix the actual directory rather than reverting the small optimization having the header in a separate file. i think we should prevent the seek if not absolutely necessary.

asfimport commented 13 years ago

Marvin Humphrey (migrated from JIRA)

I don't fully grok Robert's concern, but with regards to Mike's suggestion of inlining the metadata: Why not put that file pointer at the very end of the file? So that the read-time sequence of actions is: seek to 8 bytes before the end, read the file pointer, seek back to beginning of metadata.

That way you don't need to seek backwards during writing, which IIRC used to be an issue for Hadoop.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This is definitely not a bug in the directory, and its a serious issue (i think a blocker for release myself).

I'll try to explain the issue again a little better than I did on https://issues.apache.org/jira/browse/LUCENE-3380?focusedCommentId=13086872&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13086872

This is just an example of the API problem, with FileSwitchDirectory.

In Lucene we have FileSwitchDirectory which is a Directory that lets you "switch" between 2 different directory implementations based on file extension. So conceptually it looks like this:

FileSwitchDirectory extends Directory {
  Directory a;
  Directory b;
  Set extensions; // these are the file extensions that go to "a", all other ones are handled by "b"
}

Imagine you configure this directory to put all "*.cfs" in "a", and everything else in "b".

So when FileSwitchDirectory is asked where to put "1.cfs", it forwards the request to "a".

But the "1.cfe" file is actually wrongly created in "a" also, causing FileNotFoundExceptions later when the file is to be read, because its in the wrong directory. This is because of how the compound file mechanism works now, it calls a.createOutput("1.cfe") instead of fileswitchdirectory.createOutput("1.cfe").

So this is a serious problem for any Directories that delegate responsibility like this, not just the ones in Lucene.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Thanks Robert for explaining this again, I agree 100% with you, the current cfe/cfs discussion is really serious and the current impl is heavy broken.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe we can avoid making a separate _X.cfe file?

+1, this sounds great (however it can be done, ideally with Marvin's idea to support appendable-only filesystems also), and would end the confusion here.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

i think we should prevent the seek if not absolutely necessary.

You have a very small optimization here that only affects opening the CFS.

But because we need to fix the wrong behaviour in FileSwitch (and also NRTCaching dir, which is in my opinion more serious!!!!), FileSwitch and NRTCachingDir now use the default CompoundFileImpl. If you wrap MMapDir by FileSwitch or NRTCaching, the whole custom impl of the compound file in MMap that speeds up even further is obsolete, as not used (you can use the compound file with really no slowdown at all as we can map parts of the CFS file into memeory and need no offset calculations and can also save mapping costs).

This is gone now, just because a one-time seek at opening time is prevented.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Right, the fix I applied is really a hack, but I didnt want to leave our codebase broken while we figure this out.

Its not just a problem from a performance perspective, I think its just bad to make assumptions about how the inner directory works. In this case with fileswitchdirectory etc, it really should be fully delegating this stuff down, and be clueless about how its implemented by the underlying sub directory.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Why not put that file pointer at the very end of the file? So that the read-time sequence of actions is: seek to 8 bytes before the end, read the file pointer, seek back to beginning of metadata.

I would rather not rely on metadata (file length) when reading, only the contents of the file.

I think append-only filesystems (eg HDFS) can make their own impl that uses the file length instead (like AppendingCodecc).

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Anyone have opinions on 3.x?

I think it might be wise to think about pulling these CFS changes (LUCENE-3201, too) back from 3.x and instead give things some time to settle in trunk?

We could always then backport at a later release, but we got lucky here that we still haven't released anything.

asfimport commented 13 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

I think append-only filesystems (eg HDFS) can make their own impl that uses the file length instead (like AppendingCodecc).

AppendingCodec solves only one issue, that of postings and SegmentInfos. I'm worried that adding seek+rewrite tricks in other places that are not under the control of Codec or under any other configurable implementation (such as CFS) will ultimately prevent the efficient use of Lucene on Hadoop. Unless we put those places under the control of a Codec (or some other configurable interface).

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The trick with the latest updates to compound files is that the CompoundFileWriter/Reader is returned by the directory implementation - and this is broken and the discussion is about this. So this would be the place, where you theoretically could completely make another CFS on-disk format or e.g. write the stuff to a ZIP file :-)

See here: https://builds.apache.org/job/Lucene-3.x/javadoc/core/org/apache/lucene/store/Directory.html#createCompoundOutput(java.lang.String)

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

personally I think we should try to be append only on general. So eventually this is about creating the cfe and cfs file from the "right" directory. What we could do to use the parent ie. FileSwitchDir etc. is add a protected method that allows passing the parent dir to the createCompoundOutput / openCompoundInput which is then in turn used to create the actual files. We can call this method from the public createCompoundOutput / openCompoundInput versions with "this" as the directory to create files. How does that sound? Lemme know if I miss something...

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I disagree, we don't need to compensate for hadoop's problems.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

If we want append only, we should also remove seek methods from IndexOutput... I DISAGREE, too!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

So eventually this is about creating the cfe and cfs file from the "right" directory.

That's not the only issue: while that is the primary reason I reopened this issue I also have concerns about the API being complicated and non-intuitive.

Making the API even more complicated because Filesystem X can only write WingDings or cannot seek doesn't seem to be a good solution to me.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

When looking into the CompoundFileDirectory code I also found a small bug in version handling. readEntries() reads the first VInt and uses it for version checking (if negative). This check has 2 problems:

if the VInt is smaller then FORMAT_CURRENT it should throw IndexTooNewException
the comparison should not be against FORMAT_CURRENT itsself (this constant should only be used for writing CFS files), it should compare against real version numbers. This would otherwise break on later additions of new formats.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the situation here is too complicated already, we are discussing all kinds of complicated stuff and I dont think "appendable" CFS is worth any of this.

I think we should back out these CFS changes for now.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should back out these CFS changes for now.

+1

Generally if we add a cool optimization and it turns out that optimization risks even just apparent index corruption and/or adds scary traps / confusing complexity to the API I think we should pull the change and iterate on the issue / branch until these problems are addressed?

We had a similar experience with copyBytes, but that time it was real corruption.

Optimizations aren't worth such risks I think, especially if it's only an index-time opto?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think we should back out these CFS changes for now.

-1

I think we are over reacting here, especially robert gets too crazy about this. Honestly I think CFS should be detached from directory and we should make it a delegating directory if at all. That way we would always operate on the right directory, can safely create two files and keep Directory itself clean. We can still add the ability to partially map a certain file (offset, length) into memory like we do now in the specialized CFS Dirs. This entire think is not a problem of appending at all IMO.

how does that sound? I think this would solve all the problems we are having and keeps it appendable.

simon

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi when thinking about the whole stuff one more time again, I may have a solution to again decouple CFS from the parent directory, so one can create any CFS using one single class (but perhaps the factory in directory is still an idea to make it customizable). There are several solutions, but most of them have customization problems:

The current approach was discussed already, nothing more to say
A possibility to make it possible for MMap to map certain parts of the file is to move the getIndexInputSlice up to the abstract Directory base class and make the default implementation the current CFIndexInput from the default CFS impl. This would be even backwards compatible. So the CFS impl can simply ask the parent directory it warps for a slice. The problem here is easy: Current CFS impl opens the CFS file exactly one time and consumes exactly one file handle. The slices work on the same file handle. If we move the slice handling up to the directory, the "state" is gone, so handling the all-the-time open CFS file cannot be managed anymore. When using a new file handle for each slice, we gain nothing (CFS is to reduce file handles).
Last night I had one idea that might fix this issue. Lets move the slice handling into the abstract IndexInput base class, again the default impl would simply use the current CFIndexInput to return a slice. In the case of MMapIndexInput it would simply return a remapped slice on the current file handle. The only thing that would change is that the RAF would kept open the wohle time (like MMapCFDirectory does), in contrast to curren, where th RAF is closed directly after mapping. This approach would allow it for the CFS impl to simply ask it parant directory for an IndexInput to handle the SFC file itsself and for each sub-slice ask this IndexInput for this.

The last approach seems reasonable, but we need some more checks how to implement that. The last approach keeps both "features" of CFS:

One OS file handle
possibility for certain directory implementations to return sliced IndexInputs in an optimal way. The current IndexInput have a clone method, in this case we would need a similar method, where you can give offset and length.

On the other hand, we can remove the "factory" for CFS files from directory, we can go back to a simple new CFSDirectory(parentDirectory, cfsName).

Does this sound reasonable?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

None of this is reasonable.

When something goes wrong with an optimization, and multiple people ask for you to back it out, back it out.

then later we can discuss how to re-implement it.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

But we can still consider this as solutions to solve the issue later? I just dont want to make suggestions with lots of brainwork and sleepless nights involved, if it's not considered and just be backed out with "None of this is reasonable.".

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think both Simon's and Uwe's ideas are good and should be explored! With all these ideas we will find a clean way to get CFS reading/writing integrated into Directory.

But I think that exploration should just be outside of trunk and 3.x, eg on a branch. Once we iterate to a good point again we can commit it back to trunk, let it bake/age, then merge back to 3.x if it seems stable.

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

+1 on backing out of 3.x at least - this is our stable branch...I can't imagine this optimization belongs in our stable branch given all of this discussion...

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

its all yours do whatever you think needs to be done. have fun ;)

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

None of this is reasonable.

your unreasonable comments here are totally counter productive IMO. Just my $0.05

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Isn't there a lot of middle ground here? Why don't we just back out of 3.x for now and keep pushing towards a consensus implementation on trunk?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

FYI - I backed out the changes from 3.x

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Showuld we send an email to java-user as the index format in the stable branch changed by this (indexes with new CFS files can no longer be read)?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Showuld we send an email to java-user as the index format in the stable branch changed by this (indexes with new CFS files can no longer be read)?

I will do

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a rough patch against trunk that detaches CFDirectory from other dir impls but still extends Directory. all optimizations still remain while always the top-level directory is used to create / open the CFS. I didn't follow uwes last approach since I could figure out when to close the RAF reference since like in the MMap case we want to forcefully unmap the file and therefor would also need to close the RAF reference in the base stream. I use a helper construct (IndexInputHandle - yes need to find a better name) that can only be pulled from another directory (protected). So this is private to the directory impls but solves our cases here nicely I think.

still rough but its a start.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Simon,

thanks for taking care. This looks really nice and easier to understand. I agree, the problem with the RAF open file is hard to manage (especially when to close it).

One small suggestion: Currently the CFS file is opened twice: One time to read the contents and a second time to read the actual files using the handle (and for new format to read the CFE file, but thats unavoidable - once we nuke old index support in Lucene 5, we can always open the cfe first and read the contents, but until then we need to do both). Why not open the IndexInputHandle at the beginning and then simply request a full slice for the directory initialization (or ideally only that part that contains the directory)? The slice can then be closed afterwards as before.

So very cool work! Greetings from Berkeley!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This looks really nice and easier to understand.

I agree! At a glance, it seems this design looks much better to and avoids the sneaky delegation problems.

Only one minor thing glancing at the patch: in MockDirectoryWrapper, can we track that the handle is actually closed?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

next iteration

removed 3.4 support for the current format (cfe file) in SI
regenerated the BW test indices (not in the patch)
add javadoc for IndexInputHandle
added IndexInputHandle#openFullSlice to get a slice spanning the entire file.
Track indexInputHandle instances in MockDirectoryWrapper to ensure they are closed.
Use the IndexInputHandle ie. the underlying file handle to create all streams in CFS (uwes suggestion - thanks for that)

I didn't include the generated indices for bw tests in the patch for size / readability. Yet, if you want to run the tests you need to generate them otherwise TestBackwardsCompatibility will fail.

this seems close, the question is if we want to backport this to 3.x too?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This approach looks nice! Maybe rename IndexInputHandle to IndexInputProvider? IndexInputSlicer? SliceCreator?

Maybe rename CSIndexInput -> SlicedIndexInput?

In SimpleFSDir we may as well move that static Descriptor class out? Rather than having to import it to itself.

apache / lucene

Make CFS appendable [LUCENE-3218] #4291

4274

4289