apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

change file format documentation from "bit-for-bit" to highlevel [LUCENE-2946] #4020

Closed asfimport closed 12 years ago

asfimport commented 13 years ago

While reviewing website docs in #3998, I noticed that the existing fileformats documentation is going to be pretty hopeless for 4.0.

Previously it described the format "bit-for-bit", but with flexible indexing this is somewhat silly (and who really wants a bit-for-bit explanation of some of the new formats?)

I think it would be much better to give a high-level overview, perhaps linking to javadocs or even source code for the low-level details.

We probably should delay this until 4.0 is really close in sight (since things are changing so fast) but we can go ahead and think about it some now.

For example:

Some of the things I mentioned here are probably optional, for instance I think it's "enough" to give a high-level overview of StandardCodec, but I can't help but think that explaining some of the architecture will be useful for new developers.


Migrated from LUCENE-2946 by Robert Muir (@rmuir), resolved May 07 2012

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

+1 - this all sounds great.

Some of the things I mentioned here are probably optional, for instance I think it's "enough" to give a high-level overview of StandardCodec, but I can't help but think that explaining some of the architecture will be useful for new developers.

+1 to still going into great detail for StandardCodec. I think doing this for one codec will be supremely useful, as I have found the file formats page to be in the past.

asfimport commented 12 years ago

Tom Burton-West (migrated from JIRA)

+1 for the high level overview.

+1 to still going into detail for StandardCodec. Going into the details of one Codec (especially the default one) will help those of us who have some trouble reading source code and understanding how the specific implementation details fit into the big picture. I have certainly found both the high-level and the detailed information in the existing file formats documentation helpful in understanding the trade-offs in addressing our issues with slow phrase queries and with billions of unique terms.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We can go into detail, but we can't do bit-for-bit with even StandardCodec... it's simply not feasible.

For the simple metadata files, and even stored fields and postings, it's fine (for now), but e.g. going bit-for-bit with packed integer compression of docvalues isn't very realistic, nor is trying to explain how FSTs for blocktree are serialized.

Hell, at my current pace I'll just be happy if we can even document all the different docvalues types, give some general idea of how they are encoded, or give a high-level explanation of the terms dictionary.

Even the existing "simple" metadata files are a pretty serious effort because most of the existing docs are wildly out of date.

I figure all of this is ok (I'm heavy committing) since we essentially have nothing today: just out-of-date, useless docs :)

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

fileformats is updated for 4.0

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Robert!