apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

More options for stored fields compression [LUCENE-5914] #6976

Closed asfimport closed 9 years ago

asfimport commented 10 years ago

Since we added codec-level compression in Lucene 4.1, I think I have heard from about as many users complaining that compression is too aggressive as complaining that it is too light.

I think this is because users do very different things with Lucene. For example, if you have a small index that fits in the filesystem cache (or close to it), then you might never pay for actual disk seeks, and in that case the fact that the current stored fields format needs to over-decompress data can noticeably slow down cheap queries.

On the other hand, it is more and more common to use Lucene for things like log analytics, where you have huge amounts of data and don't care much about stored fields performance. However, it is very frustrating to notice that the data you store takes several times less space when you gzip it than it does in your index, even though Lucene claims to compress stored fields.

For that reason, I think it would be nice to have some kind of option that would allow trading speed for compression in the default codec.


Migrated from LUCENE-5914 by Adrien Grand (@jpountz), 5 votes, resolved Dec 11 2014. Attachments: LUCENE-5914.patch (versions: 8)

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

What I have been thinking about is providing 2 options for compression (we could have more than 2, but that would make things more complicated backward-compatibility wise):

Here is how the current patch tries to address these requirements:

  1. For high compression, documents are grouped into blocks of 16KB (pretty much like today), but instead of being compressed with LZ4 they are compressed with deflate at a low compression level (3, which is the highest level that doesn't use lazy match evaluation; I think it is a good trade-off for our stored fields). See the sketch after this list.

If you want to decompress a document, you need to decompress the whole block.

  2. For better search speed, documents are compressed individually with LZ4. In order to keep the compression ratio good enough, documents are still grouped into blocks, and the data that results from the compression of the previous documents in the block is used as a dictionary to compress the current document.

When you want to decompress, you can decompress a single document at a time; all you need is a buffer that stores the compressed representation of the previous documents in the block, so that the decompression routine can make references to it.
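
To make the first option concrete, here is a minimal sketch (assumed names, not the patch's actual code) of compressing one block of serialized documents with java.util.zip.Deflater at level 3:

```java
import java.util.zip.Deflater;

public class BlockCompressionSketch {
  /**
   * Compress one ~16KB block of concatenated serialized documents with
   * deflate at level 3 (the highest level without lazy match evaluation).
   */
  static byte[] compressBlock(byte[] block) {
    Deflater deflater = new Deflater(3, true); // level 3, raw deflate stream
    deflater.setInput(block);
    deflater.finish();
    // Output buffer is larger than the input so a single deflate call suffices,
    // even for incompressible data.
    byte[] buffer = new byte[block.length + 64];
    int compressedLength = deflater.deflate(buffer);
    deflater.end();
    byte[] compressed = new byte[compressedLength];
    System.arraycopy(buffer, 0, compressed, 0, compressedLength);
    return compressed;
  }
}
```

Reading any single document of the block then requires inflating the whole block, which is exactly the trade-off described above.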

In both cases, I tried to implement it in such a way that it is not required to override the default bulk merge API in order to get good merging performance: the readers keep some state that allows them to read documents sequentially. This should also help for operations like index exports, since they would get much better performance when iterating over documents in order.

The patch is not ready yet: it is too light on tests for me to sleep quietly, and it is quite inefficient on large documents. For now I'm just sharing it in order to get some feedback. :-)

For the shared dictionary logic, I looked at other approaches that didn't work out well:

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Haven't looked at the patch I confess, but is there a way to turn compression off completely? I've seen a few situations in the wild where compressing/decompressing is taking up large amounts of CPU.....

FWIW, Erick

asfimport commented 10 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

Haven't looked at the patch I confess, but is there a way to turn compression off completely?

+1.

From what I understand, this would be relatively straightforward in Lucene, where you can swap low-level components in and out pretty easily, but I'm really looking for that option to be user-configurable in Solr. I know that will require a separate issue. For my index, compression is a good thing, but like Erick, I've seen situations with Solr on the list and IRC where it really hurts some people.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The patch adds conditional logic to the default codec, instead of different formats. Why this approach?

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

@ErickErickson @elyograg Do you have pointers to emails/irc logs describing such issues? Maybe that is something that can be solved without disabling compression? By the way, the current patch already improves CPU usage in two cases: random access, since you can decompress a single document at a time, and sequential access (typically used if you export your dataset using stored fields, or internally by Lucene when merging mixed codecs), so maybe that would already help.

@rmuir I will change this.

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

@jpountz Well, I was just curious about the motivation behind it.

There are advantages and disadvantages to each way, but as separate formats each use case would really be able to proceed in its own direction in the future.

With the current patch, BEST_COMPRESSION = deflate, but what if we wanted to replace this with bzip later and still support the old indexes, etc. (which I think is a goal of this issue)?

So I think it's better to have separate formats (if they want to share some code behind the scenes at the moment, that's ok). If we want to provide back compat for both options, then that's something we can decide to do here (IMO, there should be a "price" for the added complexity, such as moving all ancient codecs out of lucene/core, dropping 3.x index support, something to keep it all manageable).

asfimport commented 10 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

Do you have pointers to emails/irc logs describing such issues?

I've seen this issue on AWS small instances, where CPU "steal time" and not I/O can easily become the limiting factor.

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

These were some users-list discussions; I can't lay my hands on them right now.

Whether there were work-arounds or not, I've always been uncomfortable with not having the option to turn off compression. There are just too many places where people do "non-standard" things, like storing not-very-many huge documents, to take that option away from them unless we can guarantee that in all configurations trading I/O for CPU is A Good Thing. And I don't believe we can make that guarantee.

That said, I'm not the one doing the coding, so I don't know the ins and outs here. If it's easy to turn compression off, then I think it's worth doing. If it's major surgery, OTOH, I don't have any hot complaints to point to, so insisting that you do the work because of a vaguely-remembered user-list discussion (that I can't prove had no work-around) is just not in the cards ;). The response has been "write your own codec", so maybe the right option is to provide a non-compressing codec? Here's where you tell me there already is one ;)

Andrzej points out an interesting case though.

asfimport commented 10 years ago

Stefan Pohl (migrated from JIRA)

This is an interesting discussion based on awesome work, thanks Adrien!

In my experience, LZ4 can actually speed up reading (as implemented here) when working on byte arrays of reasonable size, say 10k. There can, however, be some small overhead for short stored fields (e.g. if only an int is stored). It remains to be seen how this impl compares to no compression.

On AWS instances such as the ones @sigram refers to, wouldn't switching off index compression be even more helpful, especially after having per-doc decompression as implemented here? Lucene and most other IR engines have always traded CPU for I/O/size, because index compression is on by default. That doesn't mean, though, that we shouldn't provide the option to users if a case can be made for it :)

asfimport commented 10 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That doesn't mean, though, that we shouldn't provide the option to users if a case can be made for it

It's ok to provide such options without hesitation in the codecs/ module, however:

We have to be careful, this issue proposes supporting such options in the default codec.

This is a completely different thing. This means we support such formats for years and years and years. Currently as we speak we are trying to release 4.10, and it must still be able to read the 3.0 index format from 5 years ago (and we had to respin another release candidate because there were bugs in such support).

So that's why I point out: if we want to add an option (and I expect we can feasibly add at most 1 option here), then tradeoffs have to be made in our backwards compatibility support such that we can maintain this stuff and it does not spin out of control.

asfimport commented 10 years ago

Eks Dev (migrated from JIRA)

Do you have pointers to emails/irc logs describing such issues?

I do not know what the gold-standard Lucene usage is, but there is at least one use case I can describe; maybe it helps. I am not proposing anything here, just sharing experience.

Think about the (typical Lucene?) usage with structured data (e.g. indexing a relational db, like a product catalog or such) with many smallish fields, and then retrieving 2k such documents to post-process them: classify them, cluster them or whatnot (e.g. Mahout and co.).

Ideally we should be able to use a biggish chunk_size during compression to improve the compression ratio, yet decompress only a single document (independent of chunk_size), just like you proposed here (if I figured it out correctly?)

Usually, such data is highly compressible (imagine all these low cardinality fields like color of something...) and even some basic compression does the magic.

What did we do?

Conclusion: compression is great, and anything that helps smoothly tweak this balance (CPU effort / IO effort) across the indexing and retrieval phases makes Lucene's use-case coverage broader (e.g. "I want to afford more CPU during indexing, and less CPU during retrieval", a static coder being the extreme case of this...).

I am not sure I figured out exactly if and how this patch is going to help in such cases (how to achieve reasonable compression if we do per-document compression for small documents? Reusing dictionaries from previous chunks? Static dictionaries?...).

In any case, thanks for doing the heavy lifting here! I think you already did really great job with compression in lucene.

PS: Ages ago, before Lucene, when memory was really expensive, we had our own serialization (not in Lucene) that simply had one static Huffman coder per field (with byte or word symbols), with the code table populated offline. That was a great, simple option, as it enabled reasonable compression for "slow-changing collections" and really fast random access.

asfimport commented 10 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Ideally we should be able to use a biggish chunk_size during compression to improve the compression ratio, yet decompress only a single document (independent of chunk_size), just like you proposed here (if I figured it out correctly?)

Exactly, this is one of the two proposed options. The only overhead would be that you would need to read the shared dictionary and have it in memory (but that is a single call to readBytes and its size can be controlled so that should be no issue).

Usually, such data is highly compressible (imagine all these low cardinality fields like color of something...) and even some basic compression does the magic.

Agreed, this is the reason why I'd prefer the "low-overhead" option to be something cheap rather than no compression at all: data usually has lots of patterns and even something as simple as LZ4 manages to reach interesting compression ratios.

Conclusion: compression is great, and anything that helps smoothly tweak this balance (CPU effort / IO effort) across the indexing and retrieval phases makes Lucene's use-case coverage broader (e.g. "I want to afford more CPU during indexing, and less CPU during retrieval", a static coder being the extreme case of this...).

I am not sure I figured out exactly if and how this patch is going to help in such cases (how to achieve reasonable compression if we do per-document compression for small documents? Reusing dictionaries from previous chunks? Static dictionaries?...).

The trade-off that this patch makes is:

The current patch provides two options:

I'll try to explain the 2nd option better: it works well because lz4 mostly deduplicates sequences of bytes in a stream. So imagine that you have the following 3 documents in a block:

  1. abcdefghabcdwxyz
  2. abcdefghijkl
  3. abcdijklmnop

We will first compress document 1. Given that it is the first document in the block, there is no shared dictionary, so the compressed representation looks like this (literals means that bytes are copied as-is, and ref means a reference to a previous sequence of bytes; this is how LZ4 works: it just replaces existing sequences of bytes with references to previous occurrences of the same bytes, and the more references you have and the longer they are, the better the compression ratio):

<literals:abcdefgh><ref:abcd><literals:wxyz>

Now we are going to compress document 2. It doesn't contain any repetition of bytes, so if we wanted to compress it individually, we would just have <literals:abcdefghijkl>, which doesn't compress at all (and is even slightly larger due to the overhead of the format). However, we are using the compressed representation of document 1 as a dictionary, and "abcdefgh" exists in its literals, so we can make a reference to it!

<ref:abcdefgh><literals:ijkl>

And again for document 3, using the literals of document 1 for "abcd" and the literals of document 2 for "ijkl":

<ref:abcd><ref:ijkl><literals:mnop>
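
The same shared-dictionary idea can be sketched with deflate's preset-dictionary API (illustration only: the patch uses LZ4, not deflate, and its on-disk layout differs; class and method names here are made up). Each document is compressed individually with the block's previous documents as the dictionary, so a single document can be decompressed without touching its neighbours.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SharedDictionarySketch {
  /** Compress one document, using the previous documents of its block as a preset dictionary. */
  static byte[] compressDoc(byte[] doc, byte[] previousDocs) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    if (previousDocs.length > 0) {
      deflater.setDictionary(previousDocs);
    }
    deflater.setInput(doc);
    deflater.finish();
    byte[] buffer = new byte[doc.length + 64];
    int length = deflater.deflate(buffer);
    deflater.end();
    byte[] compressed = new byte[length];
    System.arraycopy(buffer, 0, compressed, 0, length);
    return compressed;
  }

  /** Decompress one document; only the dictionary is needed, not the other documents' compressed data. */
  static byte[] decompressDoc(byte[] compressed, byte[] previousDocs, int docLength)
      throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] doc = new byte[docLength]; // the decompressed length would be stored alongside the document
    int n = inflater.inflate(doc);
    if (n == 0 && inflater.needsDictionary()) {
      // The stream was compressed with a preset dictionary: supply it and inflate again.
      inflater.setDictionary(previousDocs);
      inflater.inflate(doc);
    }
    inflater.end();
    return doc;
  }
}
```

The actual patch uses the compressed bytes of the previous documents as the LZ4 dictionary, as described above, but the decompress-one-document-at-a-time property is the same.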

asfimport commented 10 years ago

Eks Dev (migrated from JIRA)

Lovely, thanks for explaining. I expected something like this but was not 100% sure without looking into the code. Simply put, I see absolutely nothing one might wish for from general, OOTB compression support...

In theory, the only meaningful enhancements over the standard approach can come from modelling the semantics of the data (the user must know quite a bit about its distribution) to improve compression/speed, but this cannot be provided by the core (Lucene is rightly content-agnostic); at most the core APIs might make it more or less comfortable, but IMO nothing more.

For example (contrived, since LZ4 would deal with this quite well, just to illustrate): if I know that my field contains up to 5 distinct string values, I might add simple dictionary coding to use at most one byte per value without even going down to the codec level. The only place where I see a theoretical need to go down and dirty is to reach sub-byte representations (3 bits per value in this example), but that is rarely needed, hard to beat with default LZ4/deflate, and even harder not to make slow. At the end of the day, someone who needs this type of specialisation should be able to write their own per-field codec.
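
For instance, a tiny application-level sketch of that dictionary-coding idea (hypothetical names, nothing Lucene-specific): map each of the known values to a one-byte code before storing, and decode it after retrieval.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical application-level dictionary coding for a low-cardinality field. */
public class ColorDictionary {
  private static final String[] VALUES = {"red", "green", "blue", "black", "white"};
  private static final Map<String, Byte> CODES = new HashMap<>();
  static {
    for (int i = 0; i < VALUES.length; i++) {
      CODES.put(VALUES[i], (byte) i);
    }
  }

  /** Encode a field value to a single byte before adding it to the stored field. */
  static byte encode(String value) {
    Byte code = CODES.get(value);
    if (code == null) {
      throw new IllegalArgumentException("Unknown value: " + value);
    }
    return code;
  }

  /** Decode the stored byte back to the original value after retrieval. */
  static String decode(byte code) {
    return VALUES[code];
  }
}
```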

Great work, and thanks again!

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Here's an example of what I was remembering. I haven't verified it, and it could be totally unrelated to compression. The use case seems to be very large documents (up to 50M), with indexing time going from 3 hours to 3 days. There may very well be stuff going on here that is unrelated, but...

Skipping the discussion about what the point of 50M docs is ;)

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201409.mbox/%3C27589A9FF2E7CC42B56140FD01CEC3A2EF794C56%40clrc-nt-mbx01p.corp.org.local%3E

asfimport commented 10 years ago

Ryan Ernst (@rjernst) (migrated from JIRA)

@ErickErickson That example doesn't appear to even have stored fields enabled?

<field name="majorTextSignalStem" type="text_stem" indexed="true" stored="false" multiValued="true" omitNorms="false"/>

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

@rjernst Right, that particular field doesn't; I was assuming that other fields did.

Hmmm, let me ping him back and we'll see if we can get him to experiment for us.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is a new patch. Quick reminder of what it brings:

Compared to the previous version of the patch, we now have two standalone stored fields formats, one per option, which makes testing easier, as well as a main Lucene50StoredFieldsFormat which delegates to these formats based on the result of Lucene50Codec.getStoredFieldsCompression.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Some updates:

We should try to do more cleanup:

I will look more tomorrow.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

nuked abort() in #7144 and removed the delegating writer here.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I will think more on the header check logic. On one hand it would be nice if we could figure out an idiom for CodecUtil.checkFooter(Throwable, Input...) that would be safe, but every option I think of is crazier than the current stuff.

It's also a bit weird to be slurping in 3 read-once metadata files here. This adds complexity at read time... the current format is simpler here with just a single file. Can we avoid this?

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

It's also a bit weird to be slurping in 3 read-once metadata files here. This adds complexity at read time... the current format is simpler here with just a single file. Can we avoid this?

I tried to look into it, but it's not easy. Lucene41 has its own custom stored fields index, which is mostly the same thing as MonotonicBlockPackReader, so for this new codec, I wanted to move the index to MonotonicBlockPackReader.

The index for stored fields basically stores two pieces of information: the first doc ID of each block, and the start pointer of each block. In Lucene41, blocks were interleaved, but that is not something MonotonicBlockPackWriter allows, which is why there are 2 files: one for doc IDs and one for start pointers. A second limitation is that, at read time, you need to know up-front how many values the MonotonicBlockPackReader stores in order to decode it. This is why we have a 3rd file for metadata that stores the number of documents in the segment upon the call to StoredFieldsWriter.finish.

I agree having 3 read-once files might look strange, but it's probably better than having to re-implement specialized monotonic encoding?
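
To make the index side concrete, the lookup essentially boils down to the following sketch (illustrative names only, not the MonotonicBlockPackReader API): binary-search the per-block first doc IDs to find the block containing a doc ID, then jump to that block's start pointer.

```java
import java.util.Arrays;

/** Sketch of the two monotonic arrays the stored fields index keeps per block. */
public class StoredFieldsIndexSketch {
  private final int[] firstDocIds;    // first doc ID of each block, strictly increasing
  private final long[] startPointers; // file pointer of each block, increasing

  StoredFieldsIndexSketch(int[] firstDocIds, long[] startPointers) {
    this.firstDocIds = firstDocIds;
    this.startPointers = startPointers;
  }

  /** Index of the block that contains docId. */
  int blockFor(int docId) {
    int idx = Arrays.binarySearch(firstDocIds, docId);
    // When docId is not the first doc of a block, binarySearch returns (-(insertion point) - 1),
    // and the containing block is the one just before the insertion point.
    return idx >= 0 ? idx : -idx - 2;
  }

  /** File pointer at which the given block starts. */
  long startPointer(int block) {
    return startPointers[block];
  }
}
```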

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

A second limitation is that, at read time, you need to know up-front how many values the MonotonicBlockPackReader stores in order to decode it. This is why we have a 3rd file for metadata that stores the number of documents in the segment upon the call to StoredFieldsWriter.finish.

But this is redundant with SegmentInfo?

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Sorry, I meant number of blocks instead of number of documents.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Here is a new patch that iterates on Robert's:

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I opened #7147 for the SI.attributes, which should help with cleanup.

I ran some benchmarks on various datasets to get an idea of where this is at, and they are disappointing. For geonames, the new format increases the size of the stored fields by 50%; for Apache HTTP server logs, it doubles the size. Indexing time is significantly slower for every dataset I test as well: there must be bugs in the LZ4 + shared dictionary?

| impl | size (bytes) | index time (ms) | force merge time (ms) |
| --- | --- | --- | --- |
| trunk | 372,845,278 | 101,745 | 15,976 |
| patch (BEST_SPEED) | 780,861,727 | 141,699 | 60,114 |
| patch (BEST_COMPRESSION) | 265,063,340 | 132,238 | 53,561 |

To confirm it's a bug and not just the cost of additional I/O (due to less compression with shared dictionaries), I set the deflate level to 0 and indexed with the BEST_COMPRESSION layout to really jack up the size. Sure, it created a 1.8GB stored fields file, but in 126,093ms with 44,377ms merging. This is faster than both options in the patch...

Anyway, this leads to more questions:

About the oversharing issue: I really think the separate formats should just be separate formats; it will make life easier. It's more than just a difference in compression algorithm, and we shouldn't try to structure things so that one can just be swapped in; I think it's not the right tradeoff.

For example, with high compression it's more important to lay data out in a way where bulk merge doesn't cause re-compressions, even if that causes 'temporary' waste along segment boundaries. This is important because compression here gets very costly, and for e.g. the "archiving" case bulk merge should be potent, as there shouldn't be many deletions: we shouldn't bear the cost of re-compressing over and over. This gets much, much worse if you try to use something "better" than gzip, too.

On the other hand, with low compression we should ensure merging is still fast even in the presence of deletions: the shared-dictionary approach is one way; another is to just have at least getMergeInstance() remember the current block and implement a "seek within block" optimization, which is probably simpler and better than what trunk does today.
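
A rough sketch of that second idea, under assumed names (not the actual reader API): the merge instance remembers the block it last decompressed and only decompresses again when a requested doc ID falls outside of it, so iterating over documents in order stays cheap.

```java
/** Sketch of a merge-time reader that caches the last decompressed block. */
abstract class CachingBlockReaderSketch {
  private int blockFirstDoc = -1;
  private int blockDocCount = 0;
  private byte[][] blockDocs = new byte[0][];

  /** Decompress the whole block containing docId and return its documents. */
  protected abstract byte[][] decompressBlock(int docId);

  /** First doc ID of the block containing docId (from the stored fields index). */
  protected abstract int firstDocOfBlock(int docId);

  byte[] document(int docId) {
    if (docId < blockFirstDoc || docId >= blockFirstDoc + blockDocCount) {
      // Cache miss: decompress the block once; subsequent docs of the same block are free.
      blockDocs = decompressBlock(docId);
      blockFirstDoc = firstDocOfBlock(docId);
      blockDocCount = blockDocs.length;
    }
    return blockDocs[docId - blockFirstDoc];
  }
}
```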

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I also want to propose a new way to proceed here. In my opinion this issue tries to do a lot at once:

I think it's too scary to do all at once. I would prefer we start by exposing the current CompressionMode.HIGH_COMPRESSION as the "high compression" option. At least for that one test dataset I used above (2GB of highly compressible Apache server logs), this is reasonably competitive with the deflate option on this issue:

| impl | size (bytes) | index time (ms) | force merge time (ms) |
| --- | --- | --- | --- |
| trunk_HC | 275,262,504 | 143,264 | 49,030 |

But more importantly, HighCompressingCodec has been baking in our test suite for years, scary bugs knocked out of it, etc. I think we should first figure out the plumbing to expose that; it's something we could realistically do for Lucene 5.0 and have confidence in. There is still plenty to do there to make that option work: exposing the configuration option, addressing concerns about back compat testing (we should generate back compat indexes both ways), and so on. But at least there is a huge head start on testing, code correctness, etc.: it's baked.

For new proposed formats (LZ4 with shared dictionary, deflate, whatever), I think we should address each one individually: adding it to the codecs/ package first, getting it into tests, and letting it bake in a similar way... it doesn't need to be years, but we should split these concerns.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Applying the tuning parameters proposed for the deflate case here to the trunk code is a trivial and safe patch, and more compelling:

| impl | size (bytes) | index time (ms) | force merge time (ms) |
| --- | --- | --- | --- |
| trunk_HC_level3_24576_512 | 269,857,651 | 118,150 | 28,313 |

Edited: the impl name, to make it clear that I also bumped maxDocsPerChunk in this case.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

For the benchmarks, there are two reasons I think:

Let's move forward with your idea to reuse the lz4 hc format as a first step. I will try to improve the shared dictionary approach in a different issue.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The benchmarks I did here are for HighCompressingCodec, aka CompressionMode.HIGH_COMPRESSION, which is actually deflate, not LZ4 HC. I benchmarked LZ4 HC as well, but it didn't seem like the right tradeoff: slow writes for not much benefit.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here is a patch using the current stuff as proposed. I didn't address the need for testing beyond the bare minimum (adding another BaseStoredFieldsFormatTestCase with high compression set). We should think about how to rotate nicely in tests, how to address back compat testing, and fix the format docs too.

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This looks like a good start! Nice that we can use segment attributes now!

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch:

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1, thanks Robert. I am good with a constructor argument; the only reason I initially added it as a protected method was to be consistent with the postings and doc values formats.

Regarding the patch, it just feels weird to me to have this Objects.requireNonNull(mode) validation in the Lucene50Codec(Mode) constructor; I would have expected it to be solely the responsibility of the Lucene50StoredFieldsFormat constructor?

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I want the null check to be robust to code changes. Both classes are public; I am afraid someone will remove the checks thinking it makes something faster.

We should add tests for this too.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I added explicit tests for both of these null conditions. The logic is unchanged: we have explicit null checks everywhere.

Lucene is bogus about null checks like this everywhere: it hurts nobody's performance to do it here, and it's silly to have something like an aborting exception in IndexWriter lose somebody's documents when it could have been prevented.

I'm not going to be placed into a defensive posture where I must defend doing the right thing. IMO we can just not have this stored fields option, too. But if we are going to accept options which we must support with backwards compatibility in the default codec, then we should check the parameters.
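
For reference, a minimal usage sketch of the option as committed, assuming the Lucene50Codec(Mode) constructor and the Lucene50StoredFieldsFormat.Mode enum discussed above:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.codecs.lucene50.Lucene50Codec;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.Mode;
import org.apache.lucene.index.IndexWriterConfig;

public class StoredFieldsCompressionExample {
  /** Build an IndexWriterConfig that trades some CPU for smaller stored fields. */
  static IndexWriterConfig highCompressionConfig(Analyzer analyzer) {
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    config.setCodec(new Lucene50Codec(Mode.BEST_COMPRESSION));
    // new Lucene50Codec(null) fails fast with a NullPointerException,
    // instead of surfacing later as an aborting exception in IndexWriter.
    return config;
  }
}
```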

asfimport commented 9 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 to the patch

Looks like my comment made you angry, but it was really just something I was wondering about. I'm fine with keeping things this way.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1643490 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1643490

LUCENE-5914: More options for stored fields compression

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1643491 from @rmuir in branch 'dev/branches/branch_5x' https://svn.apache.org/r1643491

LUCENE-5914: More options for stored fields compression

asfimport commented 9 years ago

Anshum Gupta (@anshumg) (migrated from JIRA)

Bulk close after 5.0 release.