apache/arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] MAX_VALUES_PER_LITERAL_RUN causes RLE encoding failure #42509

Closed: asfimport closed this issue 8 years ago

asfimport commented 8 years ago

The following code works for NUM_TO_ENCODE <= 400, but fails for larger values with the error:

Check failed: (encoded) == (num_buffered_values)

It appears to be related to how large an RLE buffer is allocated for buffering, which causes Put to fail in levels.cc:78, but there doesn't seem to be any recovery from that failure, or any error indicating what the problem is. I'm assuming MAX_VALUES_PER_LITERAL_RUN is somehow derived from the Parquet spec, but if so, it seems that an exception or some other error ought to be generated. This could also be the basis of a writer example.
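For context on where that constant likely comes from: in the Parquet spec's RLE/bit-packed hybrid encoding, a literal run's header varint stores the number of 8-value groups, shifted left by one with the low bit set. An encoder that limits itself to a single-byte header can therefore describe at most 1 << 6 = 64 groups. Below is a minimal sketch of that derivation, assuming this is indeed how the cap is chosen; the constant's actual definition in parquet-cpp may differ.

// Sketch: why MAX_VALUES_PER_LITERAL_RUN would be 512 when the
// literal-run header is kept to a single byte. The header is
//   (number_of_8_value_groups << 1) | 1
// so one bit marks the run as literal, and six bits remain for the
// group count if the header must fit in one byte.
#include <cstdio>

int main() {
  const int kMaxGroups = 1 << 6;                       // 64 groups
  const int kMaxValuesPerLiteralRun = kMaxGroups * 8;  // 512 values
  std::printf("MAX_VALUES_PER_LITERAL_RUN = %d\n", kMaxValuesPerLiteralRun);
  return 0;
}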

// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include <cstdio>
#include <cstring>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <parquet/api/writer.h>

using namespace parquet;

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <filename>" << std::endl;
    return -1;
  }

  std::string filename = argv[1];

  try {
    const int NUM_TO_ENCODE = 400;
    std::shared_ptr<OutputStream> ostream(new LocalFileOutputStream(filename));
    parquet::schema::NodeVector fields;
    parquet::schema::NodePtr schema;

    fields.push_back(parquet::schema::Int32("id", Repetition::REQUIRED));
    fields.push_back(parquet::schema::ByteArray("name", Repetition::OPTIONAL));

    schema = parquet::schema::GroupNode::Make("schema", Repetition::REPEATED, fields);

    std::unique_ptr<ParquetFileWriter> writer = ParquetFileWriter::Open(
        ostream, std::dynamic_pointer_cast<parquet::schema::GroupNode>(schema));

    RowGroupWriter* rgBlock = writer->AppendRowGroup(NUM_TO_ENCODE);
    ColumnWriter* colBlock = rgBlock->NextColumn();
    Int32Writer* intWriter = static_cast<Int32Writer*>(colBlock);
    std::vector<int32_t> intbuf;
    std::vector<int16_t> defbuf;
    std::vector<ByteArray> strbuf;
    for (int i = 0; i < NUM_TO_ENCODE; ++i) {
      intbuf.push_back(i);
      if (i % 10 == 0) {
        defbuf.push_back(0);  // definition level 0: null entry
      } else {
        defbuf.push_back(1);
        // 16 bytes comfortably holds the decimal digits plus the
        // terminating NUL (the original 4-byte buffer overflows once
        // i reaches 1000). Intentionally leaked so the ByteArray
        // stays valid until WriteBatch is called.
        uint8_t* buf = new uint8_t[16];
        ByteArray ba;
        sprintf((char*)buf, "%d", i);
        ba.ptr = buf;
        ba.len = strlen((const char*)ba.ptr);
        strbuf.push_back(ba);
      }
    }
    intWriter->WriteBatch(intbuf.size(), nullptr, nullptr, intbuf.data());
    intWriter->Close();
    colBlock = rgBlock->NextColumn();
    ByteArrayWriter* strWriter = static_cast<ByteArrayWriter*>(colBlock);
    std::cerr << "sizes: strings: " << strbuf.size() << " definitions: " << defbuf.size() << std::endl;
    strWriter->WriteBatch(defbuf.size(), defbuf.data(), nullptr, strbuf.data());
    strWriter->Close();
    writer->Close();  // write the file footer

  } catch (const std::exception& e) {
    std::cerr << "Parquet error: " << e.what() << std::endl;
    return -1;
  }

  return 0;
}

Environment: Mac OS X
Reporter: Mark Schaefer
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as PARQUET-676. Please see the migration documentation for further details.

asfimport commented 8 years ago

Wes McKinney / @wesm: It looks like the implementation of LevelEncoder::MaxBufferSize may be incorrect, in that it may not result in a large enough buffer being allocated.

https://github.com/apache/parquet-cpp/commit/35c8eb54aadcc18057b25db0cc6fd22239dee908#diff-33bc8df68e71e99c391f762f8e397488

@xhochy can you take a look at this to see what might be going wrong?
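For reference, the guarantee MaxBufferSize has to provide can be sketched as a worst-case bound in which every value lands in a literal run (repeated runs are more compact, so literal runs dominate the bound). WorstCaseRleSize and kMaxValuesPerLiteralRun are names invented for this illustration; this is not the parquet-cpp implementation:

#include <cstdint>

// Hypothetical worst case for the RLE/bit-packed hybrid encoding:
// everything is written as literal runs.
const int kMaxValuesPerLiteralRun = (1 << 6) * 8;  // 512, single-byte header

int64_t WorstCaseRleSize(int64_t num_values, int bit_width) {
  // One indicator byte per literal run of up to 512 values.
  int64_t indicator_bytes =
      (num_values + kMaxValuesPerLiteralRun - 1) / kMaxValuesPerLiteralRun;
  // Values are bit-packed in groups of 8; a group of 8 values at
  // bit_width bits each occupies exactly bit_width bytes, and a
  // partial final group still rounds up to a whole group.
  int64_t group_bytes = ((num_values + 7) / 8) * bit_width;
  return indicator_bytes + group_bytes;
}

An allocation below a bound of this shape would explain the symptom in this report: Put starts failing once enough values have been buffered.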

asfimport commented 8 years ago

Wes McKinney / @wesm: [~mschaefe] see my exposition on a bug report for the same problem in PARQUET-698. I'm leaving both JIRAs open until we confirm that both issues are caused by the same bug (it seems so, but until we know...).

EDIT: My previous comment was incorrect. I'm not in a place where I can work on a test case or patch, so it will be next week at the earliest before I can take a closer look at this / do the math on paper.

asfimport commented 8 years ago

Uwe Korn / @xhochy: Had a look at this, as I also ran into it today. Currently we seem to write too many literal indicators. Replacing if (++num_buffered_values_ == 8) with if (++num_buffered_values_ == MAX_VALUES_PER_LITERAL_RUN) fixed it. I will try to push a fix tomorrow once I have a reproducing unit test.

asfimport commented 8 years ago

Wes McKinney / @wesm: The current code there is OK (it only buffers a group of 8 values for a literal run before bit-packing them); the literal indicator is only written when FlushLiteralRun is called with true.
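A paraphrased sketch of that buffering scheme may make the point clearer. The names here are illustrative, not the actual parquet-cpp source: the == 8 check closes one bit-packing group, while the run's single indicator byte is deferred until FlushLiteralRun.

#include <cstdint>
#include <vector>

// Sketch of the buffering described above; PackGroup and the member
// names are invented for illustration.
class LiteralRunSketch {
 public:
  void Put(uint8_t value) {
    group_[num_buffered_values_] = value;
    if (++num_buffered_values_ == 8) {  // a full bit-packing group, not a full run
      PackGroup();
      num_buffered_values_ = 0;
      if (literal_count_ == kMaxValuesPerLiteralRun) {
        FlushLiteralRun(true);  // run is at maximum length: emit it
      }
    }
  }

  void FlushLiteralRun(bool update_indicator_byte) {
    if (update_indicator_byte) {
      // Single indicator byte: (group count << 1) | 1 marks a literal run.
      out_.push_back(static_cast<uint8_t>(((literal_count_ / 8) << 1) | 1));
      out_.insert(out_.end(), run_payload_.begin(), run_payload_.end());
      run_payload_.clear();
      literal_count_ = 0;
    }
  }

 private:
  static const int kMaxValuesPerLiteralRun = (1 << 6) * 8;  // 512

  void PackGroup() {
    // The real encoder bit-packs the 8 buffered values at the level
    // bit width; this sketch appends them verbatim.
    run_payload_.insert(run_payload_.end(), group_, group_ + 8);
    literal_count_ += 8;
  }

  uint8_t group_[8] = {};
  int num_buffered_values_ = 0;
  int literal_count_ = 0;
  std::vector<uint8_t> run_payload_;
  std::vector<uint8_t> out_;
};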

asfimport commented 8 years ago

Wes McKinney / @wesm: It might be worth diffing this code against the current Impala version to see where they diverge (it would seem weird if this bug existed there and had never surfaced).

asfimport commented 8 years ago

Wes McKinney / @wesm: OK, I've figured out the problem; I'll submit a patch at some point today.

asfimport commented 8 years ago

Wes McKinney / @wesm: PR: https://github.com/apache/parquet-cpp/pull/150

asfimport commented 8 years ago

Wes McKinney / @wesm: Issue resolved by pull request 150 https://github.com/apache/parquet-cpp/pull/150