Closed: sharpe5 closed this issue 10 months ago
Good point, marked your comment as an enhancement. Thanks for your contribution.
@sharpe5 Hi sharpe5, can you provide me with sample Parquet files in LZ4 or another compression codec? I need them for testing.
Here you go:
type=blockStream,rowCount=1000,compression=LZ4.zip
GitHub only accepts .zip attachments, so unzip it to get the .parquet file. It contains 6 columns of random doubles and a few thousand rows.
Anything else, let me know!
C++ code to create said file (helper functions such as drand() and the Stopwatch wrapper are omitted; demo only). The Arrow Parquet library was installed using vcpkg; the code compiles with both MSVC and gcc.
void demo3()
{
    using namespace std;
    using namespace fmt;
    using namespace System::Diagnostics;

    print("Demo 3: Open a file, flush blocks of rows to it until done:\n");
    {
        print(" - Test:\n");
        double r1 { drand() };
        print(" - r1={}\n", r1);
    }

    //const int maxRows = 1'000'000;
    const int maxRows = 500;

    // Build row-oriented data: one tuple per row, six random doubles per tuple.
    vector<tuple<double, double, double, double, double, double>> rows;
    {
        rows.reserve(maxRows);
        print(" - Creating raw data:\n");
        Stopwatch sw = Stopwatch::StartNew();
        for (int i = 0; i < maxRows; i++)
        {
            rows.push_back({drand(), drand(), drand(), drand(), drand(), drand()});
        }
        sw.Stop();
        print(" - rows.size(): {}\n", rows.size());
        print(" - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
    }

    // Convert the row-oriented tuples into a columnar Arrow table.
    shared_ptr<arrow::Table> arrowTable;
    {
        const vector<string> names = {"col1", "col2", "col3", "col4", "col5", "col6"};
        print(" - Creating Parquet table:\n");
        Stopwatch sw = Stopwatch::StartNew();
        if (!arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rows, names, &arrowTable).ok())
        {
            print(" - Error when creating table.\n");
            return;
        }
        sw.Stop();
        print(" - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
    }

    // Write the table to a Parquet file, LZ4-compressed, in two chunks.
    string filepath;
    {
        const string filename = format("type=blockStream,rowCount={},compression=LZ4.parquet", maxRows * 2); // As we are writing two chunks (see below).
        print(" - Write Parquet table:\n");
        Stopwatch sw = Stopwatch::StartNew();

        // Open the output file once; the original demo opened it a second time below,
        // which is unnecessary.
        std::shared_ptr<arrow::io::FileOutputStream> outfile;
        PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open(filename));

        parquet::WriterProperties::Builder propertiesBuilder;
        propertiesBuilder.compression(parquet::Compression::LZ4);
        const auto properties = propertiesBuilder.build();

        // https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
        std::unique_ptr<parquet::arrow::FileWriter> writer;
        PARQUET_THROW_NOT_OK(parquet::arrow::FileWriter::Open(*(arrowTable->schema()), arrow::default_memory_pool(), outfile, properties, parquet::default_arrow_writer_properties(), &writer));

        const int chunkSize = static_cast<int>(rows.size());
        PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
        // Demonstrates writing data in blocks: the same table is flushed a second time.
        PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
        PARQUET_THROW_NOT_OK(writer->Close());

        print(" - Compression: LZ4\n");
        print(" - Block size: {}\n", chunkSize);
        print(" - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());

        const string dir = System::IO::Directory::GetCurrentDirectoryAlt();
        filepath = Path::Combine(dir, filename);
    }
    print(" - Output file: {}\n", filepath);
}
Closing the issue since it has been open for over a year; will reopen if the feature makes it onto the roadmap.
Fantastic work on this utility, thanks for developing it!
I'm wondering if it would be possible to compile in support for LZ4 compression? The utility already supports Snappy, and LZ4 compresses roughly 50% faster than Snappy, so newer Parquet files may tend to use it instead. I was wondering why the utility couldn't read the files in the archive, and it turns out this was the cause. I fixed the issue by migrating the data from Snappy to LZ4.
I believe that the latest version of Arrow/Parquet supports all compression codecs by default?