Eugene-Mark / bigdata-file-viewer

A cross-platform (Windows, macOS, Linux) desktop application to view common big data binary formats like Parquet, ORC, AVRO, etc. Supports local file system, HDFS, AWS S3, Azure Blob Storage, etc.
GNU General Public License v2.0

Request support for LZ4 compression? #3

Closed: sharpe5 closed this issue 10 months ago

sharpe5 commented 4 years ago

Fantastic work on this utility, thanks for developing it!

I'm wondering if it would be possible to compile in support for LZ4 compression? The viewer already supports Snappy. LZ4 is about 50% faster at compression than Snappy, so newer Parquet files may tend to use it instead. I was wondering why the viewer couldn't read the files in my archive, and it turns out this was the cause: I had recently migrated the data from Snappy to LZ4.

I believe that the latest version of Arrow/Parquet supports all compression codecs by default?
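
For what it's worth, on the read side Arrow picks the codec up from the file metadata, so nothing LZ4-specific needs to be configured; a minimal C++ reader sketch (assuming an Arrow build compiled with LZ4 support, file name just an example) would look roughly like this:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>
#include <iostream>
#include <memory>

int main()
{
    // Open the (possibly LZ4-compressed) Parquet file.
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("sample_lz4.parquet"));

    // The codec is recorded in the file metadata, so the reader
    // decompresses transparently as long as LZ4 was compiled in.
    std::unique_ptr<parquet::arrow::FileReader> reader;
    PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

    std::shared_ptr<arrow::Table> table;
    PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

    std::cout << "rows: " << table->num_rows()
              << ", columns: " << table->num_columns() << std::endl;
    return 0;
}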

Eugene-Mark commented 4 years ago

Good point, marked your comment as an enhancement. Thanks for your contribution.

Eugene-Mark commented 4 years ago

@sharpe5 Hi sharpe5, could you provide me with sample Parquet files in LZ4 or other compression codecs? I need them for testing.

sharpe5 commented 4 years ago

Here you go:

type=blockStream,rowCount=1000,compression=LZ4.zip

GitHub only accepts .zip attachments, so unzip it to get the .parquet file. There should be 6 columns of random doubles and a few thousand rows.

Anything else, let me know!
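
In case it helps with testing, a quick way to confirm the row count, column count and codec of the attachment without decompressing anything is to read just the footer metadata. A rough sketch (again assuming the Arrow C++ setup above; file name taken from the unzipped attachment):

#include <arrow/io/api.h>
#include <parquet/file_reader.h>
#include <parquet/metadata.h>
#include <parquet/exception.h>
#include <iostream>
#include <memory>

int main()
{
    std::shared_ptr<arrow::io::ReadableFile> infile;
    PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("type=blockStream,rowCount=1000,compression=LZ4.parquet"));

    // Reads only the footer, so this works even when the codec itself
    // cannot be decoded by the local build.
    std::shared_ptr<parquet::FileMetaData> metadata = parquet::ReadMetaData(infile);
    std::cout << "rows: " << metadata->num_rows()
              << ", columns: " << metadata->num_columns() << std::endl;

    // The codec is stored per column chunk; check the first chunk.
    auto chunk = metadata->RowGroup(0)->ColumnChunk(0);
    std::cout << "first column chunk is LZ4: "
              << (chunk->compression() == parquet::Compression::LZ4 ? "yes" : "no") << std::endl;
    return 0;
}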

sharpe5 commented 4 years ago

C++ code to create said file (some helper functions are missing; demo only). The Arrow/Parquet library was installed using vcpkg. Compiles with MSVC and gcc.

// Includes needed for the Arrow/Parquet calls below. drand(), Stopwatch and
// the System::IO/Path helpers are my own utility wrappers and are not shown.
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/stl.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>
#include <fmt/format.h>

#include <memory>
#include <string>
#include <tuple>
#include <vector>

void demo3()
{
    using namespace std;
    using namespace fmt;
    using namespace System::Diagnostics;

    print("Demo 3: Open a file, flush blocks of rows to it until done:\n");

    {
        print("  - Test:\n");
        double r1 { drand() };
        print("    - r1={}\n", r1);
    }

    //const int maxRows = 1'000'000;
    const int maxRows = 500;
    vector<tuple<double, double, double, double, double, double>> rows; 
    {   
        rows.reserve(maxRows);

        print("  - Creating raw data:\n");
        Stopwatch sw = Stopwatch::StartNew();
        for (int i=0;i<maxRows;i++)
        {
            rows.push_back({drand(), drand(), drand(), drand(), drand(), drand()});
        }
        sw.Stop();
        print("    - rows.size(): {}\n", rows.size());
        print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
    }

    shared_ptr<arrow::Table> arrowTable;
    {
        const vector<string> names ={"col1", "col2", "col3", "col4", "col5", "col6"};       
        print("  - Creating Parquet table:\n");
        Stopwatch sw = Stopwatch::StartNew();
        if (!arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rows, names, &arrowTable).ok()) 
        {
            // Error handling code should go here.
            print("    - Error when creating table.\n");
            return;
        }
        sw.Stop();
        print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
    }

    string filepath;
    {
        std::shared_ptr<arrow::io::FileOutputStream> outfile;
        const string filename=format("type=blockStream,rowCount={},compression=LZ4.parquet",maxRows * 2); // As we are writing two chunks (see below).

        print("  - Write Parquet table:\n");
        Stopwatch sw = Stopwatch::StartNew();
        PARQUET_ASSIGN_OR_THROW(outfile,arrow::io::FileOutputStream::Open(filename));

        parquet::WriterProperties::Builder propertiesBuilder;
        propertiesBuilder.compression(parquet::Compression::LZ4);  // Select the LZ4 codec for all columns.
        const auto properties = propertiesBuilder.build();

        // https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
        // Reuse the stream opened above rather than opening the file a second time.
        std::unique_ptr<parquet::arrow::FileWriter> writer;
        PARQUET_THROW_NOT_OK(parquet::arrow::FileWriter::Open(*(arrowTable->schema()), ::arrow::default_memory_pool(), outfile, properties, parquet::default_arrow_writer_properties(), &writer));

        const int chunkSize = static_cast<int>(rows.size());
        PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
        // Write the same table again to demonstrate writing data in blocks.
        PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
        PARQUET_THROW_NOT_OK(writer->Close());

        print("    - Compression: LZ4\n");
        print("    - Block size: {}\n", chunkSize);
        print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
        const string dir = System::IO::Directory::GetCurrentDirectoryAlt();
        filepath = Path::Combine(dir, filename);
    }

    {
        print("  - Output file: {}\n", filepath);
    }
}
Eugene-Mark commented 10 months ago

Closing the issue since it has been open for years; will reopen it once the feature is on the roadmap.