
Parquet Extension for Mule 4.x

Mule SDK connector that provides the ability to read Parquet files into JSON or write Parquet files from Avro data.

Overview

Apache Parquet is a columnar data storage format, which stores tabular data column-wise rather than row-wise. Values of the same column, and therefore of the same data type, are stored together, which allows for better storage efficiency, compression, and data retrieval.

Using the Parquet format has two main advantages: column-wise storage compresses well, which reduces storage size, and queries that touch only a subset of columns can skip the rest, which speeds up data retrieval.

Installation Instructions

  1. Clone the repository
  2. Build and install the connector to your local Maven repository: mvn clean install
  3. Add the connector dependency to your project's pom.xml file:
<dependency>
    <groupId>com.dejim</groupId>
    <artifactId>parquet</artifactId>
    <version>1.0.24-SNAPSHOT</version>
    <classifier>mule-plugin</classifier>
</dependency>
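
With the dependency in place, the connector's namespace also has to be declared in your Mule configuration file. A minimal sketch is shown below; the parquet prefix and namespace URI follow the usual Mule SDK naming convention but are assumptions here, so check the schema generated by your local build for the exact values.

<?xml version="1.0" encoding="UTF-8"?>
<mule xmlns="http://www.mulesoft.org/schema/mule/core"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:parquet="http://www.mulesoft.org/schema/mule/parquet"
      xsi:schemaLocation="
        http://www.mulesoft.org/schema/mule/core http://www.mulesoft.org/schema/mule/core/current/mule.xsd
        http://www.mulesoft.org/schema/mule/parquet http://www.mulesoft.org/schema/mule/parquet/current/mule-parquet.xsd">

    <!-- Flows that use the parquet: operations go here -->

</mule>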

Reporting Issues

You can report new issues at https://github.com/djuang1/parquet/issues.

Operations

Read Parquet - Stream

This operation allows you to read a Parquet file from an InputStream (e.g. #[payload]). The data can come from S3 or any other connector that provides streaming, instead of being read from the file system. It returns the data in JSON format.
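
A minimal sketch of how this operation might be used in a flow is shown below. The element name parquet:read-parquet-stream and its inputStream parameter are assumptions for illustration (the actual DSL names come from the connector's operation definitions), and the S3 read is just one example of a streaming source.

<flow name="read-parquet-from-s3">
    <!-- Fetch the Parquet file as a stream; any streaming source works here -->
    <s3:get-object config-ref="Amazon_S3_Configuration" bucketName="my-bucket" key="data/input.parquet"/>

    <!-- Hypothetical operation name: read the Parquet stream in the payload and return JSON -->
    <parquet:read-parquet-stream inputStream="#[payload]"/>

    <logger level="INFO" message="#[payload]"/>
</flow>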

Write Avro to Parquet - Stream

This operation allows you to write a Parquet file to a stream (e.g. #[payload]). Instead of writing to disk, you can output the data directly to S3 or any other connector that provides streaming capabilities.
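
The sketch below assumes the payload already contains Avro data (see the DataWeave example under "Write Avro to Parquet - File" below) and that the operation is exposed as parquet:write-avro-to-parquet-stream, which is a hypothetical name; the resulting stream is handed straight to S3 rather than written to disk.

<flow name="write-parquet-to-s3">
    <!-- Hypothetical operation name: convert the Avro payload to Parquet and return it as a stream -->
    <parquet:write-avro-to-parquet-stream/>

    <!-- Upload the stream directly without touching the local file system -->
    <s3:put-object config-ref="Amazon_S3_Configuration" bucketName="my-bucket" key="data/output.parquet"/>
</flow>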

Read Parquet - File

This operation allows you to read a Parquet file from the local file system. It returns the data in JSON format.
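
A minimal usage sketch, assuming the operation is exposed as parquet:read-parquet-file with a filePath parameter (both names are illustrative):

<flow name="read-parquet-from-disk">
    <!-- Hypothetical operation and parameter names: read a local Parquet file and return JSON -->
    <parquet:read-parquet-file filePath="/tmp/data/input.parquet"/>

    <logger level="INFO" message="#[payload]"/>
</flow>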

Write Avro to Parquet - File

Writing data to a Parquet file isn't a straightforward process: it requires a schema to be defined for the data. This operation allows you to leverage the Avro format support in MuleSoft to format the data using DataWeave before writing it to a Parquet file.
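
The sketch below shows the general pattern: a DataWeave transform produces Avro output against a schema on the classpath, and the result is then written to a Parquet file. The schema location and the parquet:write-avro-to-parquet-file element with its filePath parameter are assumptions for illustration.

<flow name="write-parquet-to-disk">
    <!-- Format the JSON payload as Avro; the schema file is illustrative and must match your data -->
    <ee:transform>
        <ee:message>
            <ee:set-payload><![CDATA[%dw 2.0
output application/avro schemaUrl="classpath://schema/person.avsc"
---
payload]]></ee:set-payload>
        </ee:message>
    </ee:transform>

    <!-- Hypothetical operation and parameter names: write the Avro payload to a Parquet file -->
    <parquet:write-avro-to-parquet-file filePath="/tmp/data/output.parquet"/>
</flow>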

Author: Dejim Juang - dejimj@gmail.com
Last Update: October 22, 2022