aquasecurity / tracee

Linux Runtime Security and Forensics using eBPF
https://aquasecurity.github.io/tracee/latest
Apache License 2.0

Add Parquet output #3682

Open shani-gold opened 8 months ago

shani-gold commented 8 months ago

Parquet is a columnar storage file format that is commonly used in the context of big data processing frameworks, such as Apache Spark and Apache Hive. The format is designed to be highly efficient for both storage and processing, especially in scenarios involving large-scale data analytics. Here are some reasons why Parquet is often used:

Columnar Storage: Parquet stores data in a columnar format, which means that values from the same column are stored together. This storage layout is more efficient for analytics queries that often involve reading specific columns rather than entire rows.

Compression: Parquet supports various compression algorithms, enabling efficient use of storage space. It reduces the amount of disk space needed to store large datasets, making it cost-effective.

Predicate Pushdown: Some query engines, like Apache Spark, can take advantage of predicate pushdown with Parquet. This means that certain filter conditions can be pushed down to the storage layer, minimizing the amount of data that needs to be read during query execution.

Schema Evolution: Parquet supports schema evolution, allowing you to evolve your data schema over time without modifying existing data or breaking backward compatibility.

Compatibility with Big Data Ecosystem: Parquet is widely used in the big data ecosystem, and many big data processing frameworks have built-in support for reading and writing Parquet files. This makes it easier to integrate Parquet with existing data processing workflows.

Performance: Due to its columnar storage and other optimizations, Parquet can offer improved performance for analytics queries, especially when dealing with large datasets.

When working with large-scale data analytics, Parquet can be a suitable choice for storing and processing your data efficiently. It provides benefits in terms of storage space, query performance, and compatibility with popular big data tools and frameworks. However, the choice of file format depends on your specific use case and the tools you are using in your data processing pipeline.
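
To make the columnar-read benefit concrete, here is a minimal, self-contained Go sketch (Go being Tracee's language) using the github.com/parquet-go/parquet-go library. The event schema and file name are hypothetical, purely for illustration:

```go
// A toy demonstration of columnar reads with a hypothetical event schema,
// using the github.com/parquet-go/parquet-go library.
package main

import (
	"fmt"
	"log"

	"github.com/parquet-go/parquet-go"
)

// Full row schema used when writing.
type event struct {
	Timestamp int64  `parquet:"timestamp"`
	EventName string `parquet:"event_name"`
	ArgsJSON  string `parquet:"args_json"`
}

// Narrower schema used when reading: only the column we care about.
type eventNameOnly struct {
	EventName string `parquet:"event_name"`
}

func main() {
	// Write a few rows to a parquet file.
	err := parquet.WriteFile("events.parquet", []event{
		{Timestamp: 1, EventName: "openat", ArgsJSON: "{}"},
		{Timestamp: 2, EventName: "execve", ArgsJSON: "{}"},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read back only event_name: because each column is stored in its own
	// contiguous chunks, a columnar reader can decode just this column
	// instead of scanning whole rows.
	rows, err := parquet.ReadFile[eventNameOnly]("events.parquet")
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range rows {
		fmt.Println(r.EventName)
	}
}
```

Because values of each column are stored together, the reader above only has to decode the event_name column chunks; the timestamp and args data can be skipped entirely.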

rafaeldtinoco commented 8 months ago

Hello @shani-gold,

This isn't currently planned by us in the short term, but it wouldn't be complicated to implement. Is this something you're willing to work on?

If you check pkg/printer/printer.go you will find multiple printer flavors, like json, gob, and table. You could implement a parquet printer there, perhaps by following the json printer (and possibly converting json to parquet?). It's likely that the parquet data schema would have to follow our json schema, so it wouldn't break very often.
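
For a rough idea, a parquet printer could look something like the sketch below. The method names mirror a generic printer interface and github.com/parquet-go/parquet-go is just one possible library, so treat every name here as an assumption rather than Tracee's actual API:

```go
// Hypothetical sketch of a parquet printer modeled on the json printer.
// The printer method names and library choice are assumptions.
package printer

import (
	"encoding/json"
	"io"

	"github.com/parquet-go/parquet-go"

	"github.com/aquasecurity/tracee/types/trace"
)

// parquetRow is a hypothetical flattened schema: scalar columns for common
// fields, with the variably-shaped args serialized into a JSON string column
// (parquet requires a fixed column schema, unlike free-form JSON).
type parquetRow struct {
	Timestamp   int64  `parquet:"timestamp"`
	ProcessID   int32  `parquet:"process_id"`
	ProcessName string `parquet:"process_name"`
	EventName   string `parquet:"event_name"`
	ArgsJSON    string `parquet:"args_json"`
}

type parquetEventPrinter struct {
	writer *parquet.GenericWriter[parquetRow]
}

func newParquetEventPrinter(out io.Writer) *parquetEventPrinter {
	return &parquetEventPrinter{
		writer: parquet.NewGenericWriter[parquetRow](out),
	}
}

// Print buffers one event as a single parquet row.
func (p *parquetEventPrinter) Print(event trace.Event) {
	args, _ := json.Marshal(event.Args)
	_, _ = p.writer.Write([]parquetRow{{
		Timestamp:   int64(event.Timestamp),
		ProcessID:   int32(event.ProcessID),
		ProcessName: event.ProcessName,
		EventName:   event.EventName,
		ArgsJSON:    string(args),
	}})
}

// Close flushes any buffered rows and writes the parquet footer/metadata.
func (p *parquetEventPrinter) Close() {
	_ = p.writer.Close()
}
```

Flattening the event into a fixed row struct (with the args serialized as a JSON string) sidesteps parquet's need for a fixed column schema and keeps the layout close to the json printer's output.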

Also, be aware that we're currently changing the "Tracee" event structure (https://github.com/aquasecurity/tracee/issues/2870), which would mean that the parquet data schema would have to mimic the new one.

Hope this helps for now; we can keep this open if you're not doing it and someone else is willing to (or our priorities change).

Thanks!

shani-gold commented 8 months ago

Hi Rafael, I already implemented it :) https://github.com/aquasecurity/tracee/pull/3685

rafaeldtinoco commented 8 months ago

> Hi Rafael, I already implemented it :) #3685

Oh, that was easy on my side =D. Ok, I'll try to review soon! Thanks for the work! Excited to check it.

rafaeldtinoco commented 8 months ago

@shani-gold would you mind sharing the use case and project? Just out of curiosity. I'm interested in your use case: are you doing OLAP processing of events? Is that the reason you wanted to contribute such a feature?

shani-gold commented 8 months ago

It's for a profiler.