apache / arrow-site

Mirror of Apache Arrow site
Apache License 2.0
31 stars, 106 forks

Add iceburst to powered by list #474

Closed prasburst closed 4 months ago

prasburst commented 4 months ago

Included details about iceburst.io

prasburst commented 4 months ago

Hi,

This is done by iceburst, which is one of our core value propositions.


From: Sutou Kouhei
Sent: Saturday, February 10, 2024 10:07:09 PM
To: apache/arrow-site
Cc: prasburst; Author
Subject: Re: [apache/arrow-site] Add iceburst to powered by list (PR #474)

@kou commented on this pull request.


In powered_by.md (https://github.com/apache/arrow-site/pull/474#discussion_r1485483982):

@@ -129,6 +129,10 @@ short description of your use case.
   natural language processing, and tabular tasks. Dataset objects are wrappers around Arrow Tables and memory-mapped from disk to support out-of-core parallel processing for machine learning workflows.
+* [iceburst][53]: A real-time data lake for monitoring and security built
+  directly on top of Amazon S3. Our approach is simple: ingest the OpenTelemetry data in an S3 bucket as
+  Parquet files in Iceberg table format and query them using DuckDB with millisecond retrieval and zero egress cost.
+  Parquet is converted to Arrow format in-memory enhancing both speed and efficiency.

Is this done by DuckDB or iceburst? If you mean that DuckDB does it, it may be wrong. I think that DuckDB doesn't use Apache Arrow as its internal data format.


kou commented 4 months ago

Does iceburst use DuckDB's Arrow integration feature https://duckdb.org/2021/12/03/duck-arrow.html ?

prasburst commented 4 months ago

Yes, a lot of work is made easy by the zero-copy integration.

We export the query results to an Arrow table using the arrow function. In some cases, especially for aggregation queries built with DuckDB's relational API, we use the to_arrow_table function to export the query results and keep everything in memory in Arrow format.

Here's a reference to Arrow export: https://duckdb.org/docs/guides/python/export_arrow