|Banner|

|CI| |PyPI| |Latest Tag| |Coverage| |Slack|

`Website <https://www.getdaft.io>`_ • `Docs <https://www.getdaft.io/projects/docs/>`_ • `Installation`_ • `10-minute tour of Daft <https://www.getdaft.io/projects/docs/en/latest/learn/10-min.html>`_ • `Community and Support <https://github.com/Eventual-Inc/Daft/discussions>`_

`Daft <https://www.getdaft.io>`_ is a distributed query engine for large-scale data processing using Python or SQL, implemented in Rust.

* `Apache Arrow <https://arrow.apache.org/docs/index.html>`_ In-Memory Format
* `Record-setting <https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io>`_ I/O performance for integrations with S3 cloud storage

Table of Contents

* `About Daft`_
* `Getting Started`_
* `Benchmarks`_
* `Related Projects`_
* `License`_

About Daft
----------

Daft was designed with the following principles in mind:

* `Ray <https://www.ray.io>`_ for running dataframes on large clusters of machines with thousands of CPUs/GPUs.

Getting Started
---------------

Installation
^^^^^^^^^^^^

Install Daft with ``pip install getdaft``.

For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our `Installation Guide <https://www.getdaft.io/projects/docs/en/latest/install.html>`_.

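
To confirm which version ended up installed, the package metadata can be queried from Python. The sketch below is illustrative only: the helper name ``installed_version`` is hypothetical, and it assumes the distribution is registered under the name ``getdaft`` used in the pip command above.

.. code:: python

    from importlib import metadata


    def installed_version(package="getdaft"):
        """Return the installed version string of `package`, or None if it is missing."""
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None


    version = installed_version()
    if version is None:
        print("getdaft is not installed in this environment")
    else:
        print(f"getdaft version: {version}")

Running this inside the same virtual environment used for ``pip install`` avoids checking a different interpreter by mistake.
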
Quickstart
^^^^^^^^^^

Check out our `10-minute quickstart <https://www.getdaft.io/projects/docs/en/latest/learn/10-min.html>`_!

In this example, we load images from an AWS S3 bucket's URLs and resize each image in the dataframe:

.. code:: python

    import daft

    # Load a dataframe from filepaths in an S3 bucket
    df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

    # 1. Download column of image URLs as a column of bytes
    # 2. Decode the column of bytes into a column of images
    df = df.with_column("image", df["path"].url.download().image.decode())

    # Resize each image into 32x32
    df = df.with_column("resized", df["image"].image.resize(32, 32))

    df.show(3)

|Quickstart Image|

Benchmarks
----------

|Benchmark Image|

To see the full benchmarks, detailed setup, and logs, check out our `benchmarking page <https://www.getdaft.io/projects/docs/en/latest/faq/benchmarks.html>`_.

More Resources
^^^^^^^^^^^^^^

* `10-minute tour of Daft <https://www.getdaft.io/projects/docs/en/latest/learn/10-min.html>`_ - learn more about Daft's full range of capabilities, including data loading from URLs, joins, user-defined functions (UDFs), groupby, aggregations and more.
* `User Guide <https://www.getdaft.io/projects/docs/en/latest/user_guide/index.html>`_ - take a deep dive into each topic within Daft
* `API Reference <https://www.getdaft.io/projects/docs/en/latest/api_docs/index.html>`_ - API reference for public classes/functions of Daft

To start contributing to Daft, please read `CONTRIBUTING.md <https://github.com/Eventual-Inc/Daft/blob/main/CONTRIBUTING.md>`_.

Here's a list of `good first issues <https://github.com/Eventual-Inc/Daft/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22>`_ to get yourself warmed up with Daft. Comment in the issue to pick it up, and feel free to ask any questions!

To help improve Daft, we collect non-identifiable data.

To disable this behavior, set the following environment variable: ``DAFT_ANALYTICS_ENABLED=0``

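
As a minimal sketch, the variable can also be set from within Python rather than in the shell. This assumes the setting is read when the library is loaded, which is not stated here, so the assignment is placed before the import; exporting the variable in the shell before starting Python avoids relying on that assumption.

.. code:: python

    import os

    # Opt out of analytics using the environment variable named above.
    # Assumption: this must happen before `import daft` to take effect.
    os.environ["DAFT_ANALYTICS_ENABLED"] = "0"

    try:
        import daft  # noqa: F401  (imported only after the variable is set)
    except ImportError:
        daft = None  # Daft is not installed in this environment
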
The data that we collect is:

Please see our `documentation <https://www.getdaft.io/projects/docs/en/latest/faq/telemetry.html>`_ for more details.

.. image:: https://static.scarf.sh/a.png?x-pxid=cd444261-469e-473b-b9ba-f66ac3dc73ee


Related Projects
----------------

+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| Dataframe                                         | Query Optimizer | Multimodal    | Distributed | Arrow Backed    | Vectorized Execution Engine | Out-of-core |
+===================================================+=================+===============+=============+=================+=============================+=============+
| Daft                                              | Yes             | Yes           | Yes         | Yes             | Yes                         | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Pandas <https://github.com/pandas-dev/pandas>`_  | No              | Python object | No          | optional >= 2.0 | Some(Numpy)                 | No          |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Polars <https://github.com/pola-rs/polars>`_     | Yes             | Python object | No          | Yes             | Yes                         | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Modin <https://github.com/modin-project/modin>`_ | Eager           | Python object | Yes         | No              | Some(Pandas)                | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Pyspark <https://github.com/apache/spark>`_      | Yes             | No            | Yes         | Pandas UDF/IO   | Pandas UDF                  | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Dask DF <https://github.com/dask/dask>`_         | No              | Python object | Yes         | No              | Some(Pandas)                | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+

Check out our `dataframe comparison page <https://www.getdaft.io/projects/docs/en/latest/faq/dataframe_comparison.html>`_ for more details!

License
-------

Daft has an Apache 2.0 license - please see the LICENSE file.

.. |Quickstart Image| image:: https://github.com/Eventual-Inc/Daft/assets/17691182/dea2f515-9739-4f3e-ac58-cd96d51e44a8
   :alt: Dataframe code to load a folder of images from AWS S3 and create thumbnails
   :height: 256

.. |Benchmark Image| image:: https://github-production-user-asset-6210df.s3.amazonaws.com/2550285/243524430-338e427d-f049-40b3-b555-4059d6be7bfd.png
   :alt: Benchmarks for SF100 TPCH

.. |Banner| image:: https://github.com/user-attachments/assets/ac676800-b799-454e-a6e0-9a58974a4154
   :target: https://www.getdaft.io
   :alt: Daft dataframes can load any data such as PDF documents, images, protobufs, csv, parquet and audio files into a table dataframe structure for easy querying

.. |CI| image:: https://github.com/Eventual-Inc/Daft/actions/workflows/python-package.yml/badge.svg
   :target: https://github.com/Eventual-Inc/Daft/actions/workflows/python-package.yml?query=branch:main
   :alt: Github Actions tests

.. |PyPI| image:: https://img.shields.io/pypi/v/getdaft.svg?label=pip&logo=PyPI&logoColor=white
   :target: https://pypi.org/project/getdaft
   :alt: PyPI

.. |Latest Tag| image:: https://img.shields.io/github/v/tag/Eventual-Inc/Daft?label=latest&logo=GitHub
   :target: https://github.com/Eventual-Inc/Daft/tags
   :alt: latest tag

.. |Coverage| image:: https://codecov.io/gh/Eventual-Inc/Daft/branch/main/graph/badge.svg?token=J430QVFE89
   :target: https://codecov.io/gh/Eventual-Inc/Daft
   :alt: Coverage

.. |Slack| image:: https://img.shields.io/badge/slack-@distdata-purple.svg?logo=slack
   :target: https://join.slack.com/t/dist-data/shared_invite/zt-1t44ss4za-1rtsJNIsQOnjlf8BlG05yw
   :alt: slack community