apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Getting started guide for new users (who want to use DataFusion in their project) #7014

Open alamb opened 1 year ago

alamb commented 1 year ago

Is your feature request related to a problem or challenge?

If we want to have DataFusion used as the core of many new systems, we need it to be as easy as possible for someone to get their idea working on top of DataFusion.

I think the current user guide helps set up the basics of the project and get a "hello world" style program going, but it then leaves the reader in a "now what?" situation: https://arrow.apache.org/datafusion/user-guide/example-usage.html

Describe the solution you'd like

I would like a document, perhaps similar in style to the polars user guide: https://pola-rs.github.io/polars-book/user-guide/

This User Guide is an introduction to the Polars DataFrame library. Its goal is to introduce you to Polars by going through examples and comparing it to other solutions. Some design choices are introduced here. The guide will also introduce you to optimal usage of Polars.

Basically, I am thinking of something that would have helped @BubbaJoe get up to speed.

The examples directory holds a bunch of examples: https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples

Potential outline:

Describe alternatives you've considered

No response

Additional context

This idea was suggested by @MrPowers

alamb commented 1 year ago

If someone wants to help out the DataFusion project, helping with this one would be awesome. A good first step would be to create a skeleton of the topics above in https://github.com/apache/arrow-datafusion/tree/main/docs and leave placeholder text (like "Coming Soon").

Then we can work together on writing the content across a few different PRs.

MrPowers commented 1 year ago

This sounds great, really excited!

We'll either want two user guides or one user guide that's half in Python / half in Rust.

I guess that 99% of the users that want to query data via an API will want to do so in SQL / Python. The Python DataFrame user guide is way more important than the Rust one.

Users leveraging DataFusion to build tools for other engines (e.g. delta-rs) are much more likely to be using Rust.

Perhaps we divide the documentation as follows:

I don't think we should invest in building out the DataFusion Rust DataFrame API docs yet because it's a lower ROI activity. We should, however, build a URL structure that allows for this.

alamb commented 1 year ago

The Python DataFrame user guide is way more important than the Rust one.

I agree this is more important for "end users" than for developers who are building with Rust.

Perhaps we divide the documentation as follows:

That sounds great. I filed https://github.com/apache/arrow-datafusion-python/issues/432 to track the work for the Python bindings.

alamb commented 1 year ago

I filed a bunch of tickets for follow-on work and updated the description of this ticket: https://github.com/apache/arrow-datafusion/issues/7302 https://github.com/apache/arrow-datafusion/issues/7304 https://github.com/apache/arrow-datafusion/issues/7305 https://github.com/apache/arrow-datafusion/issues/7306 https://github.com/apache/arrow-datafusion/issues/7307 https://github.com/apache/arrow-datafusion/issues/7308