apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Datafusion Native llm.txt #13501

Open ChristianCasazza opened 1 day ago

ChristianCasazza commented 1 day ago

Is your feature request related to a problem or challenge?

LLMs provide a fantastic way to learn and use a new codebase. Given the documentation, they can create a custom guide that teaches new users how to use a library or API, answer specific questions, and help address bugs with custom implementations of the library. One emerging trend in this paradigm is to maintain two versions of the documentation. The traditional version is built for humans: it splits the codebase into separate sections that are easy for a person to browse. The other version is optimized for LLMs: a single markdown file that strips the human-oriented formatting and puts all of the information in one central file, which can easily be copied and pasted into an LLM chat so the model can then teach the user the library.

DuckDB recently released their version of an llm.txt here. It is essentially one huge markdown file that includes all of their documentation, totaling about 700k tokens. Caleb Fahlgren from Hugging Face extracted the data into an organized version here.

I would like to propose making a DataFusion version of these LLM docs.

Describe the solution you'd like

I would propose making a few versions of the llm.txt for DataFusion. We should make one version that includes all of the docs in one large md file. This is a strong start; however, it is likely to be so large that it doesn't fit within the context windows of common chat LLMs. Therefore, I would also suggest making smaller versions for the different sections of the docs, such as architecture, API, etc. The individual sections would make it easier for a user to selectively provide the context for the specific part of DataFusion they are working with, so as not to overload the LLM context.
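To make the proposal concrete, here is a minimal sketch (Rust, standard library only) of what such a generation step could look like: it walks an assumed `docs/source/` tree and writes one combined `llm.txt` plus one smaller file per top-level section. The paths, output file names, and section layout are assumptions for illustration, not an existing DataFusion tool.

```rust
// Hypothetical sketch: concatenate documentation sources into one llm.txt
// plus one file per top-level docs section. Paths are assumptions, not the
// actual DataFusion repository layout.
use std::fs;
use std::io::Write;
use std::path::Path;

fn collect_markdown(dir: &Path, out: &mut String) -> std::io::Result<()> {
    // Gather entries and sort for a deterministic concatenation order.
    let mut entries: Vec<_> = fs::read_dir(dir)?.filter_map(Result::ok).collect();
    entries.sort_by_key(|e| e.path());
    for entry in entries {
        let path = entry.path();
        if path.is_dir() {
            collect_markdown(&path, out)?;
        } else if matches!(path.extension().and_then(|e| e.to_str()), Some("md" | "rst")) {
            // Record where each chunk came from so readers (and LLMs) can trace it.
            out.push_str(&format!("\n\n<!-- source: {} -->\n\n", path.display()));
            out.push_str(&fs::read_to_string(&path)?);
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let root = Path::new("docs/source"); // assumed docs location
    let mut combined = String::new();
    for entry in fs::read_dir(root)?.filter_map(Result::ok) {
        let path = entry.path();
        if !path.is_dir() {
            continue;
        }
        // One smaller file per section (e.g. user-guide, library-user-guide, ...)
        let mut section = String::new();
        collect_markdown(&path, &mut section)?;
        let name = path.file_name().unwrap().to_string_lossy();
        fs::File::create(format!("llm-{name}.txt"))?.write_all(section.as_bytes())?;
        combined.push_str(&section);
    }
    // The single large file containing everything.
    fs::File::create("llm.txt")?.write_all(combined.as_bytes())?;
    Ok(())
}
```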

After some trial and error, I would also suggest creating fully LLM-optimized versions. These would mix conceptual explanations of DataFusion, example code snippets, and the raw API interface. The goal for the final versions would be simple templates that can be copied and pasted into a chat, priming the LLM with the context of the latest version of DataFusion along with knowledge of working examples.
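As an illustration of the kind of working example such a template could embed next to the conceptual explanations, here is a minimal DataFusion query in Rust (the CSV path is a placeholder):

```rust
// Example of a self-contained snippet an LLM-optimized doc could embed:
// register a CSV file and run a SQL query with DataFusion.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // "example.csv" is a placeholder path.
    ctx.register_csv("example", "example.csv", CsvReadOptions::new())
        .await?;
    let df = ctx
        .sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a")
        .await?;
    df.show().await?;
    Ok(())
}
```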

Describe alternatives you've considered

One popular alternative to LLM docs is for companies to simply build their own chatbot that has the context of their docs. While this is useful, I think it misses the point. I believe we can assume that, going forward, developers are already paying for their own LLM, whether through chat (ChatGPT, Claude), IDEs (Cursor), or their own setup working with the LLM APIs.

Therefore, I don't think it is the best model for each company to host its own chatbot LLM. It becomes difficult for a user to combine context across the different libraries they use together, and they must iterate in a company's chosen interface instead of the LLM interface they are already comfortable with.

Instead, it would be better to provide the raw context and allow users to bring it into the LLM interface they are already using.

Additional context

I think LLM-paired development is the future of data engineering, and having first-class LLM support is vital for the adoption of DataFusion. As an example, consider pandas and Polars. Even though Polars offers massive improvements over pandas, there are an order of magnitude more public code examples for pandas than for Polars. As a result, LLMs will often suggest pandas code first and often produce better working code with it than with Polars. Even though Polars is the better underlying library, I think many new developers will just use whatever works best with LLMs. I believe this is part of why DuckDB has been so popular, as LLMs are already much better at generating SQL than dataframe code.

By creating first-class LLM support for DataFusion, I think it can be positioned to gain developer mindshare as modern Arrow-based engines become the common-sense choice.

timsaucer commented 1 day ago

This sounds great. I suspect we would want to do something also in the datafusion-python repository.

tbar4 commented 1 day ago

Thirded for Ballista Docs

2010YOUY01 commented 1 day ago

This is a great idea.

I would propose making a few versions of the llm.txt for DataFusion. We should make one version that includes all of the docs in one large md file. This is a strong start; however, it is likely to be so large that it doesn't fit within the context windows of common chat LLMs.

I've been using Cursor + Claude for a while, and its code generation for DataFusion is shockingly good. It uses RAG to index everything, so context length is likely not a problem for Cursor. I'm curious whether adding more indexable documents could make it even better, for example this very high-quality reading list: https://datafusion.apache.org/user-guide/concepts-readings-events.html