apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0

[SIP-128] AI/LLM query generation in SQL lab #28167

Open surapuramakhil opened 5 months ago

surapuramakhil commented 5 months ago

Please make sure you are familiar with the SIP process documented here. The SIP will be numbered by a committer upon acceptance.

<h2>[SIP] Proposal for AI/LLM query generation in SQL lab</h2> <h3>Motivation</h3> <p>To make Apache Superset dashboard/chart creation possible for users without a dev/SQL background. <a rel="noreferrer nofollow" target="_blank" href="https://github.com/apache/superset/discussions/27272">https://github.com/apache/superset/discussions/27272</a></p> <h3>Proposed Change</h3> <p>Describe how the feature will be implemented, or the problem will be solved. If possible, include mocks, screenshots, or screencasts (even if from different tools).</p> <ol> <li>A prompt box on top of the current SQL editor, along with a "generate query" button.</li> </ol> <p><img referrerpolicy="no-referrer" src="https://github.com/apache/superset/assets/9161543/2634d7a7-7088-45e2-9be4-b9f15c300ff5" alt="Screenshot 2024-04-21 at 3 33 17 AM" /> <strong>The current SQL Lab, showing the SQL editor box</strong></p> <ol start="2"> <li> <p>Forward the user prompt to the LLM along with system prompts that share database schema information (consider it RAG) for quality responses. (Optional) Some additional queries/prompts that help the model understand the data, e.g. sharing the first 10 rows, or the distinct values of a column, etc. (whatever is necessary).</p> </li> <li> <p>Populate the editor with the query generated by the model.</p> </li> </ol> <p><strong>Query Generation</strong>: there are already pipelines in LangChain for this: <a rel="noreferrer nofollow" target="_blank" href="https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query">https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query</a>.</p> <p>We can use these pipelines to generate queries from text. They work with any LLM model, so we can just piggyback on them.
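To make the llm_provider / llm_factory idea concrete, here is a minimal sketch in the spirit of the proposal. Every name in it (LLMConfig, llm_factory, the two wrapper classes) is hypothetical, invented for illustration; a real implementation would return LangChain model objects that its SQL pipeline can consume.

```python
# Hypothetical sketch of the proposed llm_factory: the Superset admin
# configures a provider plus endpoint/keys, and the factory returns an LLM
# handle to pass into the text-to-SQL pipeline. None of these names exist
# in Superset or LangChain today; they only illustrate the abstraction layer.
from dataclasses import dataclass


@dataclass
class LLMConfig:
    provider: str       # e.g. "openai" or "self_hosted"
    endpoint: str       # API endpoint configured by the Superset admin
    api_key: str = ""   # secret key, kept server-side only


class SelfHostedLLM:
    """Thin wrapper so a self-hosted model looks like any other LLM."""
    def __init__(self, cfg: LLMConfig):
        self.cfg = cfg


class OpenAICompatibleLLM:
    """Any service exposing the common OpenAI-style chat API."""
    def __init__(self, cfg: LLMConfig):
        self.cfg = cfg


_REGISTRY = {
    "self_hosted": SelfHostedLLM,
    "openai": OpenAICompatibleLLM,
}


def llm_factory(cfg: LLMConfig):
    """Return an LLM handle for the configured provider, failing loudly."""
    try:
        return _REGISTRY[cfg.provider](cfg)
    except KeyError:
        raise ValueError(f"unknown LLM provider: {cfg.provider!r}") from None
```

Under this sketch, staying vendor-agnostic reduces to registering one small wrapper per provider, while the rest of the pipeline only ever sees what the factory returns.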
All I am planning is to have an llm_provider or llm_factory which creates an LLM based on user configuration and sends it to their pipeline.</p> <p>A few technical implementations / considerations:</p> <ul> <li>LLM access will be provided by the Superset admin/owner configuring an API endpoint (this also includes other necessary information like access & secret keys). This could also go via LangChain, as I have seen some PRs do.</li> </ul> <p>LLM access as an API gives users the choice to use existing services rather than deploying their own; packaging an LLM in the Superset deployment is not feasible.</p> <ul> <li>The OpenAI-style model interface has become pretty common, and a lot of other models follow it. (Inputs required on this.)</li> </ul> <p>Backend Architecture Diagram</p> <p><img referrerpolicy="no-referrer" src="https://github.com/apache/superset/assets/9161543/41e6ec4d-7cea-4fef-b6fa-700a38d15014" alt="image" /></p> <h3>New or Changed Public Interfaces</h3> <p>Describe any new additions to the model, views or <code>REST</code> endpoints. Describe any changes to existing visualizations, dashboards and React components. Describe changes that affect the Superset CLI and how Superset is deployed.</p> <h3>New dependencies</h3> <p>Describe any <code>npm</code>/<code>PyPI</code> packages that are required. Are they actively maintained?
What are their licenses?</p> <h3>Migration Plan and Compatibility</h3> <p>Describe any database migrations that are necessary, or updates to stored URLs.</p> <h3>Rejected Alternatives</h3> <p>Describe alternative approaches that were considered and rejected.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/rusackas"><img src="https://avatars.githubusercontent.com/u/812905?v=4" />rusackas</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>We'll need to cover a lot of things that are missing from this SIP: • What packages/licenses are needed, and are they compatible? • What are the security/privacy implications? • How do we (as an open-source solution) stay vendor-agnostic here? What's the abstraction layer?</p> <p>This will need to be put up for a DISCUSS thread on the mailing list to move forward, but I think the proposal needs more detail/resolution.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/surapuramakhil"><img src="https://avatars.githubusercontent.com/u/9161543?v=4" />surapuramakhil</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>• What packages/licenses are needed, and are they compatible? The Python LangChain package, or modules required for making HTTP calls.</p> <p>• What are the security/privacy implications? The user configures the necessary API keys. LLM calls happen through the backend, since the schema needs to be passed to the RAG step for quality responses. Support both approaches: self-hosted (protecting security & privacy) or a provider of choice.</p> <p>• How do we (as an open-source solution) stay vendor-agnostic here? What's the abstraction layer?
We can stay vendor-agnostic by leaving the choice to the user: their preferred mode (self-hosted, LLM as a service, etc.) and also their choice of LLM. As for the abstraction layer, I have found <a href="https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query">https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query</a> in LangChain, which we can use directly.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/surapuramakhil"><img src="https://avatars.githubusercontent.com/u/9161543?v=4" />surapuramakhil</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>Two options (draft); feel free to add what you think. (I probably need to find a way to make this collaborative.)</p> <p>If maintainers can create a sheet, that works; otherwise I can create a spreadsheet for evaluating and suggesting the various approaches and implementation ideas.</p> <ol> <li>Either use plain HTTP requests, where users can configure an endpoint, or use a sophisticated LLM framework like LangChain, which helps when scaling in terms of functionality.</li> </ol> <table> <thead> <tr> <th></th> <th>LLM access using LangChain</th> <th>LLM access using HTTP</th> </tr> </thead> <tbody> <tr> <td>Advantages</td> <td>Scalability when adding/extending features</td> <td>Little code; leaves the LLM to the end user</td> </tr> <tr> <td></td> <td>Supports many LLMs, though switching may require changes</td> <td>User configures an HTTP endpoint, giving them the choice of self-hosted or LLM as a service</td> </tr> <tr> <td>Configuration</td> <td>User configures the necessary API keys/options</td> <td>When defining the HTTP endpoint, the user configures the headers</td> </tr> <tr> <td></td> <td></td> <td>Code solves one particular use case; extensibility is tough</td> </tr> <tr> <td></td> <td>Leverages LangChain</td> <td>We might need to add code that frameworks like LangChain already provide</td> </tr> <tr> <td></td> <td>Provider-agnostic, as it supports almost all providers</td> <td>Provider-agnostic; user configures the endpoint</td> </tr> <tr> <td></td> <td>Fewer changes required, as these are SDKs</td> <td>Requests need to change whenever releases happen, etc.</td> </tr> </tbody> </table> <p><strong>The table above is a draft; I'm dumping my thoughts.</strong></p> <p>Based on my evaluation, using LangChain is better:</p> <ol> <li>It provides a sophisticated framework for working with LLMs (even if not required immediately, it will definitely be useful in future).</li> <li>Another concern I had with LangChain was whether it can use self-hosted LLMs; it seems LangChain supports that, via <code>SelfHostedPipeline</code>, or we can write a generic LLM model (just an HTTP wrapper) for LangChain to access.</li> </ol> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/geido"><img src="https://avatars.githubusercontent.com/u/60598000?v=4" />geido</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>@surapuramakhil thanks. I think it makes sense to update the description with all the new info and make sure you are covering all the technical/architectural considerations. First question that comes to mind: how do you intend to pull the right metadata from the database for the LLM to use?
There is a limited context window, and you just can't pull the whole schema, for both context and performance reasons.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/surapuramakhil"><img src="https://avatars.githubusercontent.com/u/9161543?v=4" />surapuramakhil</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>@geido based on my research, LangChain already solves this: <a href="https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query">https://python.langchain.com/docs/use_cases/sql/quickstart/#convert-question-to-sql-query</a>.</p> <p>They wrote pipelines for generating queries from text. They work with any LLM model; we can just piggyback on them. All I am planning is to have an llm_provider or llm_factory which creates an LLM based on user configuration and sends it to their pipeline.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/surapuramakhil"><img src="https://avatars.githubusercontent.com/u/9161543?v=4" />surapuramakhil</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>@geido, I have updated the description as you suggested.</p> <blockquote> <p>First question that comes to mind, how do you intend to pull the right metadata from the database for the LLM to use? There is a limited context window, and you just can't pull the whole schema for both context and performance limitations.</p> </blockquote> <p>Let's try LangChain and see the results.
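On the context-window concern, one common mitigation is to narrow the schema before it is serialized into the prompt. The sketch below is a standalone illustration only; its naive keyword scoring is an assumption for demonstration, not LangChain's actual retrieval logic (LangChain's large-database guide instead uses techniques such as asking the model itself to select relevant tables).

```python
# Illustration only: rank tables by naive keyword overlap between the user's
# question and table/column names, and keep just a few, so the prompt stays
# within the model's context window. A real system would use embeddings or
# an LLM-driven table-selection step instead of this toy scoring.
def relevant_tables(question: str, schema: dict[str, list[str]], limit: int = 3) -> list[str]:
    """Return at most `limit` table names that share words with the question."""
    words = {w.strip("?,.").lower() for w in question.split()}

    def score(item: tuple[str, list[str]]) -> int:
        table, columns = item
        names = {table.lower(), *(c.lower() for c in columns)}
        return len(words & names)

    ranked = sorted(schema.items(), key=score, reverse=True)
    return [table for table, cols in ranked[:limit] if score((table, cols)) > 0]
```

For a toy schema with orders, customers, and logs tables, the question "total orders per customer" would keep only the orders table, so only its definition gets sent to the model.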
</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/geido"><img src="https://avatars.githubusercontent.com/u/60598000?v=4" />geido</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>It looks more like a toy for now:</p> <blockquote> <p>Has definitions for all the available tables.</p> </blockquote> <p>This won't work for production databases that might have hundreds of tables and columns.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/geido"><img src="https://avatars.githubusercontent.com/u/60598000?v=4" />geido</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>I think having LangChain in the repo might be a nice thing to have to enable LLM-related capabilities. However, that would be a separate SIP to illustrate how LangChain could be leveraged in the repo. It looks like starting from SQL generation is hard.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/surapuramakhil"><img src="https://avatars.githubusercontent.com/u/9161543?v=4" />surapuramakhil</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <blockquote> <p>It looks like starting from SQL generation is hard.</p> </blockquote> <p>Why do you think so? It's the first use case that Apache Superset needs.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/tseruga"><img src="https://avatars.githubusercontent.com/u/5778047?v=4" />tseruga</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>As someone who has actually implemented this exact idea in Superset for a hackathon a few months back, this is a pipe-dream at best (to be fairly blunt).
Using RAG to pull relevant table metadata at prompt-time still led to unmanageable levels of LLM hallucination that only grow worse as the size of the warehouse being queried increases.</p> <p>Something like this may be feasible for a user with a handful of tables, but at scale it simply doesn't work. And a query that is 99% correct is functionally worthless if this is intended to be utilized by folks who don't have the skills necessary to parse through AI-generated SQL.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/surapuramakhil"><img src="https://avatars.githubusercontent.com/u/9161543?v=4" />surapuramakhil</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <blockquote> <p>like this may be feasible for a user with a handful of tables, but at scale it simply doesn't work</p> </blockquote> <p>This is a known problem with language models; that's exactly why the choice of LLM is given to users. If the scale is high, they can do their best with a large-context model like Gemini Pro 1.5. That's a separate data science problem which Apache Superset doesn't need to solve; we just leverage what is available.</p> <hr /> <blockquote> <p>Using RAG to pull relevant table metadata at prompt-time still led to unmanageable levels of LLM hallucination that only grow worse as the size of the warehouse being queried increases.</p> </blockquote> <p>This is a separate data science problem which Apache Superset doesn't need to solve; the LangChain community (quite popular in data science) is currently solving it. We just leverage it.
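Whatever the RAG strategy, the backend can also apply a cheap, non-LLM safety floor before a generated query ever reaches the editor. The sketch below is purely illustrative, an assumption rather than anything Superset or LangChain ships; a real implementation would lean on a proper SQL parser, and LangChain's query-checking chain goes further by having the model review its own output.

```python
# Illustration only: refuse anything but a single read-only statement before
# showing AI-generated SQL to the user. Naive token matching like this can
# be fooled (e.g. keywords inside string literals); a production guard
# would parse the SQL properly instead.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "truncate", "grant"}


def is_safe_select(sql: str) -> bool:
    """Accept one SELECT/WITH statement containing no DML/DDL keywords."""
    statements = [s for s in sql.strip().rstrip(";").split(";") if s.strip()]
    if len(statements) != 1:
        return False
    tokens = statements[0].lower().split()
    if not tokens or tokens[0] not in ("select", "with"):
        return False
    return not (FORBIDDEN & set(tokens))
```

Under this sketch, "SELECT * FROM orders" passes, while "DROP TABLE orders" or a chained "SELECT 1; DELETE FROM t" would be rejected before reaching the user.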
</p> <p>This might protect against hallucination: <a href="https://python.langchain.com/docs/use_cases/sql/query_checking/">https://python.langchain.com/docs/use_cases/sql/query_checking/</a>. Prompting/RAG strategies for working at scale: <a href="https://python.langchain.com/docs/use_cases/sql/large_db/">https://python.langchain.com/docs/use_cases/sql/large_db/</a>.</p> <p>As both evolve over time, the quality of the queries will get better & better.</p> <hr /> <blockquote> <p>a query that is 99% correct is functionally worthless if this is intended to be utilized by folks who don't have the skills necessary to parse through AI-generated SQL.</p> </blockquote> <p>I agree with you on this; it doesn't fully solve the problem for those who lack the knowledge to understand AI-generated SQL. It's a copilot rather than an autopilot.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/surapuramakhil"><img src="https://avatars.githubusercontent.com/u/9161543?v=4" />surapuramakhil</a> commented <strong> 5 months ago</strong> </div> <div class="markdown-body"> <p>Ah, I have found this.
</p> <h1>This is a premium feature of Preset</h1> <p><a href="https://preset.io/blog/preset-ai-assist/">https://preset.io/blog/preset-ai-assist/</a></p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/rusackas"><img src="https://avatars.githubusercontent.com/u/812905?v=4" />rusackas</a> commented <strong> 3 months ago</strong> </div> <div class="markdown-body"> <p>@surapuramakhil do you still intend to move forward with this?</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/ved-kashyap-samsung"><img src="https://avatars.githubusercontent.com/u/34643160?v=4" />ved-kashyap-samsung</a> commented <strong> 2 months ago</strong> </div> <div class="markdown-body"> <p>Hi @rusackas, we have implemented LLM-based query generation for our use case, using a self-hosted model. We have also developed an adapter that can support popular LLM-as-a-service platforms like ChatGPT using API key configurations. What is the process to move forward on this SIP now? Should we create a PR?</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/geido"><img src="https://avatars.githubusercontent.com/u/60598000?v=4" />geido</a> commented <strong> 2 months ago</strong> </div> <div class="markdown-body"> <p>@ved-kashyap-samsung are you on <a href="https://join.slack.com/t/apache-superset/shared_invite/zt-2be0drwz8-bxPfkdz28ozzk1Iox29ufg">Apache Superset Slack</a>? You can find me there as "Diego Pucci". I was the lead engineer for AI Assist for Preset, so I should be able to help with getting the SIP right.
Please get in touch.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/rusackas"><img src="https://avatars.githubusercontent.com/u/812905?v=4" />rusackas</a> commented <strong> 2 months ago</strong> </div> <div class="markdown-body"> <p>@ved-kashyap-samsung I think adding AI to Superset requires a proposal for consensus on the approach. If you want to open a PR with what you have, you're more than welcome to, but it's unlikely it'll get merged without going through a SIP process. You can add your details/approach here if you want to use this SIP, or you can open your own SIP. Please reach out on <a href="http://bit.ly/join-superset-slack">Slack</a> if you'd like assistance.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/rusackas"><img src="https://avatars.githubusercontent.com/u/812905?v=4" />rusackas</a> commented <strong> 3 weeks ago</strong> </div> <div class="markdown-body"> <p>Last call for interested parties to sign up and dial in this proposal. In a couple more weeks, this will have gone 6 months without being brought up for discussion on the ASF mailing list, and will be closed as inactive. Thanks to everyone, however it plays out :)</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/HESyang"><img src="https://avatars.githubusercontent.com/u/70452575?v=4" />HESyang</a> commented <strong> 3 weeks ago</strong> </div> <div class="markdown-body"> <p>Maybe a less exciting idea: could we just use a Chrome extension for this purpose?
It would be much easier to use alongside Superset, without worrying about updates to Superset itself.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/rusackas"><img src="https://avatars.githubusercontent.com/u/812905?v=4" />rusackas</a> commented <strong> 1 week ago</strong> </div> <div class="markdown-body"> <p>You can <em>absolutely</em> use/create a Chrome plugin. I think it could be a Superset plugin if you want to author such a thing, or something more generalized. Either way, you wouldn't need a SIP, but we'd be happy to help evangelize the effort if it comes to fruition.</p> </div> </div>