Github connector to load code files

danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.

https://docs.danswer.dev/

Github connector to load code files #1318

Open kaislar opened 5 months ago

kaislar commented 5 months ago

Hello,

Is there a reason the feature to load the codebase was not implemented for the GitHub connector? Only issues and PRs are loaded.

I worked on an extension of the connector that fetches the code files on a given branch through the GitHub API by looping over all the commits. Is this something that could interest you?

If so, I can submit a PR and get your feedback.
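
To make the idea concrete, here is a minimal sketch of listing the code files on a branch. It is not the extension described above: instead of looping over commits, it reads the branch's full tree in one request through the GitHub REST API Git trees endpoint. The names `list_code_files`, `default_get`, `http_get`, and `CODE_EXTENSIONS` are made up for illustration and are not part of the danswer connector.

```python
import json
import urllib.request
from typing import Callable, Iterable, List

GITHUB_API = "https://api.github.com"
# Illustrative assumption: which file types count as "code" worth indexing.
CODE_EXTENSIONS = (".py", ".ts", ".go", ".md")


def default_get(url: str) -> dict:
    """Fetch a URL and decode its JSON body (unauthenticated, so rate-limited)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_code_files(
    owner: str,
    repo: str,
    branch: str,
    http_get: Callable[[str], dict] = default_get,  # hypothetical injection point
    extensions: Iterable[str] = CODE_EXTENSIONS,
) -> List[str]:
    """Return the paths of all files on `branch` that end in one of `extensions`."""
    # recursive=1 asks the API to expand subdirectories in the same response.
    url = f"{GITHUB_API}/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"
    payload = http_get(url)
    suffixes = tuple(extensions)
    return [
        entry["path"]
        for entry in payload.get("tree", [])
        # "blob" entries are files; "tree" entries are directories.
        if entry.get("type") == "blob" and entry["path"].endswith(suffixes)
    ]
```

Passing the HTTP function in as a parameter keeps the sketch testable without network access. For very large repositories the API marks the response as `truncated`, so a real connector would need a fallback such as walking subtrees one at a time.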

meyudi commented 5 months ago

Hello @kaislar, I want to understand your use case for indexing the codebase. 🤔

Thanks

kaislar commented 5 months ago

Hello. You can ask questions about your codebase, have it explain parts of the code, and so on. It's like Microsoft GitHub Copilot.

dmikulin-dwave commented 4 months ago

I second the request for a GitHub connector that indexes the source code. This is incredibly useful, especially if the code has good doc strings. It is something that the GitHub document loaders for langchain and llama-index already do and is the connector that I value the most in those frameworks. We have a number of repositories consisting of code examples that produce fantastic responses when queried with a RAG populated by these document loaders.

I haven't played with danswer enough yet, but if it exposes an API, it could also be connected to a VS Code plugin (Continue) to enable completely local (air-gapped), Copilot-like capabilities in the IDE.

mattifrind commented 3 months ago

I also like this idea! @dmikulin-dwave, does simple RAG work well for you, or do you do something a little fancier? We also use it for code repositories, but our embedding model sometimes seems to struggle to find the relevant parts for a given question. I thought that standard syntax parsing and retrieving context by identifier (class names, function names, etc.) might work better. Has anyone tried something similar?
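
As a rough illustration of the syntax-parsing idea, the sketch below uses Python's standard `ast` module to split a source file into one chunk per class or function, then looks chunks up by identifier. It covers Python sources only, and `CodeChunk`, `extract_chunks`, and `find_by_identifier` are hypothetical names rather than anything from danswer, langchain, or llama-index.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class CodeChunk:
    name: str       # identifier of the class or function
    kind: str       # "class" or "function"
    docstring: str  # empty string when the definition has no docstring
    source: str     # exact source text of the definition


def extract_chunks(source: str) -> List[CodeChunk]:
    """Parse Python source and return one chunk per class or function definition."""
    tree = ast.parse(source)
    chunks = []
    # ast.walk visits every node, so nested methods are included as well.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append(
                CodeChunk(
                    name=node.name,
                    kind=kind,
                    docstring=ast.get_docstring(node) or "",
                    source=ast.get_source_segment(source, node) or "",
                )
            )
    return chunks


def find_by_identifier(chunks: List[CodeChunk], query: str) -> List[CodeChunk]:
    """Return the chunks whose identifier appears as a word in the query."""
    words = {word.strip("?.,()`'\"").lower() for word in query.split()}
    return [chunk for chunk in chunks if chunk.name.lower() in words]
```

An exact-match lookup like this could run alongside vector retrieval, so that a question that names a class or function always pulls in its definition even when the embedding search misses it.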

dmikulin-dwave commented 3 months ago

@mattifrind: Our open source code and documentation are already easily accessible and easily indexed by one's IDE of choice. However, quantum computing is not an easy new area to pick up; it's a very different paradigm for programming and just knowing the classes and functions, even if you understand what they do well, isn't going to get you very far.

I've created a POC using langchain, document loaders for source code and source code documentation, ChromaDB, and Ollama directly, and the preliminary results have been quite good. I'm optimistic that a number of additional techniques can be layered on top of simple vector retrieval, or used alongside it, to give the LLM better context and get even better responses.