arc53 / DocsGPT

Chatbot for documentation, that allows you to chat with your data. Privately deployable, provides AI knowledge sharing and integrates knowledge into your AI workflow
https://app.docsgpt.cloud/
MIT License

Support for remote info stores like websites, Confluence, SharePoint, etc. #27

Open emanueol opened 1 year ago

emanueol commented 1 year ago

Or must all files exist locally?

In the real world of large enterprises, there's a Confluence server, a Jira server, and a SharePoint server that typically reside in a data center or as a SaaS cloud, plus some on-prem custom HTML, Excel files, etc.

A ChatGPT-style ingest that could compute searches across remote systems would be great. How feasible is this? Thanks

dartpain commented 1 year ago

That's a good feature we can start working on. As long as we can prep the data in a neat and readable format, we can ingest it all. But I do think we have to vectorise all this data first, or at least summarise it.

terrafying commented 1 year ago

I was able to pull Confluence data by exporting the whole space as HTML (in a zip), extracting it to the DocsGPT folder, changing the glob pattern in ingest_rst.py to point to the extracted HTML files, and using BeautifulSoup to pull out the text content inside the main content tags (soup.find_all('div', attrs={"class": "wiki-content group"})). I couldn't get the API to download a whole space at once, but the manual method worked decently, as long as there isn't a huge number of spaces to deal with.
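For readers who want to try this without installing BeautifulSoup, here is a dependency-free sketch of the same idea using the standard library's html.parser. The class name `WikiContentExtractor` is illustrative; the `"wiki-content group"` class is the one the comment above targets.

```python
from html.parser import HTMLParser

class WikiContentExtractor(HTMLParser):
    """Collect text inside <div class="wiki-content group"> blocks,
    mirroring soup.find_all('div', attrs={"class": "wiki-content group"})."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth while inside a matching div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1          # nested div inside a match
            elif dict(attrs).get("class") == "wiki-content group":
                self.depth = 1           # entered a matching div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def extract_confluence_text(html: str) -> str:
    """Return the concatenated text content of all matching divs."""
    parser = WikiContentExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The same function can then feed the extracted text into the existing ingestion path instead of raw HTML.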

emanueol commented 1 year ago

I suppose there are two things:

initial export (Confluence allows exporting a space in a couple of different formats)

incremental updates (Jira supports webhooks, sending JSON via HTTP POST to some listener)

Or maybe it's easier to prep a clone so your GPT stuff doesn't interfere with users, but it all boils down to file formats.
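The incremental-update idea above can be sketched as a tiny webhook listener. The field names (`webhookEvent`, `issue`, `key`, `fields.summary`) follow Jira's webhook payload shape; treat them as assumptions for other systems, and `summarize_event` is a hypothetical hand-off point to re-ingestion.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def summarize_event(payload: dict) -> str:
    """Reduce a Jira-style webhook payload to a one-line change record."""
    issue = payload.get("issue", {})
    return "{}: {} - {}".format(
        payload.get("webhookEvent", "unknown"),
        issue.get("key", "?"),
        issue.get("fields", {}).get("summary", ""),
    )

class WebhookHandler(BaseHTTPRequestHandler):
    """Accept JSON POSTs and acknowledge them with 204 No Content."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        print(summarize_event(payload))  # hand off to re-ingestion here
        self.send_response(204)
        self.end_headers()

# To actually listen:
# HTTPServer(("", 8080), WebhookHandler).serve_forever()
```

A real listener would queue the affected page for re-fetching and re-embedding rather than printing.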

I'm not a specialist in the internals of Confluence, Jira, etc., but I'm part of those 0.001% that care about centralizing knowledge for both tech people (the majority) and business/stakeholders. It's a very interesting point of discussion how to decide the organization of information, but I suppose some basic high-level metadata could be inserted on the source pages (tags, etc.) to help with categorization, at least to avoid showing code to a business manager when he's just looking for the five business rules agreed with Development in some project, etc.

dartpain commented 1 year ago

I think what needs to be built here is just a module for our parser: basically something that loads data and converts it into one of (.rst, .md, .pdf, .docx, .csv, .epub, .html), as DocsGPT already loads these files with ease. It's just the method of scraping files that we need to implement here.
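The module shape described above could look like the following minimal sketch: each remote source only has to fetch its content, and a shared base class writes it out in a format DocsGPT already ingests. The class and method names (`RemoteLoader`, `fetch`, `dump`) are illustrative, not DocsGPT's actual API.

```python
from abc import ABC, abstractmethod
from pathlib import Path

class RemoteLoader(ABC):
    """Base class: subclasses fetch remote content; dump() writes it
    as local files in a format the existing ingestion pipeline accepts."""

    @abstractmethod
    def fetch(self) -> dict:
        """Return {document_name: raw_text} from the remote system."""

    def dump(self, out_dir: str, ext: str = ".md") -> list:
        """Write each fetched document to out_dir for ingestion."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        paths = []
        for name, text in self.fetch().items():
            path = out / (name + ext)
            path.write_text(text, encoding="utf-8")
            paths.append(path)
        return paths

class InMemoryLoader(RemoteLoader):
    """Trivial loader used to exercise the interface."""

    def fetch(self):
        return {"readme": "# Hello\nRemote content."}
```

A Confluence or SharePoint subclass would implement only `fetch`, keeping the file-writing and downstream ingestion unchanged.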

tardigrde commented 1 year ago

This feature would be a killer. TBH I gave it a try, and it's a more complex problem than I thought. The OpenAI example would try to fetch every URL on the website and put every text in embeddings. The problem is that there is a lot of gibberish, non-relevant text on these webpages.

We could start with something simple, like Wikipedia. There should be some good projects already doing web scraping on Wikipedia in Python. But then again, I think it's a big development effort. I think other projects like AutoGPT already do something similar.
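One cheap way to attack the "gibberish" problem mentioned above is a heuristic filter that drops navigation and footer fragments before embedding. The thresholds below are illustrative guesses, not tuned values.

```python
def looks_like_content(block: str, min_words: int = 8) -> bool:
    """Heuristic: keep prose paragraphs, drop nav/footer fragments.

    Short blocks are usually menu items or button labels, and prose
    tends to end with sentence punctuation while menus do not.
    """
    words = block.split()
    if len(words) < min_words:
        return False
    return block.rstrip().endswith((".", "!", "?", ":"))

def clean_page(blocks: list) -> list:
    """Filter a list of extracted text blocks down to likely content."""
    return [b for b in blocks if looks_like_content(b)]
```

This is only a first pass; readability-style extraction (scoring DOM nodes by text density) would do better on messy pages.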

tardigrde commented 1 year ago

FYI: this nice repo implements loading text from YouTube videos as well as websites: https://github.com/embedchain/embedchain

KennyDizi commented 1 year ago

Could Deeplake be a good fit? https://github.com/activeloopai/deeplake

dartpain commented 1 year ago

I think this would fit more of an auto-fine-tune situation. We need a more general solution, so that we can ingest data into different vector stores in case users want to use FAISS or Elasticsearch or Pinecone...

dartpain commented 1 year ago

Also, @pabik is already working on it in the feature/remote-loads branch. We will also need to build the UI in parallel.

thefoodiecoder commented 7 months ago

LlamaIndex provides support for ingestion from these sources. We can look into either integrating or porting these.

All loaders - https://llamahub.ai/?tab=readers

dartpain commented 7 months ago

Currently we have some remote loaders present, but not Confluence and SharePoint. Check them out here: https://github.com/arc53/DocsGPT/tree/main/application/parser/remote If you want to contribute, we would be very happy!
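For anyone picking up the Confluence loader, a starting point could be the Confluence REST content API (`/rest/api/content` with `expand=body.storage`). The network call below is an untested sketch, and the auth scheme and field names (`results`, `title`, `body.storage.value`) should be verified against your Confluence version; only the response-parsing helper is exercised here.

```python
import json
from urllib.request import Request, urlopen

CONTENT_URL = "{base}/rest/api/content?type=page&expand=body.storage&limit={limit}"

def pages_from_response(response_json: dict) -> dict:
    """Map a Confluence content-API response to {title: storage_html}."""
    return {
        page["title"]: page["body"]["storage"]["value"]
        for page in response_json.get("results", [])
    }

def fetch_pages(base_url: str, token: str, limit: int = 25) -> dict:
    """Untested network sketch: pull pages with a bearer token."""
    req = Request(
        CONTENT_URL.format(base=base_url, limit=limit),
        headers={"Authorization": "Bearer " + token},
    )
    with urlopen(req) as resp:
        return pages_from_response(json.load(resp))
```

The returned storage-format HTML could then be fed through the same HTML-to-text step the existing loaders use.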

thefoodiecoder commented 7 months ago

Python isn't my forte, but let me see if someone from my Data Science team is willing to contribute.