arc53 / DocsGPT

Chatbot for documentation, that allows you to chat with your data. Privately deployable, provides AI knowledge sharing and integrates knowledge into your AI workflow
https://app.docsgpt.cloud/
MIT License
15.05k stars 1.61k forks source link

Twitter ingestion #1225

Open shatanikmahanty opened 1 month ago

shatanikmahanty commented 1 month ago

🔖 Feature description

Add new remote ingestion method from Twitter

🎤 Why is this feature needed ?

It will allow users to ingest data from Twitter

✌️ How do you aim to achieve this?

I plan to use

https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.twitter.TwitterTweetLoader.html

🔄️ Additional Information

No response

👀 Have you spent some time to check if this feature request has been raised before?

Are you willing to submit PR?

Yes I am willing to submit a PR!

shatanikmahanty commented 1 month ago

@dartpain may I work on this. I will try to achieve this by following examples and testing it out over the coming week!

dartpain commented 1 month ago

Seems like a cool idea. I do have one suggestion to make this a killer feature.

Most people might scrape X/twitter once in a while. But what if we do it similarly to https://github.com/arc53/DocsGPT/blob/main/application/retriever/brave_search.py

Such that instead of ingesting data into similarity search vectordb we can create a search query to X/twitter and analyze current data.

shatanikmahanty commented 1 month ago

@dartpain seems like a really cool addition. Will investigate the suggested integration and start on this by mid next week. Will keep updating the status here!

shatanikmahanty commented 1 month ago

@dartpain after careful review of the requirements I found out that langchain doesn't have twitter search. They had an open issue in which they mentioned it won't be implemented because of pricing related concerns. Attaching the link to the same: https://github.com/langchain-ai/langchain/issues/11538

Although search can be integrated through using the Twitter search API, I have one concern is how will we process the question put forward by the user as a prompt. In case of LangChain we use the run method on the search result. If we go with the twitter API, is there anything similar we can do?

dartpain commented 1 month ago

I suggest you even use llm to genrate a search query and then use it in the search api

shatanikmahanty commented 1 month ago

I suggest you even use llm to generate a search query and then use it in the search api

I see, thanks for the suggestion. I will use it accordingly and generate search queries. Once we are done with generating search results, I plan to pass that to the LLM again and summarise search results to give a readable answer

shatanikmahanty commented 1 month ago

@dartpain I was trying to use the classic rag to generate a twitter query in my local, but it kept on generating the same output of project contribution guide and some other stuff that pointed to github of DocsGPT. By using LLM did you mean something else?

dartpain commented 1 month ago
  1. Check out this, https://github.com/arc53/DocsGPT/tree/main/application/retriever you will need to create a separate file here. while testing / experimenting I suggest you change classic rag.
  2. You will see that it uses LLM abstract class there, thats what I meant.

thank you!

shatanikmahanty commented 1 month ago

@dartpain thanks for the additional context on LLMs. I was able to generate a search term for Twitter using the LLM, but on trying to access the twitter api I found out that it can only be used by paid plan subscribers. If anyone is willing to provide me an api key to test with I can create a PR. Meanwhile I will draft a PR with my current work and highlight the blockers so that in case anyone with access to paid api wants to continue with the rest of the PR they can go ahead. Thanks again for letting me work on this 🚀