Blog post: Scraping web data from Apify source into Airbyte for LangChain

aaronsteers commented 6 months ago

Summary

Many Airbyte users want to scrape data from websites for use with their LLMs. The Apify source can help with this, but few user guides are available so far. The goal of this tutorial is to show users how to use Apify to scrape data, how to set up the Apify source using PyAirbyte, and how to load the data into a vector store using LangChain.

Description

This task involves the following steps:

Set up an Apify account and create a dataset by scraping a web page (you could scrape the Airbyte docs if you like).
Use PyAirbyte to load data from Apify (you can provide the dataset ID for this).
Use LangChain to store the loaded data in a database of your choice.
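
As a rough starting point, the steps above could be sketched as follows. This is only an illustration, not the expected solution. The config keys (`token`, `dataset_id`), the stream name `item_collection_website_content_crawler`, and the record fields `text` and `url` are assumptions that should be checked against the Apify source documentation. `Doc`, `records_to_docs`, and `read_apify_records` are hypothetical names, and `Doc` is a stand-in for LangChain's `Document` class so the conversion step can run without LangChain installed.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Mapping


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def records_to_docs(
    records: Iterable[Mapping[str, Any]],
    text_key: str = "text",      # assumed field holding the scraped page text
    meta_keys: tuple = ("url",),  # assumed fields worth keeping as metadata
) -> list:
    """Turn scraped records into document objects, skipping records without text."""
    docs = []
    for rec in records:
        text = rec.get(text_key)
        if not text:
            continue
        meta = {k: rec[k] for k in meta_keys if k in rec}
        docs.append(Doc(page_content=text, metadata=meta))
    return docs


def read_apify_records(
    token: str,
    dataset_id: str,
    stream: str = "item_collection_website_content_crawler",  # assumed stream name
) -> list:
    """Read one stream from the Apify dataset source via PyAirbyte."""
    import airbyte as ab  # imported lazily so the helper above works without PyAirbyte

    source = ab.get_source(
        "source-apify-dataset",
        config={"token": token, "dataset_id": dataset_id},  # assumed config keys
        install_if_missing=True,
    )
    source.check()  # fails early if the credentials or dataset ID are wrong
    source.select_streams([stream])
    result = source.read()
    return [dict(record) for record in result[stream]]
```

In a real notebook, `Doc` would be replaced by LangChain's own `Document` class, and the resulting list could then be handed to a vector store of your choice, for example through its `from_documents` constructor together with an embedding model.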

Definition of Done

Blog post / Python notebook. When providing a Python notebook, please add a "What / Why / How" blurb at the top to explain what the code is doing.

Resources to Assist

PyAirbyte notebook - https://github.com/airbytehq/quickstarts/tree/main/pyairbyte_notebooks (check out the RAG examples for LangChain usage)
Apify source documentation: https://docs.airbyte.com/integrations/sources/apify-dataset
Apify example from previous blog post: https://airbyte.com/tutorials/chat-with-your-data-using-openai-pinecone-airbyte-and-langchain#step-6-additional-data-source-scrape-documentation-website

vspanxcode commented 5 months ago

Hello @aaronsteers, I am interested in this issue. Can you assign it to me? Should I post the blog on Medium or Hashnode?

marcosmarxm commented 5 months ago

It is yours! @Jeeesrw322, we're going to post it on the Airbyte blog, but you can also post it on other sites.

vspanxcode commented 5 months ago

@aaronsteers @bindipankhudi Do we need to use PyAirbyte and LangChain for this? I did not get the full picture. Can you explain this further?

bindipankhudi commented 5 months ago

@Jeeesrw322 I have added more details to the ticket. Hope it is clear now. Please do not hesitate to reach out if you have more questions.

avirajsingh7 commented 5 months ago

@Jeeesrw322 if you have not started working on this issue, can I take this one? I have already worked on a very similar issue.

vspanxcode commented 5 months ago

@avirajsingh7 Yes, you can take this issue

avirajsingh7 commented 5 months ago

@bindipankhudi please assign this issue to me. I will be creating a PR soon.

bindipankhudi commented 5 months ago

@avirajsingh7 assigned to you.

bindipankhudi commented 5 months ago

Linking PR: https://github.com/airbytehq/quickstarts/pull/116

airbytehq / PyAirbyte-Hackathon