airbytehq / PyAirbyte-Hackathon

Tasks for PyAirbyte Hackathon June 2024

New Source Connector: Jina "Reader API" for web scraping and feeding LLM models #32

Closed by aaronsteers 4 months ago

aaronsteers commented 5 months ago

https://jina.ai/reader/

Overview

The most popular web-scraping source connector right now is Apify. However, this new API from Jina is focused specifically on LLM use cases, and it helpfully outputs markdown, which is easy for both humans and LLMs to work with. It also doesn't (yet) require a paid account.

The goal is to create a connector which could be used by Airbyte users to leverage this API.

Technical spec

You would write a new source connector that connects to the API and retrieves the scraped content, allowing Airbyte users to send this data downstream to any Airbyte destination.
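
To make the idea concrete, here is a rough sketch of the kind of request such a connector might make. It is not the connector's actual implementation. It assumes the Reader API is called by prefixing the target URL with `https://r.jina.ai/`, that sending `Accept: application/json` yields a JSON body, and that the scraped page sits under a `data` object with `url`, `title`, and `content` fields. The function names (`build_reader_request`, `to_record`, `fetch_page`) and the optional bearer-token parameter are hypothetical and chosen only for illustration.

```python
import json
import urllib.request
from typing import Optional

# Assumption: the Reader API is invoked by prefixing the page URL with this endpoint.
READER_PREFIX = "https://r.jina.ai/"


def build_reader_request(target_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build the HTTP request for one page (hypothetical helper)."""
    # Assumption: this header switches the response from markdown text to JSON.
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: an optional key is passed as a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(READER_PREFIX + target_url, headers=headers)


def to_record(payload: dict) -> dict:
    """Flatten the assumed response shape into one record for a downstream destination."""
    data = payload.get("data") or {}
    return {
        "url": data.get("url"),
        "title": data.get("title"),
        "content": data.get("content"),  # assumed to hold the markdown body
    }


def fetch_page(target_url: str, api_key: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Call the API over the network and return a single flattened record."""
    request = build_reader_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return to_record(json.loads(response.read().decode("utf-8")))
```

Calling `fetch_page("https://example.com")` would issue one request and return a dict that a connector could emit as a record. A real connector would also need a spec, stream definitions, and error handling, which this sketch leaves out.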

Notes:

Definition of Done

btkcodedev commented 5 months ago

I would like to take this if possible. This is a REST API, right? Is the response dynamic?

btkcodedev commented 5 months ago

Yes, it has a REST API interface and JSON output. Ref: https://jina.ai/reader/#apiform I would like to take this issue. CC: @marcosmarxm, there is no previous assignee, so I won't miss out this time :man_dancing:

aaronsteers commented 5 months ago

@btkcodedev - It's yours! Thanks for jumping in. I'm excited about this one for sure! Let @marcosmarxm or me know if you have any questions along the way!

btkcodedev commented 5 months ago

Linking PR: https://github.com/airbytehq/airbyte/pull/39515

btkcodedev commented 5 months ago

CC: @marcosmarxm @bindipankhudi @aaronsteers :bow: Thanks!!

bindipankhudi commented 5 months ago

Thank you @btkcodedev! Assigning to @aaronsteers for review.

aaronsteers commented 5 months ago

@btkcodedev - This is looking awesome! I added a few comments + suggestions to the PR. 🚀