airbytehq / PyAirbyte-Hackathon

Tasks for PyAirbyte Hackathon June 2024

New Source Connector: 🤗 "Hugging Face Datasets" (optionally via DuckDB 🦆 ) #30

Open aaronsteers opened 5 months ago

aaronsteers commented 5 months ago

Overview

This blog post came out two weeks ago, announcing a new feature: DuckDB can now read Hugging Face datasets directly using the hf:// URI prefix.

We think this would make an awesome connector for users in our community.

https://duckdb.org/2024/05/29/access-150k-plus-datasets-from-hugging-face-with-duckdb.html

Technical spec

You would write a new source connector which can connect to Hugging Face source datasets and emit records from them, allowing Airbyte users to send these to any Airbyte destination.
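
As a rough illustration of the idea above (not an agreed design), the core read path could be sketched in Python as follows. The function names `hf_uri`, `iter_rows`, `to_record_message`, and `read_hf_dataset` are hypothetical and not part of any Airbyte CDK; the record shape follows the general Airbyte `RECORD` message layout, and the sketch assumes DuckDB can resolve `hf://` paths as described in the blog post.

```python
import json
import time
from typing import Any, Dict, Iterator


def hf_uri(repo_id: str, file_path: str) -> str:
    """Build an hf:// URI of the form hf://datasets/<user>/<dataset>/<file>."""
    return f"hf://datasets/{repo_id}/{file_path}"


def iter_rows(conn: Any, query: str, batch_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield query results as dicts from any DB-API style connection.

    Rows are fetched in batches so a large dataset is not loaded at once.
    """
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield dict(zip(columns, row))


def to_record_message(stream: str, data: Dict[str, Any]) -> str:
    """Wrap one row in an Airbyte-style RECORD message (one JSON line)."""
    message = {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            "emitted_at": int(time.time() * 1000),  # epoch milliseconds
        },
    }
    # default=str is a simplification for dates, decimals, and similar values.
    return json.dumps(message, default=str)


def read_hf_dataset(repo_id: str, file_path: str, stream: str) -> Iterator[str]:
    """Hypothetical entry point: read one dataset file through DuckDB."""
    import duckdb  # third-party; imported lazily so the helpers work without it

    conn = duckdb.connect()
    # Assumption: repo_id and file_path come from trusted connector config,
    # since the URI is interpolated directly into the SQL text.
    query = f"SELECT * FROM '{hf_uri(repo_id, file_path)}'"
    for row in iter_rows(conn, query):
        yield to_record_message(stream, row)
```

Calling `read_hf_dataset("datasets-examples/doc-formats-csv-1", "data.csv", "doc_formats")` and printing each returned line would be one way to try it, assuming network access and a DuckDB version with `hf://` support. A real connector would also need the spec, check, and discover steps, which this sketch leaves out.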

Notes:

Definition of Done

ombhardwajj commented 5 months ago

@aaronsteers I am interested in working on this and also willing to work on #31 which is closely related to this! Please assign it to me!

aaronsteers commented 5 months ago

Awesome! You are the first to chime in, so I think this one is yours! Can you also drop a comment in the other issue? (GitHub won't let me assign otherwise.)

ombhardwajj commented 5 months ago

@aaronsteers I've started working on this issue and have begun building a connector for Hugging Face datasets with the Python CDK. I just wanted to confirm that this issue and #31 count as feature contributions, because I was recently not assigned #20 in quickstarts (probably due to confusion, as issues #30 and #31 are currently in the No Hackathon category). I had been waiting five days to get it assigned, since even before this one was assigned to me!

aaronsteers commented 4 months ago

Hi, @ombhardwajj . I apologize for any confusion. I've put this and #31 into the Feature Contributions categories.

Do you need any assistance on this item or on #31?

ombhardwajj commented 4 months ago

@aaronsteers Thanks for the concern. Regarding #31, I am going to solve this issue first and then start on #31. I am currently facing some dependency "conflicts", so I was thinking of switching to low-code instead of the Python CDK. Does that work for you? Otherwise I'll give it another try...

ombhardwajj commented 4 months ago

Over the past week, I tried to build this but, unfortunately, I have been facing some errors. Despite my efforts to resolve them, I have not been successful. Therefore, I am un-assigning myself from this issue.

bala-ceg commented 4 months ago

Hi @aaronsteers, can I work on this issue?

aaronsteers commented 4 months ago

@ombhardwajj - I understand. Thanks for looping back.

@bala-ceg - If you still want to pick this up, it is yours. 👍

bala-ceg commented 4 months ago

@marcosmarxm @aaronsteers Can you please let me know which connector development method I should follow: the Python CDK or the low-code CDK?

marcosmarxm commented 4 months ago

Low-code if possible, but if it isn't, you need to use the Python CDK.