Samagra-Development / ai-tools

AI Tooling to bootstrap applications fast
44 stars 110 forks

Create a vector search from youtube audio transcripts #289

Closed Gautam-Rajeev closed 8 months ago

Gautam-Rajeev commented 9 months ago

Description

Be able to parse all the videos from a YouTube channel or YouTube playlist, extract transcripts from their audio, and embed them in a vector DB to enable search and retrieval over them.

Implementation Details

It will include the following:

- youtube-dl (https://github.com/ytdl-org/youtube-dl) can be used for scraping.
- https://www.youtube.com/@3blue1brown can be used as the initial test set for the above.
- The ticket for using ColBERT is covered here; you only need to make it work locally here using the notebook.
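To make the end goal concrete, here is a minimal sketch of the search step over transcript chunks. It is an illustration under loud assumptions, not the ticket's implementation: the `embed` function is a toy bag-of-words vector standing in for a real embedding model or ColBERT, the in-memory `TranscriptIndex` class stands in for a vector DB, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TranscriptIndex:
    """In-memory index of transcript chunks, searched by brute force (hypothetical)."""

    def __init__(self) -> None:
        self.items = []  # list of (video_id, chunk_text, vector)

    def add(self, video_id: str, chunk: str) -> None:
        """Embed one transcript chunk and store it with its video id."""
        self.items.append((video_id, chunk, embed(chunk)))

    def search(self, query: str, k: int = 3) -> list:
        """Return the k best (score, video_id, chunk) rows for the query."""
        q = embed(query)
        scored = [(cosine(q, vec), vid, chunk) for vid, chunk, vec in self.items]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

Swapping `embed` for a real model and `TranscriptIndex` for an actual vector store would keep the same add/search shape.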

Product Name

AI Tools

Organization Name

SamagraX

Domain

NA

Tech Skills Needed

PyTorch, Python, ML

Category

Feature

Mentor(s)

@GautamR-Samagra

Complexity

Medium

c4gt-community-support[bot] commented 9 months ago

Hi! Important Details - The following details are helpful for contributors to effectively identify and contribute to tickets.

Please update the ticket

Neelesh2512 commented 9 months ago

Guys, any one of you can contribute. Let's not wait for approval. We can start working and raise a PR whenever we want 🙌🏻

Gautam-Rajeev commented 9 months ago

Hi all. Glad to see the enthusiasm here :) You don't have to ask permission to begin working on tickets. Please raise PRs and comment links to the PRs here. I won't be assigning the ticket to anyone for now.

ChakshuGautam commented 9 months ago

Hey team. Please raise a draft PR that we can review to see if everyone is going in the right direction. Thanks.

kartikf4 commented 9 months ago

@ChakshuGautam I'm facing this issue while working in a Colab environment:

DownloadError: ERROR: Unable to extract uploader id; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.

I have updated multiple times and tried other versions, but it's still not working for me. Using yt-dlp for the same task works well, to a certain extent. Should I continue with yt-dlp?
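For reference, a minimal sketch of listing the videos of a channel or playlist with yt-dlp, without downloading any media. It assumes `yt-dlp` is installed and that the URL is a playlist or a channel's videos tab. The helper names are hypothetical, and the nested-entry handling is a precaution for channel URLs that return one playlist per tab.

```python
from typing import Iterator


def iter_video_entries(info: dict) -> Iterator[dict]:
    """Yield leaf video entries from a (possibly nested) yt-dlp info dict."""
    entries = info.get("entries")
    if entries is None:
        yield info  # a single video rather than a playlist
        return
    for entry in entries:
        if entry:  # skip empty placeholders for unavailable items
            yield from iter_video_entries(entry)


def list_videos(url: str) -> list:
    """Return (id, title) pairs for a channel or playlist URL."""
    import yt_dlp  # imported lazily; assumes `pip install yt-dlp`

    # extract_flat asks yt-dlp to list entries without resolving each video.
    options = {"extract_flat": True, "quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(options) as ydl:
        info = ydl.extract_info(url, download=False)
        return [(e.get("id"), e.get("title")) for e in iter_video_entries(info)]
```

Something like `list_videos("https://www.youtube.com/@3blue1brown/videos")` would then give the ids to feed into the transcript step.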

ChakshuGautam commented 9 months ago

@kartikf4 Is this happening on non colab env as well? Any alternatives to this package that you tried out?

kartikf4 commented 9 months ago

> @kartikf4 Is this happening on non colab env as well? Any alternatives to this package that you tried out?

@ChakshuGautam Well, I didn't try it in a local env, but I did try the alternative, yt-dlp. Check here.

ChakshuGautam commented 9 months ago

It probably has something to do with Colab. Let's do it locally.

rachitavya commented 9 months ago

> @kartikf4 Is this happening on non colab env as well? Any alternatives to this package that you tried out?

> @ChakshuGautam Well, I didn't try it in a local env, but I did try the alternative, yt-dlp. Check here.

Hey @kartikf4, this one is doing fine here.

anshuvermaa commented 9 months ago

Hi, I want to contribute to this. Can you assign it to me?

Gautam-Rajeev commented 9 months ago

@ChakshuGautam https://pypi.org/project/youtube-transcript-api/ gives the transcripts for all videos in English/Hindi (from the auto-generated CC). Can we clarify the merits of extracting the audio and transcribing it separately, compared with what the above already provides? Do we want to do that for Indian-language videos?
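A minimal sketch of the caption-based route, for comparison. It assumes a pre-1.0 release of youtube-transcript-api, where `YouTubeTranscriptApi.get_transcript` is a class method returning dicts with `text`, `start` and `duration`; newer releases changed this interface, so treat the call as an assumption. The `chunk_segments` helper and its `max_chars` parameter are hypothetical and only show how short caption lines could be merged before embedding.

```python
def fetch_segments(video_id: str, languages=("en", "hi")) -> list:
    """Fetch caption segments for one video (assumes youtube-transcript-api < 1.0)."""
    from youtube_transcript_api import YouTubeTranscriptApi  # lazy import

    return YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))


def chunk_segments(segments, max_chars: int = 500) -> list:
    """Merge consecutive caption segments into chunks of at most max_chars characters.

    Each chunk keeps the start time of its first segment, so that a search hit
    can point back into the video. A single overlong segment becomes its own chunk.
    """
    chunks, texts, size, start = [], [], 0, None
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue
        # Flush the current chunk if adding this segment would exceed the limit.
        if texts and size + len(text) + 1 > max_chars:
            chunks.append({"start": start, "text": " ".join(texts)})
            texts, size, start = [], 0, None
        if start is None:
            start = seg["start"]
        texts.append(text)
        size += len(text) + 1  # +1 accounts for the joining space
    if texts:
        chunks.append({"start": start, "text": " ".join(texts)})
    return chunks
```

Captions are only available where YouTube has them, so a separate audio transcription step may still be needed for videos without captions in the wanted language.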

xorsuyash commented 9 months ago

@ChakshuGautam, @GautamR-Samagra on the further improvement on the issue

ChakshuGautam commented 9 months ago

@xorsuyash Can you share a draft PR anyway, so that we can review in chunks?

xorsuyash commented 9 months ago

@ChakshuGautam Raised a draft PR.

rachitavya commented 9 months ago

Hey @xorsuyash,

Let's drop the vector and ColBERT part until the issue is resolved. For now, we'll keep it simple.

Single API: param: YouTube video link; response: transcript.json
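A rough sketch of what that single call could look like, assuming the transcript fetcher is passed in as a function. The JSON layout (`video_id`, `segments` with `start` and `text`) and both function names are hypothetical, since the thread does not specify the shape of transcript.json.

```python
import json
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def video_id_from_link(link: str) -> str:
    """Extract the video id from a watch URL or a youtu.be short link."""
    parsed = urlparse(link)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"not a recognised YouTube video link: {link!r}")
    return video_id


def transcript_response(link: str, fetch_segments) -> str:
    """Build the transcript.json body for one video link.

    fetch_segments is any callable mapping a video id to a list of dicts
    that have at least 'start' and 'text' keys.
    """
    video_id = video_id_from_link(link)
    segments = fetch_segments(video_id)
    payload = {
        "video_id": video_id,
        "segments": [{"start": s["start"], "text": s["text"]} for s in segments],
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

Keeping the fetcher injectable means the endpoint does not care whether the text comes from auto-generated captions or from a separate audio transcription step.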

Also, I have some questions:

xorsuyash commented 9 months ago

@rachitavya

Gautam-Rajeev commented 8 months ago

@xorsuyash Thanks for completing this.

cc: @Shruti3004 , @ChakshuGautam