LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Download transcripts of khan academy #184

Closed huu4ontocord closed 1 year ago

huu4ontocord commented 1 year ago

Could someone with experience in YouTube or other video scraping create transcripts of free/open-source lectures from YouTube or other video sites, preferably ones with a single speaker? For example: https://www.youtube.com/@khanacademy/playlists

Khan Academy is a 501(c)(3) nonprofit organization with the mission of providing a free, world-class education for anyone, anywhere. Our interactive practice problems, articles, and videos help students succeed in math, biology, chemistry, physics, history, economics, finance, grammar, and many other topics.

In this kind of format potentially https://karpathy.ai/lexicap/

You could use youtube-dl or https://github.com/m1guelpf/yt-whisper

Please connect with rallio and share the results.

Shtoner commented 1 year ago

Sounds interesting. I've never scraped video before, but I'm about to look into it now. Will post if I get a solution. Anyone else with a concrete solution, feel free to pick this up, though.

CryptoFewka commented 1 year ago

Would the CreativeCommons NonCommercial ShareAlike license on their videos be a problem? https://support.khanacademy.org/hc/en-us/articles/202262954-Can-I-use-Khan-Academy-s-videos-name-materials-links-in-my-project-

It seems like the ShareAlike license on the Khan Academy videos might entangle the transcripts and things that incorporate them. https://www.theregister.com/2022/10/19/github_copilot_copyright/

shreydan commented 1 year ago

@ontocord there's a pip package, youtube-transcript-api, which lets us fetch the transcript of a video along with timestamps.

gokaykucuk commented 1 year ago

I think it's also possible to use whisper https://openai.com/blog/whisper/ for getting transcripts. I'll give it a try.
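
For reference, a minimal sketch of what that might look like with the openai-whisper Python package, assuming the audio has already been downloaded to a local file (the file name below is just a placeholder):

import whisper

# load one of the pretrained checkpoints; "base" is a reasonable speed/quality trade-off
model = whisper.load_model("base")

# transcribe a locally downloaded audio file (placeholder path)
result = model.transcribe("khan_academy_lecture.mp3")

print(result["text"])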

jon-chun commented 1 year ago

Working on this for another project right now. Here is some code for this functionality:

# importing modules
from youtube_transcript_api import YouTubeTranscriptApi

# srt is a list of dictionaries (one per subtitle line, with 'text',
# 'start' and 'duration' keys) returned by .get_transcript()
yt_id = "Sqqt1kU52I8"
srt = YouTubeTranscriptApi.get_transcript(yt_id)

# create or overwrite a file named after the video id and
# write the transcript into it
filename_out = f'yt_subtitles_{yt_id}.txt'
with open(filename_out, "w") as f:
    # iterating through each element of the list srt
    for line in srt:
        # write only the spoken text of each element on a new line
        f.write("{}\n".format(line['text']))

jon-chun commented 1 year ago

You'll also need to get all the video IDs if you want to scrape a particular YouTube user's channel:

import pandas as pd
import requests
import datetime

api_key = '***'
channel_id = 'UCXDi1F7Q-cJQ4mGGavtfYRQ'
channel_id = 'UCvShfJtvC2owV0AFi_qyykA' # The Helix Center

# build dataframe
df = pd.DataFrame(columns=['channel_id',
                           'video_id',
                           'video_title',
                           'published_date',
                           'type'])

# first request
my_url = 'https://youtube.googleapis.com/youtube/v3/search?part=snippet&channelId=' + channel_id + '&maxResults=50&order=date&type=video&key=' + api_key
response = requests.get(url=my_url).json()
print(my_url)
total_results = response['pageInfo']['totalResults']

# save the channel_id and video_id in a dataframe
for i in response['items']:

    channel_id = i['snippet']['channelId']
    video_id = i['id']['videoId']
    published_date = i['snippet']['publishedAt']
    video_title = i['snippet']['title']
    vid_type = i['id']['kind']

    df = pd.concat([df, pd.DataFrame([{
        'channel_id': channel_id,
        'video_id': video_id,
        'video_title': video_title,
        'published_date': published_date,
        'type': vid_type
    }])], ignore_index=True)

# page backwards through the channel: request videos published before the
# oldest publish date seen so far; stop once the last two rows of df are
# identical, i.e. a page returned no new videos
while df['video_id'][len(df)-1] != df['video_id'][len(df)-2]:
    url = 'https://youtube.googleapis.com/youtube/v3/search?part=snippet&channelId=' + channel_id + '&maxResults=50&order=date&type=video&publishedBefore=' + published_date + '&key=' + api_key

    response = requests.get(url=url).json()
    total_results = response['pageInfo']['totalResults']

    for i in response['items']:
        channel_id = i['snippet']['channelId']
        video_id = i['id']['videoId']
        published_date = i['snippet']['publishedAt']
        video_title = i['snippet']['title']
        vid_type = i['id']['kind']

        df = pd.concat([df, pd.DataFrame([{
            'channel_id': channel_id,
            'video_id': video_id,
            'video_title': video_title,
            'published_date': published_date,
            'type': vid_type
        }])], ignore_index=True)

# because the last row is a duplicate we need to delete the last row
df.drop(df.tail(1).index, inplace=True)

# df.to_csv('C:\\Users\\...\\data\\video_ids_' + datetime.datetime.now().strftime('%Y-%m-%d') + '.csv')

df.to_csv('./data/video_ids_' + datetime.datetime.now().strftime('%Y-%m-%d') + '.csv')

jon-chun commented 1 year ago

I am currently comparing OpenAI Whisper transcripts against YouTube's auto-generated transcripts.

Research suggests that OpenAI Whisper is better, but this is largely hearsay. I suspect Google is constantly upgrading its transcription models, so any comparison may quickly go out of date.
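
For a quantitative comparison, one option is word error rate against a manually checked reference transcript. A minimal sketch using the jiwer package (an assumption on my part, not something already in use here; the file names are placeholders):

import jiwer

# load a manually verified reference transcript plus the two automatic ones
# (all three file names below are placeholders)
with open("reference.txt") as f:
    reference = f.read()
with open("whisper_transcript.txt") as f:
    whisper_hypothesis = f.read()
with open("youtube_transcript.txt") as f:
    youtube_hypothesis = f.read()

# lower word error rate = closer to the reference
print("Whisper WER:", jiwer.wer(reference, whisper_hypothesis))
print("YouTube WER:", jiwer.wer(reference, youtube_hypothesis))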

futurisold commented 1 year ago

I came here to post an issue about this idea; seems you guys are already on track. What I had in mind is Whisper plus a lot of podcast content, as it's one of the low-hanging fruits in terms of Q&A. Let's see what @yk has to say about this. I envision us having a working prototype that extracts Q&A from speech.
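
Just to make the idea concrete, here is a very naive, purely illustrative sketch of extracting Q&A pairs from a transcript. It assumes the transcript is already punctuated (e.g. by Whisper) and simply pairs each question sentence with the sentence that follows it:

import re

def extract_qa_pairs(transcript: str):
    # crude sentence split on ., ! and ? while keeping the terminator
    sentences = [s.strip() for s in re.findall(r'[^.!?]+[.!?]', transcript)]

    pairs = []
    for i, sentence in enumerate(sentences[:-1]):
        # treat any sentence ending in '?' as a question and the next sentence as its answer
        if sentence.endswith('?'):
            pairs.append((sentence, sentences[i + 1]))
    return pairs

example = "What is a derivative? It measures the rate of change of a function. Let's move on."
print(extract_qa_pairs(example))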

musabgultekin commented 1 year ago

@leoentersthevoid That is a wonderful idea.

In fact, I just looked into the possibility, and this is definitely feasible.

I wonder, from a licensing perspective, whether we can use YouTube-based podcasts or scrape podcast audio data from platforms like Apple/Google Podcasts for training.

yk commented 1 year ago

@christophschuhmann do you have ideas about what the license situation of transcribing YT videos and podcasts looks like?

Also, I definitely see potential in collecting diverse QA data. Podcasts, interviews, and the like seem like a good source, except they might be a bit too lengthy and too informal, but I guess that can be fixed.

We just need to make sure that we also go beyond this QA data. ChatGPT can not only answer questions, but also write emails & code, be your friend, etc., and the task diversity of the training data needs to reflect that.

Panda20090 commented 1 year ago

This concept would be beneficial as part of the whole if the chatbot took data from the user and related new material to information the user already has a base understanding of. It could then find related concepts within the scraped educational videos and pass new information to the user, so they can quickly grasp material they normally wouldn't look for. I believe learning by relating new material to what you already know is the fastest and most efficient way to learn. If the user's information were already stored in a secure location, scraped from all sources the user agrees to, it would speed this up dramatically.

Alternatively, there could be a platform/webpage that individual users could link accounts to, speeding up the data harvest, with the data stored there across iterations. It could be equipped with web tools to show a user's progress and available learning paths based on the information available.

yk commented 1 year ago

@Panda20090 Fully agree. The best place for this might actually be once we add retrieval to the assistant; then the entire licensing problem also vanishes.

huu4ontocord commented 1 year ago

@marianna13 has code for downloading YT subtitles too and has already scraped some. We need this data to create augmented Q/A training data for immediate experiments. @Shtoner, please discuss with @Rallio67 in the Discord whether you can get this data to him.

Re long-term plans for scraping infra, @Panda20090, please open another issue, or better, discuss in the LAION/video2dataset repo and Discord channel. Very cool ideas.

Re licensing, it depends on the country. Where LAION is located, as a research non-profit it is using the text-and-data-mining exception, as I am told. @christophschuhmann

Shtoner commented 1 year ago

259

sedthh commented 1 year ago

We can mass-download the YouTube subtitles associated with the videos. That way we will also have translations of the same English text in multiple languages, so there is no need to transcribe them automatically. We should also get the annotations if possible and overwrite the parts of the speech where Khan makes a slight error (those annotations usually overlap errors made on screen so the video does not have to be re-recorded).

I think it would also be worthwhile to scrape the user questions and answers below the videos (after filtering for quality). Those are already in a chat-style Q&A format.

I am currently going through their Terms of Service to ensure it is fine to use their content for training language models.

SnappierSoap318 commented 1 year ago

YT-DLP has the option to save subtitles
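
For example, a rough sketch with the yt-dlp Python API rather than the CLI (option names as documented by yt-dlp; the channel URL is the Khan Academy channel linked above):

import yt_dlp

# save both uploaded and auto-generated subtitles, in every available
# language, without downloading the video files themselves
ydl_opts = {
    "writesubtitles": True,
    "writeautomaticsub": True,
    "subtitleslangs": ["all"],
    "skip_download": True,
    "outtmpl": "subs/%(id)s.%(ext)s",
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/@khanacademy/videos"])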

sedthh commented 1 year ago

Nice, I was thinking about https://pytube.io/en/latest/user/captions.html
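
Something along the lines of the linked docs, as an untested sketch (reusing the video id from the example above; pytube tends to break whenever YouTube changes things):

from pytube import YouTube

# fetch the English caption track for a video and render it as SRT
yt = YouTube("https://www.youtube.com/watch?v=Sqqt1kU52I8")
caption = yt.captions.get_by_language_code("en")

if caption is not None:
    with open("captions_en.srt", "w") as f:
        f.write(caption.generate_srt_captions())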

sedthh commented 1 year ago

via https://www.khanacademy.org/about/tos

8. Prohibited Conduct
YOU AGREE NOT TO:
[...]
8.7. develop, support or use software, devices, scripts, robots, or any other means or processes (including crawlers, browser plugins and add-ons, or any other technology) to scrape the Services or otherwise copy lessons and other data from the Services;
8.8. Use bots or other automated methods to access the Services;

So we should contact either privacy@khanacademy.org or notices@khanacademy.org for approval before scraping their data.

marianna13 commented 1 year ago

Oh, does that mean we can't scrape their data?

marianna13 commented 1 year ago

Yt-dlp is better. It has more options and fewer limitations.

sedthh commented 1 year ago

We should ask for their approval first. And even if we get approval, we should be extremely careful when scraping:

  1. sponsored videos (they have different licences)
    Unless expressly indicated on the Services that a particular item of Licensed Educational Content is made available to Users under alternate license terms, you may not download, distribute, sell, lease, modify, or otherwise provide access to the Licensed Educational Content to any third party.
  2. user-generated data (most of the users are children)

SnappierSoap318 commented 1 year ago

Aren't we technically scraping their content from YouTube and not from their site, so doesn't it come under YouTube's ToS?

sedthh commented 1 year ago

I guess the 3rd party's ToS would take precedence in that case, and scraping YouTube only should be OK?

Does anyone else have a take on this?

fcolecumberri commented 1 year ago

#! /bin/bash

CHANNEL='https://www.youtube.com/@khanacademy'

# list every video URL on the channel without downloading anything
VIDEO_URLS=$(yt-dlp -j --flat-playlist "$CHANNEL" | jq -r '.url')

# fetch only the subtitle files (all available languages) for each video
for VIDEO_URL in $VIDEO_URLS
do
    youtube-dl --write-sub --all-subs --skip-download "$VIDEO_URL"
done

by the way, yt-dlp does not do the trick.

Shtoner commented 1 year ago

@fcolecumberri yt-dlp works from the command line for me. I am currently downloading Khan Academy's audio files. Planning to use Whisper for the text.
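
For anyone following along, a rough sketch of an audio-only download with the yt-dlp Python API (assumes ffmpeg is installed for the extraction step; the output template is arbitrary):

import yt_dlp

# download only the best available audio stream and convert it to mp3
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "audio/%(id)s.%(ext)s",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",
        "preferredcodec": "mp3",
    }],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/@khanacademy/videos"])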

fcolecumberri commented 1 year ago

@Shtoner I meant that yt-dlp didn't work well with the --write-sub --all-subs flags.

bitplane commented 1 year ago

I think it's also possible to use whisper https://openai.com/blog/whisper/ for getting transcripts. I'll give it a try.

I've tried this with a few videos for translation, and it seems to work for a few minutes, then gets stuck repeating the same thing over and over. Dunno if anyone has had better luck with it, but I couldn't get it to work without breaking the files into segments first, and then breaking them in the wrong places messes things up.
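
One possible workaround, sketched under the assumption that fixed-length cuts are acceptable even though they can land mid-sentence: split the audio into chunks with ffmpeg and transcribe each chunk separately (chunk length and file names are placeholders):

import glob
import subprocess
import whisper

# split the input audio into 10-minute chunks without re-encoding
subprocess.run([
    "ffmpeg", "-i", "lecture.mp3",
    "-f", "segment", "-segment_time", "600",
    "-c", "copy", "chunk_%03d.mp3",
], check=True)

# transcribe each chunk separately and stitch the text back together
model = whisper.load_model("base")
texts = []
for chunk in sorted(glob.glob("chunk_*.mp3")):
    result = model.transcribe(chunk)
    texts.append(result["text"])

with open("lecture_transcript.txt", "w") as f:
    f.write("\n".join(texts))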

escottgoodwin commented 1 year ago

I think it would be legal to scrape anything on the Khan Academy site that doesn't require a login to access. This 9th Circuit decision against LinkedIn is pretty clear: a company was legally allowed to scrape all no-login-required data from LinkedIn even when this was against the ToS. There seems to be a good amount of videos and content on Khan Academy that doesn't require a login. https://www.shrm.org/resourcesandtools/hr-topics/technology/pages/scraping-public-data-from-linkedin-is-legal.aspx

"The 9th Circuit's latest decision relied on the Supreme Court's determination in Van Buren that when information is publicly accessible, no authorization to use that data is required.

The appellate court distinguished between access to publicly available profile information on LinkedIn, which cannot be "unauthorized," and access to information on sites which are restricted to users who sign in to the site with a username and password.

Tse said that what it boils down to is that companies that maintain publicly available information on their websites cannot rely on the CFAA to prohibit others from scraping that data, even if the companies subsequently revoke access to the information, or if data scraping is a violation of the websites' terms of use."

andreaskoepf commented 1 year ago

@Shtoner This issue is currently assigned to you. Are you still working on it?

escottgoodwin commented 1 year ago

No, I just left a comment about legality of downloading transcripts.

andreaskoepf commented 1 year ago

Closing old data issues that have not been completed by now.