LuanRT / YouTube.js

A wrapper around YouTube's internal API — reverse engineering InnerTube
https://www.npmjs.com/package/youtubei.js
MIT License
3.48k stars, 219 forks

Is there a transcription API? #453

Closed bezaleel22 closed 1 year ago

bezaleel22 commented 1 year ago

Question

Is there any support for fetching video transcripts from YouTube using this library? If not, it would be nice to have this feature. Thanks for this great work.

tomByrer commented 1 year ago

Might not be as convenient, but I have my own fork of a YouTube transcription downloader. I forked it so I could include basic HTML formatting and shape the JSON how I wanted it. https://github.com/tomByrer/youtube-captions-scraper

seomikewaltman commented 1 year ago

You can extend the Innertube class to do it.

// InnerYouTube.js
import { Innertube, Utils } from 'youtubei.js';

export default class InnerYouTube extends Innertube {

    // Scrape the watch page for the params token that the get_transcript endpoint expects.
    async getTranscriptsParameters(id) {
        if (!id) throw new Utils.MissingParamError('Video id is missing');

        const uri = `/watch?v=${id}`;
        const response = await this.session.http.fetch(uri, { method: 'GET', baseURL: 'https://www.youtube.com' })
            .then(r => r.text());

        const params = Utils.getStringBetweenStrings(response, 'getTranscriptEndpoint":{"params":"', '"}}}},');
        if (params) return params;

        throw new Utils.ParsingError(`getTranscriptEndpoint not found ${id}`);
    }

    // POST the params token to youtubei/v1/get_transcript and return the raw JSON response.
    async getTranscript(id) {
        if (!id) throw new Utils.MissingParamError('Video id is missing');

        const params = await this.getTranscriptsParameters(id);
        const url = `/get_transcript?key=${this.key}`;
        const context = this.session.context;
        const opts = {
            method: 'POST',
            body: JSON.stringify({
                context,
                params,
            }),
            baseURL: 'https://www.youtube.com/youtubei/v1'
        };

        const response = await this.session.http.fetch(url, opts)
            .then(r => r.json());

        return response;
    }

}
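
To see what the marker-based extraction in getTranscriptsParameters() is doing, here is a minimal plain-JavaScript sketch of the same idea. The function name `extractBetween` and the sample string are made up for illustration; this is not the library's implementation of Utils.getStringBetweenStrings, just an equivalent under the assumption that the watch page embeds the token between those two markers.

```javascript
// Hypothetical stand-in for Utils.getStringBetweenStrings (illustration only).
// Returns the text between the first startMarker and the next endMarker,
// or undefined when either marker is missing.
function extractBetween(text, startMarker, endMarker) {
  const start = text.indexOf(startMarker);
  if (start === -1) return undefined;
  const from = start + startMarker.length;
  const end = text.indexOf(endMarker, from);
  if (end === -1) return undefined;
  return text.slice(from, end);
}

// Made-up fragment shaped like the JSON embedded in a watch page.
const sampleHtml = 'x"getTranscriptEndpoint":{"params":"ABC123%3D"}}}},y';

const params = extractBetween(
  sampleHtml,
  'getTranscriptEndpoint":{"params":"',
  '"}}}},'
);
console.log(params); // ABC123%3D
```

Because the end marker depends on exactly how YouTube serializes the page, this kind of scraping can stop matching without notice, which is why the class throws a ParsingError when nothing is found.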

Instantiation is a bit different, since the library uses a static create() method to generate the instance.

import { Session } from 'youtubei.js';
import InnerYouTube from './InnerYouTube.js';

// Normal way
// const innertube = await Innertube.create(...options_go_here);

// Extending youtubei.js with your own class
const innertube = new InnerYouTube(await Session.create(...options_go_here));
const transcripts_json = await innertube.getTranscript(videoId);

// parse the json as you see fit
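
As one way to approach the parsing step: the shape of the get_transcript response is not described in this thread, so the sketch below avoids hard-coding a path and instead collects every value stored under a given key, wherever it is nested. The helper name `collectByKey`, the key `cue`, and the sample object are all made up for illustration; inspect your own response to find the real key names.

```javascript
// Recursively collect every value stored under `key` anywhere in a JSON tree.
function collectByKey(node, key, found = []) {
  if (Array.isArray(node)) {
    for (const item of node) collectByKey(item, key, found);
  } else if (node !== null && typeof node === 'object') {
    for (const [k, v] of Object.entries(node)) {
      if (k === key) found.push(v);
      collectByKey(v, key, found);
    }
  }
  return found;
}

// Made-up response: these key names are assumptions, not a documented schema.
const sample = {
  actions: [{
    panel: {
      segments: [
        { cue: { text: 'Hello', startMs: '0' } },
        { cue: { text: 'world', startMs: '1200' } },
      ],
    },
  }],
};

const lines = collectByKey(sample, 'cue').map((c) => `${c.startMs}ms ${c.text}`);
console.log(lines.join('\n'));
```

Searching by key is slower than following a fixed path, but it keeps working if the nesting around the segments changes.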

This approach requests the youtube.com/watch page to get the parameters needed by the youtubei/v1 get_transcript endpoint, so you will hit a reCAPTCHA if you do a lot of these per minute. You'll want to use a proxy.

import { Session } from 'youtubei.js';
import InnerYouTube from './InnerYouTube.js';
import { ProxyAgent, fetch } from 'undici';

const proxyClient = new ProxyAgent('http://yourproxyhere');

// Route every request the session makes through the proxy.
const innertube = new InnerYouTube(await Session.create({
    fetch: async (input, init) => {
        return fetch(input, { ...init, dispatcher: proxyClient });
    },
}));
const transcripts_json = await innertube.getTranscript(videoId);
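
A proxy is what this thread suggests. Spacing the requests out is a complementary option that was not tested here, so treat the following as a rough sketch. The helper name `fetchSequentially` and the delay value in the usage comment are assumptions, and no particular delay is known to avoid the captcha.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` for each id one at a time, pausing `delayMs` between calls.
async function fetchSequentially(ids, task, delayMs) {
  const results = [];
  for (let i = 0; i < ids.length; i++) {
    if (i > 0) await sleep(delayMs);
    results.push(await task(ids[i]));
  }
  return results;
}

// Usage with the class from this thread (assumes `innertube` is an
// InnerYouTube instance and `videoIds` is an array of video ids):
// const transcripts = await fetchSequentially(
//   videoIds,
//   (id) => innertube.getTranscript(id),
//   5000 // arbitrary delay in ms, chosen for illustration
// );
```

Running the calls strictly in sequence keeps the request rate predictable, at the cost of total time growing linearly with the number of videos.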