algolia / youtube-captions-scraper

Fetch YouTube user-submitted captions, or fall back to auto-generated captions

It has stopped working #30

Open bholagourav opened 1 month ago

bholagourav commented 1 month ago

For a video which has captions, it throws the error message "Could not find captions for the video". On my local env it works fine.

dfdeagle47 commented 3 weeks ago

I'm also encountering the issue (probably since August 7th, 2024). From what I could understand so far, it's because the response from YouTube when fetching the page is missing the captionTracks property, hence the error triggered here in the code.
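
In other words, a minimal sketch of the failure mode (the exact error string and the videoId variable are assumptions, not the lib's actual code):

// Sketch: the scraper fetches the watch page and looks for the embedded
// "captionTracks" JSON; when YouTube omits it, the lookup fails.
const res = await fetch(`https://www.youtube.com/watch?v=${videoId}`);
const html = await res.text();
if (!html.includes('"captionTracks"')) {
  throw new Error(`Could not find captions for the video: ${videoId}`);
}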

I see that the YouTube response seems to differ based on the environment. If I run this command:

wget -qO- 'https://www.youtube.com/watch?v=qhszd_wqAgQ' | grep 'captionTracks'

locally, I will get a match. However, when running it on the server, there are no matches.

Maybe YouTube is A/B testing something and serving different HTML content based on the location. Or maybe they decided to remove this property when the request comes from a server in a data center.

Technically, this is an unofficial way to retrieve the captions. AFAIK, the official way only allows retrieving the captions for your own videos, so they might not like other platforms scraping the captions...

NikeshCohen commented 2 weeks ago

Mmmm, that is quite unfortunate. I've been trying to retrieve captions from videos as well. An alternative I've found is this link that is present in each YouTube video page:

https://www.youtube.com/api/timedtext?v=dq3j-NTqJX4&ei=3gjLZrnMCsGdp-oPnOLjuQI&caps=asr&opi=112496729&exp=xbt&xoaf=4&hl=en-GB&ip=0.0.0.0&ipbits=0&expire=1724607310&sparams=ip,ipbits,expire,v,ei,caps,opi,exp,xoaf&signature=66D15C767604C769FCB036A11473B586E39C505B.5B635F4CF960F33FA808ABA17577D5E941468A75&key=yt8&kind=asr&lang=en

But I see it's also left out when fetching the page from a server. There are free sites like https://downsub.com/ that extract captions from videos; it's a free tool, so I doubt they are using an extreme way to fetch the captions, as that would cost money. Any ideas for ways around this?

dfdeagle47 commented 2 weeks ago

I was looking at other libs like @os-team/youtube-captions to see their approach to extracting the captions.

That lib uses yt-dlp to extract the captions, so I tried running this tool on the server to see what would happen (it works locally):

$ yt-dlp --list-subs 'https://www.youtube.com/watch?v=qhszd_wqAgQ'

[youtube] Extracting URL: https://www.youtube.com/watch?v=qhszd_wqAgQ
[youtube] qhszd_wqAgQ: Downloading webpage
[youtube] qhszd_wqAgQ: Downloading ios player API JSON
[youtube] qhszd_wqAgQ: Downloading web creator player API JSON
ERROR: [youtube] qhszd_wqAgQ: Sign in to confirm you’re not a bot. This helps protect our community. Learn more

So it seems that YouTube is detecting that the request might be coming from a bot (based on the IP I suppose).

Maybe it's possible to make it work by signing in and passing the cookie to yt-dlp (e.g. via its --cookies option). However, I expect that this will be rate-limited if you're making too many requests.
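
For the page-scraping approach, the equivalent idea would be something like this (an untested sketch; the cookie value is a placeholder you'd copy from a signed-in browser session):

// Untested sketch: forward a signed-in session cookie when fetching the
// watch page, in case that bypasses the bot check. Value is a placeholder.
const res = await fetch(`https://www.youtube.com/watch?v=${videoId}`, {
  headers: {
    Cookie: 'SID=<session-cookie-from-signed-in-browser>',
  },
});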

Also, some links that might be relevant:

NikeshCohen commented 2 weeks ago

Thanks for that info! Honestly, it's quite crappy that YouTube is blocking requests from servers (although I understand why they would want to).

Another alternative I stumbled upon is directly accessing captions via the YouTube API; however, it costs 220 tokens per request, so it's not a very scalable option.

dfdeagle47 commented 2 weeks ago

I stumbled upon this code: https://stackoverflow.com/a/70013529

It uses a different API (YouTube's internal API) to retrieve the captions.

I tested the Python code on my server and it seemed to work in principle. I didn't check if the response contains everything, but I'll investigate some more. I want to translate the code to Node.js and see how it behaves.

Although, I don't know how scalable it is. Your server IP might get banned if you make too many calls, perhaps.

dfdeagle47 commented 2 weeks ago

> Thanks for that info! Honestly, it's quite crappy that YouTube is blocking requests from servers (although I understand why they would want to).
>
> Another alternative I stumbled upon is directly accessing captions via the YouTube API; however, it costs 220 tokens per request, so it's not a very scalable option.

When you say the YouTube API, are you talking about the official YouTube Data API v3? The limitation I had is that it only allows you to retrieve the captions for videos you own.

NikeshCohen commented 2 weeks ago

> When you say the YouTube API, are you talking about the official YouTube Data API v3? The limitation I had is that it only allows you to retrieve the captions for videos you own.

Correct, yes. Ah, I see. That's not very helpful then; I haven't implemented it yet, it was just info I found. So that option is off the table.

NikeshCohen commented 2 weeks ago

> I stumbled upon this code: https://stackoverflow.com/a/70013529
>
> It uses a different API (YouTube's internal API) to retrieve the captions.
>
> I tested the Python code on my server and it seemed to work in principle. I didn't check if the response contains everything, but I'll investigate some more. I want to translate the code to Node.js and see how it behaves.
>
> Although, I don't know how scalable it is. Your server IP might get banned if you make too many calls, perhaps.

Ah interesting, that's a similar method to what I pivoted to (the Stack Overflow link). It's most likely that you aren't getting any data, although I stand to be corrected; YouTube sends back an OK response, but it contains nothing useful at all.

For reference, this is the current logic I'm using. It doesn't utilize any lib, just the raw data from the response when sending a GET to the video page:


import { parseStringPromise } from "xml2js";

export const fetchVideoData = async (videoId: string) => {
  try {
    let transcription = "";
    let videoTitle;

    const res = await fetch(`https://www.youtube.com/watch?v=${videoId}`);
    const html = await res.text();

    const titleMatch = html.match(/<title>(.*?)<\/title>/i);
    if (titleMatch) {
      videoTitle = titleMatch[1];
    } else {
      videoTitle = "No title found, ignore this text";
    }

    const captionUrlMatch = html.match(/"captionTracks":.*?"baseUrl":"(.*?)"/);
    if (!captionUrlMatch) {
      throw new Error("Unable to fetch transcription from YouTube");
    }

    const captionUrl = captionUrlMatch[1].replace(/\\u0026/g, "&");
    const captionRes = await fetch(captionUrl);
    const captionXML = await captionRes.text();

    const parsedResult = await parseStringPromise(captionXML);

    if (
      parsedResult &&
      parsedResult.transcript &&
      parsedResult.transcript.text
    ) {
      parsedResult.transcript.text.forEach((textElement: any) => {
        if (textElement._) {
          transcription += textElement._ + " ";
        }
      });
    }

    return {
      transcription: transcription.trim(),
      videoTitle,
    };
  } catch (error) {
    throw new Error("Unable to fetch information from YouTube");
  }
};
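
For example (hypothetical usage with one of the video IDs from this thread):

// Example usage: logs the title and the transcript length.
fetchVideoData("qhszd_wqAgQ")
  .then(({ videoTitle, transcription }) => {
    console.log(videoTitle, transcription.length);
  })
  .catch(console.error);
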
dfdeagle47 commented 2 weeks ago

OK, I went down a bit of a rabbit hole to find alternatives.

InnerTube

First of all, I learned about InnerTube. The gist of it is that the YouTube website itself uses a different (private) API to talk to YouTube's backend from the browser.

There is a well-maintained JS lib YouTube.js. Unfortunately, it suffers from the same limitation. For instance, the code below:

import { Innertube } from 'youtubei.js';

async function main() {
  const youtube = await Innertube.create();

  const videoInfo = await youtube.getBasicInfo('pyX8kQ-JzHI');

  console.log(videoInfo);
}
main();

works fine locally, because you can retrieve the captions under videoInfo.captions.caption_tracks (beware: the response uses snake_case, unlike the lib's documentation).
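
For instance (a sketch assuming the usual shape of caption_tracks, where each track exposes language_code and base_url):

// Sketch: list the available caption tracks when running locally.
const tracks = videoInfo.captions?.caption_tracks ?? [];
for (const track of tracks) {
  console.log(track.language_code, track.base_url);
}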

However, it returns the same error as always on the server:

VideoInfo {
  basic_info: {
    embed: null,
    channel: null,
    is_unlisted: undefined,
    is_family_safe: undefined,
    category: null,
    has_ypc_metadata: null,
    start_timestamp: null,
    end_timestamp: null,
    view_count: undefined,
    url_canonical: null,
    tags: null,
    like_count: undefined,
    is_liked: undefined,
    is_disliked: undefined
  },
  annotations: undefined,
  storyboards: undefined,
  endscreen: undefined,
  captions: undefined,
  cards: undefined,
  streaming_data: undefined,
  playability_status: {
    status: 'LOGIN_REQUIRED',
    reason: 'Sign in to confirm you’re not a bot',
    embeddable: false,
    audio_only_playablility: null,
    error_screen: PlayerErrorMessage {
      type: 'PlayerErrorMessage',
      subreason: [Text],
      reason: [Text],
      proceed_button: [Button],
      thumbnails: [Array],
      icon_type: 'ERROR_OUTLINE'
    }
  },
  player_config: undefined
}

get_transcript PoC

Here's code I got working on the server:

const axios = require('axios');
const protobuf = require('protobufjs');
const Buffer = require('buffer').Buffer;

const VIDEO_ID = 'pyX8kQ-JzHI';

function getBase64Protobuf(message) {
  const root = protobuf.Root.fromJSON({
    nested: {
      Message: {
        fields: {
          param1: { id: 1, type: 'string' },
          param2: { id: 2, type: 'string' },
        },
      },
    },
  });
  const MessageType = root.lookupType('Message');

  const buffer = MessageType.encode(message).finish();

  return Buffer.from(buffer).toString('base64');
}

async function main() {
  try {
    const message1 = {
      param1: 'asr',
      param2: 'en',
    };

    const protobufMessage1 = getBase64Protobuf(message1);

    const message2 = {
      param1: VIDEO_ID,
      param2: protobufMessage1,
    };

    const params = getBase64Protobuf(message2);

    const url = 'https://www.youtube.com/youtubei/v1/get_transcript';
    const headers = { 'Content-Type': 'application/json' };
    const data = {
      context: {
        client: {
          clientName: 'WEB',
          // clientVersion: '2.20240826',
          clientVersion: '2.20240826.01.00',
        },
      },
      params,
    };

    const response = await axios.post(url, data, { headers });

    let output =
      response.data.actions[0].updateEngagementPanelAction.content.transcriptRenderer.content.transcriptSearchPanelRenderer.body.transcriptSegmentListRenderer.initialSegments.map(
        (segment) => {
          const { endMs, startMs, snippet } = segment.transcriptSegmentRenderer;

          const text = snippet.runs.map((run) => run.text).join('');

          return {
            start: parseInt(startMs) / 1000,
            dur: (parseInt(endMs) - parseInt(startMs)) / 1000,
            text,
          };
        },
      );

    console.log(output);
  } catch (err) {
    console.error('Error:', err);
  }
}

main();

The idea was to have output use more or less the same interface as this lib (although I noticed that in this lib, start and dur are never cast to Number, so they're actually Strings, which is wrong).

It calls the youtubei/v1/get_transcript endpoint with the proper protobuf message. It assumes you know the language you want to retrieve, though.

Regarding message1

This is the version you want to use for automatically-generated captions:

const message1 = {
  param1: 'asr',
  param2: 'en',
};

And this is the version you want to use if you want captions that were uploaded by the creator:

const message1 = {
  param2: 'en',
};

(obviously, the code currently crashes if you don't use the right message and there are no captions for the given message1 params)
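
If you don't know in advance which kind of track exists, one option is to try both variants (a hypothetical sketch; getTranscript stands for a wrapper around the PoC above that takes a message1 object and returns the parsed segments, or throws):

// Hypothetical fallback: try uploader-provided captions first, then ASR.
async function getTranscriptAnyKind(videoId, lang) {
  const variants = [{ param2: lang }, { param1: 'asr', param2: lang }];
  for (const message1 of variants) {
    try {
      return await getTranscript(videoId, message1);
    } catch (err) {
      // No captions for this variant; try the next one.
    }
  }
  throw new Error(`No captions found for video: ${videoId}`);
}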

Invidious approach

I found some interesting info on the Invidious repo:

I'm not familiar with the Crystal programming language, so it's a bit harder to navigate, but I wanted to check how they retrieve the video info and the list of available captions in order to pick the "default" caption. I might try another approach, though.

dfdeagle47 commented 2 weeks ago

Here's my last attempt of the day.

I've implemented a getDefaultSubtitleLanguage function which attempts to retrieve the default language that should be used for the subtitles. It relies on two YouTube Data API v3 endpoints: videos.list (to read the video's defaultLanguage / defaultAudioLanguage) and captions.list (to list the available caption tracks).

This is optional of course, and you could just attempt the different variations if you know the languages in advance.

Although it uses your API quota, those two endpoints work with just an API key, unlike https://developers.google.com/youtube/v3/docs/captions/download, which only works with Google OAuth 2.0 and only for your own videos.

I used Invidious' code for inspiration about the parsing of the YouTube response when fetching the transcript.

Ignoring the quota issue, I don't know what would be the rate-limit for the /youtubei/v1/get_transcript endpoint though...

Code

import { youtube_v3 } from '@googleapis/youtube';
import axios from 'axios';
import { Buffer } from 'buffer';
import protobuf from 'protobufjs';

const youtubeClient = new youtube_v3.Youtube({
  auth: '<YOUR-YOUTUBE-API-KEY>',
});

/**
 * Helper function to encode a message into a base64-encoded protobuf
 * to be used with the YouTube InnerTube API.
 * @param {Object} message - The message to encode
 * @returns {String} - The base64-encoded protobuf message
 */
function getBase64Protobuf(message) {
  const root = protobuf.Root.fromJSON({
    nested: {
      Message: {
        fields: {
          param1: { id: 1, type: 'string' },
          param2: { id: 2, type: 'string' },
        },
      },
    },
  });
  const MessageType = root.lookupType('Message');

  const buffer = MessageType.encode(message).finish();

  return Buffer.from(buffer).toString('base64');
}

/**
 * Returns the default subtitle language of a video on YouTube.
 * @param {Object} options - The options
 * @param {String} options.videoId - The ID of the video
 * @returns {Promise<{ trackKind: String, language: String }>} - The default subtitle language and the track kind (e.g., 'asr' or 'standard').
 */
async function getDefaultSubtitleLanguage({ videoId }) {
  // Get video default language
  const videos = await youtubeClient.videos.list({
    part: ['snippet'],
    id: [videoId],
  });

  if (videos.data.items.length !== 1) {
    throw new Error(`Expected exactly one video for ID: ${videoId}`);
  }

  const preferredLanguage =
    videos.data.items[0].snippet.defaultLanguage ||
    videos.data.items[0].snippet.defaultAudioLanguage;

  // Get available subtitles
  const subtitles = await youtubeClient.captions.list({
    part: ['snippet'],
    videoId: videoId,
  });

  if (subtitles.data.items.length < 1) {
    throw new Error(`No subtitles found for video: ${videoId}`);
  }

  const { trackKind, language } = (
    subtitles.data.items.find(
      (sub) => sub.snippet.language === preferredLanguage,
    ) || subtitles.data.items[0]
  ).snippet;

  return { trackKind, language };
}

/**
 * Helper function to extract text from certain elements.
 * Inspired by Invidious' extractors_utils.cr
 * https://github.com/iv-org/invidious/blob/384a8e200c953ed5be3ba6a01762e933fd566e45/src/invidious/yt_backend/extractors_utils.cr#L1-L30
 * @param {Object} item - The item to extract text from.
 * @returns {string} The extracted text.
 */
function extractText(item) {
  return item.simpleText || item.runs?.map((run) => run.text).join('');
}

/**
 * Function to retrieve subtitles for a given YouTube video.
 * @param {Object} options - The options for retrieving subtitles
 * @param {String} options.videoId - The ID of the video
 * @param {String} options.trackKind - The track kind of the subtitles (e.g., 'asr' or 'standard')
 * @param {String} options.language - The language of the subtitles
 * @returns {Promise<Array<{ start: Number, dur: Number, text: String }>>} - The subtitles of the video
 */
async function getSubtitles({ videoId, trackKind, language }) {
  const message = {
    param1: videoId,
    param2: getBase64Protobuf({
      // Only include `trackKind` for automatically-generated subtitles
      param1: trackKind === 'asr' ? trackKind : null,
      param2: language,
    }),
  };

  const params = getBase64Protobuf(message);

  const url = 'https://www.youtube.com/youtubei/v1/get_transcript';
  const headers = { 'Content-Type': 'application/json' };
  const data = {
    context: {
      client: {
        clientName: 'WEB',
        clientVersion: '2.20240826.01.00',
      },
    },
    params,
  };

  const response = await axios.post(url, data, { headers });

  // Mapping inspired by Invidious' transcript.cr
  // https://github.com/iv-org/invidious/blob/432c25ad8626fee401b1f349b463515d21718ac8/src/invidious/videos/transcript.cr#L51-L101
  const initialSegments =
    response.data.actions[0].updateEngagementPanelAction.content
      .transcriptRenderer.content.transcriptSearchPanelRenderer.body
      .transcriptSegmentListRenderer.initialSegments;

  if (!initialSegments) {
    throw new Error(
      `Requested transcript does not exist for video: ${videoId}`,
    );
  }

  const output = initialSegments.map((segment) => {
    const line =
      segment.transcriptSectionHeaderRenderer ||
      segment.transcriptSegmentRenderer;

    const { endMs, startMs, snippet } = line;

    const text = extractText(snippet);

    return {
      start: parseInt(startMs) / 1000,
      dur: (parseInt(endMs) - parseInt(startMs)) / 1000,
      text,
    };
  });

  return output;
}

//////////////
//////////////

async function main({ videoId }) {
  try {
    const { language, trackKind } = await getDefaultSubtitleLanguage({
      videoId,
    });

    const subtitles = await getSubtitles({
      language,
      trackKind,
      videoId,
    });

    console.log(subtitles);
  } catch (err) {
    console.error('Error:', err);
  }
}

// Video with ASR captions
main({ videoId: 'pyX8kQ-JzHI' });
// Video with uploaded captions
main({ videoId: '-16RFXr44fY' });
// Video with multiple caption tracks (`defaultAudioLanguage: 'ru'`)
main({ videoId: 'qwQwSTWHTAY' });
NikeshCohen commented 2 weeks ago

Bro is absolutely cooking 👨‍🍳 Will have a crack at these ideas and see if it works. The main issue is that YouTube seems to be cracking down on scrapers like crazy, so even if we end up finding a solid solution, it's only a matter of time before they block that as well.

The reasoning behind my statement: https://youtube.com/shorts/xiJMjTnlxg4?si=TXnwg3NnbBK2UPG1

NikeshCohen commented 2 weeks ago

> OK, I went down a bit of a rabbit hole to find alternatives.

BRO YOU ARE A LEGEND, got it working 💪. Would be sick to connect with you and pick your brain a bit regarding your thought process; my social links are in my bio. If not, that's totally cool. Thank you! 🐐