Kakulukian / youtube-transcript

Fetch transcript from a youtube video
308 stars 61 forks

[YoutubeTranscript] 🚨 TypeError: Cannot read properties of undefined (reading 'transcriptBodyRenderer') #19

Closed Ozanaydinn closed 7 months ago

Ozanaydinn commented 7 months ago

Hello, I'm not sure if this project is still getting maintenance but still wanted to create an issue for this!

We are using v1.1.0 in our project and this morning suddenly we started getting this error :

Error message: [YoutubeTranscript] 🚨 TypeError: Cannot read properties of undefined (reading 'transcriptBodyRenderer')

I also tried to run the example code in a brand new project but still the same error, so I guess that eliminates any errors that might have happened on our side. Any help regarding the issue would be greatly appreciated!

Klajver07 commented 7 months ago

Hello, I'm having the same problem @Ozanaydinn

PaulBratslavsky commented 7 months ago

I haven't checked in too much detail what is happening but I think it has something to do with YouTube itself.

https://github.com/Kakulukian/youtube-transcript/blob/0c6c6e7ed226ab2be2e0ebc94d8f6480b10aa3c0/src/index.ts#L52

Maybe this line of code is no longer getting the correct API.

canadaduane commented 7 months ago

It looks like the API must have changed. There is no longer a .body property in actions[0].updateEngagementPanelAction.content.transcriptRenderer:

In other words this line is failing:

const transcripts =
  body.actions[0].updateEngagementPanelAction.content
    .transcriptRenderer.body.transcriptBodyRenderer.cueGroups
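
A minimal sketch of how that access could be guarded so a missing property yields null instead of a TypeError. It assumes only the response shape quoted above; the `TranscriptResponse` type and the `extractCueGroups` helper are hypothetical names for illustration, not part of the library.

```typescript
// Hypothetical, loosely typed view of the response shape quoted above.
type TranscriptResponse = {
  actions?: Array<{
    updateEngagementPanelAction?: {
      content?: {
        transcriptRenderer?: {
          body?: { transcriptBodyRenderer?: { cueGroups?: unknown[] } };
        };
      };
    };
  }>;
};

// Returns the cue groups, or null when any link in the chain is missing,
// instead of throwing "Cannot read properties of undefined".
function extractCueGroups(body: TranscriptResponse): unknown[] | null {
  return (
    body.actions?.[0]?.updateEngagementPanelAction?.content
      ?.transcriptRenderer?.body?.transcriptBodyRenderer?.cueGroups ?? null
  );
}
```

This does not bring the transcript back; it only turns the crash into a value the caller can check.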
canadaduane commented 7 months ago

Does anyone know more about the /timedtext internal API? It seems to provide the transcript data, but is behind a signature field.

timothycarambat commented 7 months ago

Yeah, looks like the endpoint was killed - RIP. This script emulates a few steps of the library, but instead of going through INNERTUBE it reads the signed URL for that session from the watch page and uses it to fetch /timedtext

Probably very flaky and requires an HTML parser

import { parse } from "node-html-parser";

const PAGE = await fetch("https://www.youtube.com/watch?v=bZQun8Y4L2A")
  .then((res) => res.text())
  .then((html) => parse(html));

const scripts = PAGE.getElementsByTagName("script");
const playerScript = scripts.find((script) =>
  script.textContent.includes("var ytInitialPlayerResponse = {"),
);

const dataString = playerScript.textContent
  ?.split("var ytInitialPlayerResponse = ")?.[1]
  ?.slice(0, -1);
const data = JSON.parse(dataString.trim());
const captionsUrl =
  data.captions.playerCaptionsTracklistRenderer.captionTracks[0].baseUrl;

const resXML = await fetch(captionsUrl)
  .then((res) => res.text())
  .then((xml) => parse(xml));

let transcript = ""; // start from an empty string so the output is not prefixed with "undefined"
const chunks = resXML.getElementsByTagName("text");
for (const chunk of chunks) {
  transcript += chunk.textContent;
}
console.log(transcript); // :)
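
One reason the script above may be flaky is that it trims the inline script with `slice(0, -1)`, which only works if the object is the last thing in the tag. A possible alternative, sketched here under that assumption, is to scan for the balanced closing brace. The `extractJsonObject` helper is a hypothetical name, and it assumes the embedded object is valid JSON with double-quoted strings.

```typescript
// Hypothetical helper: returns the first balanced {...} object that follows
// `marker` in `source`, or null if the marker or a balanced object is missing.
function extractJsonObject(source: string, marker: string): string | null {
  const markerIndex = source.indexOf(marker);
  if (markerIndex === -1) return null;
  const start = source.indexOf("{", markerIndex + marker.length);
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      // Inside a string literal braces do not count; only track escapes
      // and the closing quote.
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return source.slice(start, i + 1);
    }
  }
  return null; // unbalanced input
}
```

It would be called as `extractJsonObject(playerScript.textContent, "var ytInitialPlayerResponse = ")`, with the result passed to `JSON.parse`.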
SchmitzAndrew commented 7 months ago

Appreciate the quick fix!

timothycarambat commented 7 months ago

If anybody is looking at making a class for this. Here is my currently working example for a quick patch. No promises it will keep working :)

const { parse } = require("node-html-parser");
const RE_YOUTUBE =
  /(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=)|youtu\.be\/)([^"&?\/\s]{11})/i;
const USER_AGENT =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)";

class YoutubeTranscriptError extends Error {
  constructor(message) {
    super(`[YoutubeTranscript] ${message}`);
  }
}

/**
 * Class to retrieve the transcript, if one exists
 */
class YoutubeTranscript {
  /**
   * Fetch transcript from YTB Video
   * @param videoId Video url or video identifier
   * @param config Object with lang param (eg: en, es, hk, uk) format.
   * Will just grab the first caption if it can find one, so no special lang caption support.
   */
  static async fetchTranscript(videoId, config = {}) {
    const identifier = this.retrieveVideoId(videoId);
    const lang = config?.lang ?? "en";
    try {
      const transcriptUrl = await fetch(
        `https://www.youtube.com/watch?v=${identifier}`,
        {
          headers: {
            "User-Agent": USER_AGENT,
          },
        }
      )
        .then((res) => res.text())
        .then((html) => parse(html))
        .then((html) => this.#parseTranscriptEndpoint(html, lang));

      if (!transcriptUrl)
        throw new Error("Failed to locate a transcript for this video!");

      // Result is hopefully some XML.
      const transcriptXML = await fetch(transcriptUrl)
        .then((res) => res.text())
        .then((xml) => parse(xml));

      let transcript = "";
      const chunks = transcriptXML.getElementsByTagName("text");
      for (const chunk of chunks) {
        transcript += chunk.textContent;
      }

      return transcript;
    } catch (e) {
      throw new YoutubeTranscriptError(e);
    }
  }

  static #parseTranscriptEndpoint(document, langCode = null) {
    try {
      // Get all script tags on document page
      const scripts = document.getElementsByTagName("script");

      // find the player data script.
      const playerScript = scripts.find((script) =>
        script.textContent.includes("var ytInitialPlayerResponse = {")
      );

      const dataString =
        playerScript.textContent
          ?.split("var ytInitialPlayerResponse = ")?.[1] //get the start of the object {....
          ?.split("};")?.[0] + // chunk off any code after object closure.
        "}"; // add back that curly brace we just cut.

      const data = JSON.parse(dataString.trim()); // Attempt a JSON parse
      const availableCaptions =
        data?.captions?.playerCaptionsTracklistRenderer?.captionTracks || [];

      // If languageCode was specified then search for its code, otherwise get the first.
      let captionTrack = availableCaptions?.[0];
      if (langCode)
        captionTrack =
          availableCaptions.find((track) =>
            track.languageCode.includes(langCode)
          ) ?? availableCaptions?.[0];

      return captionTrack?.baseUrl;
    } catch (e) {
      console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`);
      return null;
    }
  }

  /**
   * Retrieve video id from url or string
   * @param videoId video url or video id
   */
  static retrieveVideoId(videoId) {
    if (videoId.length === 11) {
      return videoId;
    }
    const matchId = videoId.match(RE_YOUTUBE);
    if (matchId && matchId.length) {
      return matchId[1];
    }
    throw new YoutubeTranscriptError(
      "Impossible to retrieve Youtube video ID."
    );
  }
}

module.exports = {
  YoutubeTranscript,
  YoutubeTranscriptError,
};
sbbeez commented 7 months ago

My code relied entirely on the TranscriptionResponse schema from the package, leading to the entire API breaking down.

I made a slight adjustment to @timothycarambat 's code (which, by the way, functions flawlessly 🫡) to ensure we maintain the same signature:

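   // use the following code snippet at the end of `fetchTranscript`,
   // in place of the loop that concatenates `chunk.textContent`
    const transcriptions = [];
    // strips the attribute name and quotes, then converts seconds to milliseconds
    const convertToMs = (text: string) =>
      parseFloat(text.split("=")[1].replace(/"/g, "")) * 1000;
    for (const chunk of chunks) {
      // relies on rawAttrs listing the offset first and the duration second
      const [offset, duration] = chunk.rawAttrs.split(" ");
      transcriptions.push({
        text: chunk.text,
        offset: convertToMs(offset),
        duration: convertToMs(duration),
      });
    }
    return transcriptions;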
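
The snippet above depends on the order of the attributes in `rawAttrs`. A small sketch of a lookup by attribute name instead, which would not break if the order changed. The `attrToMs` helper is hypothetical, and the attribute names `start` and `dur` are an assumption about the caption XML, not something confirmed in this thread.

```typescript
// Hypothetical helper: reads a numeric attribute (in seconds) by name from a
// raw attribute string such as `start="1.28" dur="3.5"` and returns whole
// milliseconds, or null when the attribute is absent or not a number.
// `name` is assumed to be a plain identifier with no regex metacharacters.
function attrToMs(rawAttrs: string, name: string): number | null {
  const match = rawAttrs.match(new RegExp(`(?:^|\\s)${name}="([^"]*)"`));
  if (!match) return null;
  const seconds = parseFloat(match[1]);
  return Number.isNaN(seconds) ? null : Math.round(seconds * 1000);
}
```

Inside the loop this would read `offset: attrToMs(chunk.rawAttrs, "start")` and `duration: attrToMs(chunk.rawAttrs, "dur")`, with the caller deciding what to do with a null.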
dcsan commented 7 months ago

is there a new published version?

dcsan commented 7 months ago

If anyone wants a TypeScript version, here it is, slightly cleaned up compared to the code above:

// https://github.com/Kakulukian/youtube-transcript/issues/19
// If anybody is looking at making a class for this. Here is my currently working example for a quick patch. No promises it will keep working :)

import { parse } from "node-html-parser"
const RE_YOUTUBE =
  /(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=)|youtu\.be\/)([^"&?\/\s]{11})/i
const USER_AGENT =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)"

class YoutubeTranscriptError extends Error {
  constructor(message: string) {
    super(`[YoutubeTranscript] ${message}`)
  }
}

type YtFetchConfig = {
  lang?: string // Object with lang param (eg: en, es, hk, uk) format.
}

/**
 * Class to retrieve the transcript, if one exists
 */
class YoutubeGrabTool {
  /**
   * Fetch transcript from YTB Video
   * @param videoId Video url or video identifier
   * @param config Object with lang param (eg: en, es, hk, uk) format.
   * Will just grab the first caption if it can find one, so no special lang caption support.
   */
  static async fetchTranscript(videoId: string, config: YtFetchConfig = {}) {
    const identifier = this.retrieveVideoId(videoId)
    const lang = config?.lang ?? "en"
    try {
      const transcriptUrl = await fetch(
        `https://www.youtube.com/watch?v=${identifier}`,
        {
          headers: {
            "User-Agent": USER_AGENT,
          },
        }
      )
        .then((res) => res.text())
        .then((html) => parse(html))
        .then((html) => this.#parseTranscriptEndpoint(html, lang))

      if (!transcriptUrl)
        throw new Error("Failed to locate a transcript for this video!")

      // Result is hopefully some XML.
      const transcriptXML = await fetch(transcriptUrl)
        .then((res) => res.text())
        .then((xml) => parse(xml))

      const chunks = transcriptXML.getElementsByTagName("text")

      function convertToMs(text: string) {
        const float = parseFloat(text.split("=")[1].replace(/"/g, "")) * 1000
        return Math.round(float)
      }

      let transcriptions = []
      for (const chunk of chunks) {
        const [offset, duration] = chunk.rawAttrs.split(" ")
        transcriptions.push({
          text: chunk.text,
          offset: convertToMs(offset),
          duration: convertToMs(duration),
        })
      }
      return transcriptions
    } catch (e: any) {
      throw new YoutubeTranscriptError(e)
    }
  }

  static #parseTranscriptEndpoint(document: any, langCode?: string) {
    try {
      // Get all script tags on document page
      const scripts = document.getElementsByTagName("script")

      // find the player data script.
      const playerScript = scripts.find((script: any) =>
        script.textContent.includes("var ytInitialPlayerResponse = {")
      )

      const dataString =
        playerScript.textContent
          ?.split("var ytInitialPlayerResponse = ")?.[1] //get the start of the object {....
          ?.split("};")?.[0] + // chunk off any code after object closure.
        "}" // add back that curly brace we just cut.

      const data = JSON.parse(dataString.trim()) // Attempt a JSON parse
      const availableCaptions =
        data?.captions?.playerCaptionsTracklistRenderer?.captionTracks || []

      // If languageCode was specified then search for its code, otherwise get the first.
      let captionTrack = availableCaptions?.[0]
      if (langCode)
        captionTrack =
          availableCaptions.find((track: any) =>
            track.languageCode.includes(langCode)
          ) ?? availableCaptions?.[0]

      return captionTrack?.baseUrl
    } catch (e: any) {
      console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`)
      return null
    }
  }

  /**
   * Retrieve video id from url or string
   * @param videoId video url or video id
   */
  static retrieveVideoId(videoId: string) {
    if (videoId.length === 11) {
      return videoId
    }
    const matchId = videoId.match(RE_YOUTUBE)
    if (matchId && matchId.length) {
      return matchId[1]
    }
    throw new YoutubeTranscriptError("Impossible to retrieve Youtube video ID.")
  }
}

export { YoutubeGrabTool, YoutubeTranscriptError }

used like

      const transcriptChunks = await YoutubeGrabTool.fetchTranscript(videoUrl)

changed the name so I can uninstall the other one.
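
Since this version returns an array of chunks rather than one string, a caller who wants plain text has to join them. A minimal sketch of that, assuming the `{ text, offset, duration }` shape from the class above with offsets in milliseconds. The `TranscriptChunk` type and both helper names are hypothetical.

```typescript
// Assumed chunk shape, matching what fetchTranscript pushes above.
type TranscriptChunk = { text: string; offset: number; duration: number };

// Formats a millisecond offset as m:ss (minutes are not capped at 59).
function formatOffset(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = String(totalSeconds % 60).padStart(2, "0");
  return `${minutes}:${seconds}`;
}

// Renders one "[m:ss] text" line per chunk.
function toTimestampedText(chunks: TranscriptChunk[]): string {
  return chunks
    .map((chunk) => `[${formatOffset(chunk.offset)}] ${chunk.text.trim()}`)
    .join("\n");
}
```

For a plain transcript without timestamps, `chunks.map((c) => c.text).join(" ")` would be enough.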

alexmartinezm commented 7 months ago

Hope we get a fixed version ASAP. Thanks for your contributions!

AbbasPlusPlus commented 7 months ago

I've made a TS version of the class @timothycarambat created. I've also added support for youtube shorts in the regex:

import { parse } from 'node-html-parser';
const USER_AGENT =
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)';

export class YoutubeTranscriptError extends Error {
  constructor(message: string) {
    super(`[YoutubeTranscript] ${message}`);
  }
}

export class YoutubeTranscript {
  /**
   * Fetch transcript from YouTube Video
   * @param videoId Video url or video identifier
   * @param config Object with lang param (eg: en, es, hk, uk) format.
   * Will just grab the first caption if it can find one, so no special lang caption support.
   */
  static async fetchTranscript(videoId: string, config: { lang?: string } = {}) {
    const identifier = this.retrieveVideoId(videoId);
    const lang = config?.lang ?? 'en';
    try {
      const transcriptUrl = await fetch(`https://www.youtube.com/watch?v=${identifier}`, {
        headers: {
          'User-Agent': USER_AGENT,
        },
      })
        .then((res) => res.text())
        .then((html) => parse(html))
        .then((html) => this.parseTranscriptEndpoint(html, lang));

      if (!transcriptUrl) throw new Error('Failed to locate a transcript for this video!');

      const transcriptXML = await fetch(transcriptUrl)
        .then((res) => res.text())
        .then((xml) => parse(xml));

      let transcript = '';
      const chunks = transcriptXML.getElementsByTagName('text');
      for (const chunk of chunks) {
        transcript += chunk.textContent + ' ';
      }

      return transcript.trim();
    } catch (e: any) {
      throw new YoutubeTranscriptError(e.message);
    }
  }

  private static parseTranscriptEndpoint(document: any, langCode: string | null = null) {
    try {
      const scripts = document.getElementsByTagName('script');
      const playerScript = scripts.find((script: any) =>
        script.textContent.includes('var ytInitialPlayerResponse = {')
      );

      const dataString = playerScript.textContent?.split('var ytInitialPlayerResponse = ')?.[1]?.split('};')?.[0] + '}';

      const data = JSON.parse(dataString.trim());
      const availableCaptions = data?.captions?.playerCaptionsTracklistRenderer?.captionTracks || [];

      let captionTrack = availableCaptions?.[0];
      if (langCode) {
        captionTrack =
          availableCaptions.find((track: any) => track.languageCode.includes(langCode)) ?? availableCaptions?.[0];
      }

      return captionTrack?.baseUrl;
    } catch (e: any) {
      console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`);
      return null;
    }
  }

  /**
   * Retrieve video id from url or string
   * @param videoId video url or video id
   */
  static retrieveVideoId(videoId: string) {
    const regex =
      /(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=|shorts\/)|youtu\.be\/)([^"&?\/\s]{11})/i;
    const matchId = videoId.match(regex);
    if (matchId && matchId.length) {
      return matchId[1];
    }
    throw new YoutubeTranscriptError('Impossible to retrieve Youtube video ID.');
  }
}
piktur commented 7 months ago

Thanks @AbbasPlusPlus. I have handled shorts URLs in #21, which also provides a test suite.

const RE_PATH = /v|e(?:mbed)?|shorts/;

// ...

export const getVideoId = (videoUrlOrId: string): string | null => {
  if (!videoUrlOrId) {
    return null
  }

  if (videoUrlOrId.length === ID_LENGTH) {
    return videoUrlOrId;
  }

  try {
    const url = new URL(videoUrlOrId);
    const segments = url.pathname.split('/');

    if (segments[1]?.length === ID_LENGTH) {
      return segments[1];
    }

    return (
      (RE_PATH.test(segments[1]) ? segments[2] : url.searchParams.get('v')) ||
      null
    );
  } catch (err) {
    return null;
  }
};
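
The excerpt above elides the definition of `ID_LENGTH`, and its `RE_PATH` is unanchored, so it also matches any path segment that merely contains a "v" or an "e" (for example "channel"). A self-contained sketch with an anchored pattern follows. The value 11 is an assumption taken from the `length === 11` check in the classes earlier in this thread, and all names here are hypothetical rather than the PR's actual code.

```typescript
// Assumption: video IDs are 11 characters long.
const ASSUMED_ID_LENGTH = 11;
// Anchored so that only whole path segments match ("embed", not "channel").
const RE_PATH_ANCHORED = /^(?:v|e(?:mbed)?|shorts)$/;

function getVideoIdSketch(videoUrlOrId: string): string | null {
  if (!videoUrlOrId) return null;
  // A bare ID is returned unchanged.
  if (videoUrlOrId.length === ASSUMED_ID_LENGTH) return videoUrlOrId;
  try {
    const url = new URL(videoUrlOrId);
    const segments = url.pathname.split("/");
    // Short links such as youtu.be/<id> carry the ID as the first segment.
    if (segments[1]?.length === ASSUMED_ID_LENGTH) return segments[1];
    const candidate = RE_PATH_ANCHORED.test(segments[1] ?? "")
      ? segments[2]
      : url.searchParams.get("v");
    return candidate || null;
  } catch {
    return null; // not a parseable URL
  }
}
```

Like the original, this does not validate the characters of the ID; it only locates the candidate.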
sinedied commented 7 months ago

@timothycarambat @sbbeez Thanks a LOT, you just saved my demo for a session I have this afternoon 🙏 ❤️

canadaduane commented 7 months ago

Thanks to all who've helped out. I wrapped the typescript up in a fork that implements the above as a package, with dist/ included so we could replace (for now) the existing package. https://github.com/SchoolAI/youtube-transcript

We include it as follows in our package.json:

    "youtube-transcript": "github:schoolai/youtube-transcript#6455ee21aab22e631f0c290df21b9e34e10adc4f",

(Note that this is not directly API compatible, as the function above changed the API)

sinedied commented 7 months ago

@canadaduane Could you include the changes from @sbbeez ? It makes code API compatible with the existing youtube-transcript package 🙂

piktur commented 7 months ago

Thanks @canadaduane I see you used the earlier PR, but I recommend merging the latter https://github.com/Kakulukian/youtube-transcript/pull/21 and rebuilding. It replaces cheerio with node-html-parser, implements broader support for all youtube URLs and provides a test suite.

@sinedied the public interface of the above linked PR preserves the former interface. YoutubeTranscript.retrieveVideoId has been promoted to public property.

SahilPulikal commented 7 months ago

For those facing this issue, the solution below worked for me. Originally posted by @sinedied in https://github.com/langchain-ai/langchainjs/issues/4994#issuecomment-2049952545:

For a temporary workaround until this is fixed upstream, you can:

npm i https://github.com/sinedied/youtube-transcript\#a10a073ac325b3b88018f321fa1bc5d62fa69b1c

This will use my fork, which uses a compatible drop-in code replacement from Kakulukian/youtube-transcript#19; all code credit goes to the folks there.

When the issue is fixed upstream, you can simply:

npm rm youtube-transcript

and it will return to the upstream version.

Thank you so much @sinedied .

Medullitus commented 6 months ago

Where is the fix?? I didn't understand anything!!

Medullitus commented 6 months ago

I've made a TS version of the class @timothycarambat created. I've also added support for youtube shorts in the regex:

Thanks, but how will we use this??

PaulBratslavsky commented 6 months ago

Weren't the changes merged already? So you should be able to use the library.

Gitmaxd commented 6 months ago

The error is still present. Just installed.

dcsan commented 6 months ago

you can install from this URL as someone documented above:

npm i https://github.com/sinedied/youtube-transcript\#a10a073ac325b3b88018f321fa1bc5d62fa69b1c

Medullitus commented 6 months ago

you can install from this URL as someone documented above:

npm i https://github.com/sinedied/youtube-transcript\#a10a073ac325b3b88018f321fa1bc5d62fa69b1c

How are we going to install from there? We are using the download section in Obsidian?

Medullitus commented 6 months ago

@Kakulukian Why are you closing unresolved topic threads? People are trying to solve issues in your application. Meanwhile, you're not even addressing them!

PaulBratslavsky commented 6 months ago

@Medullitus what issue are you still running into. I just use the library directly. It works in my current project.

Medullitus commented 6 months ago

@Medullitus what issue are you still running into. I just use the library directly. It works in my current project.

When I try to add a YT video URL and push the "Generate summary" button, it gives me an error: "Error: [YoutubeTranscript] TypeError: Cannot read properties of undefined (reading 'transcriptBodyRenderer')". So I can't use the plugin...

Medullitus commented 6 months ago

HELLOOOOO

dcsan commented 6 months ago

FYI, you seem to be confused here: there is no "button". This is an NPM library to use when writing your own code.

We are using the download section in Obsidian

this is not the repo for any obsidian plugin.

HELLOOOOO

Chill out a bit. Nobody is getting paid to solve your problem, and your questions are so wildly off base that it's clear you need to spend some time gathering a base level of information yourself and maybe find the right support channel for whatever tool you're using.

edit: my guess is there's some Obsidian plugin that uses this library (this repo) and they need to update their code to use the updated version of this library. Perhaps the error message shown led you mistakenly to come here. So you need to find the right support channel for that plugin and go and annoy them.

Medullitus commented 6 months ago

Hello. How are you? I'm very sorry, I came here from the link on the Youtube Summarizer's GitHub page. It's really an important plugin for Obsidian, but it's not working. If you understand these things, is it possible for you to take a look? What do I need to do? Thanks...

https://github.com/ozdemir08/youtube-video-summarizer/issues/14