Closed Ozanaydinn closed 7 months ago
Hello, I'm having the same problem, @Ozanaydinn.
I haven't checked in too much detail what is happening but I think it has something to do with YouTube itself.
Maybe this line of code is no longer getting the correct API.
It looks like the API must have changed. There is no longer a .body property in actions[0].updateEngagementPanelAction.content.transcriptRenderer.
In other words this line is failing:
const transcripts =
body.actions[0].updateEngagementPanelAction.content
.transcriptRenderer.body.transcriptBodyRenderer.cueGroups
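For anyone who wants to see the failure mode without hitting the network, here is a minimal sketch. It assumes a response object that simply lacks `.body`, as described above; the `PanelResponse` type and the `getCueGroups` helper are hypothetical names for illustration, not part of the library.

```typescript
// Hypothetical, partial shape of the response: only the fields named in this thread.
type PanelResponse = {
  actions?: Array<{
    updateEngagementPanelAction?: {
      content?: {
        transcriptRenderer?: {
          body?: { transcriptBodyRenderer?: { cueGroups?: unknown[] } };
        };
      };
    };
  }>;
};

// Returns the cue groups, or null when any link in the chain is missing,
// instead of throwing "Cannot read properties of undefined".
function getCueGroups(response: PanelResponse): unknown[] | null {
  return (
    response.actions?.[0]?.updateEngagementPanelAction?.content
      ?.transcriptRenderer?.body?.transcriptBodyRenderer?.cueGroups ?? null
  );
}

// A response without `.body`, mimicking the changed payload.
const changed: PanelResponse = {
  actions: [{ updateEngagementPanelAction: { content: { transcriptRenderer: {} } } }],
};

console.log(getCueGroups(changed)); // null
```

Optional chaining only turns the crash into a clean "no transcript" result; it does not bring the data back.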
Does anyone know more about the internal /timedtext API? It seems to provide the transcript data, but it is behind a signature field.
Yeah, looks like the endpoint was killed - RIP. This script emulates a few of the library's steps, but instead of getting INNERTUBE it takes the signed URL for that session and uses it to fetch /timedtext.
It is probably very flaky, and it requires an HTML parser.
import { parse } from "node-html-parser";
const PAGE = await fetch("https://www.youtube.com/watch?v=bZQun8Y4L2A")
.then((res) => res.text())
.then((html) => parse(html));
const scripts = PAGE.getElementsByTagName("script");
const playerScript = scripts.find((script) =>
script.textContent.includes("var ytInitialPlayerResponse = {"),
);
const dataString = playerScript.textContent
?.split("var ytInitialPlayerResponse = ")?.[1]
?.slice(0, -1);
const data = JSON.parse(dataString.trim());
const captionsUrl =
data.captions.playerCaptionsTracklistRenderer.captionTracks[0].baseUrl;
const resXML = await fetch(captionsUrl)
.then((res) => res.text())
.then((xml) => parse(xml));
let transcript = ""; // start from an empty string, otherwise the output begins with "undefined"
const chunks = resXML.getElementsByTagName("text");
for (const chunk of chunks) {
transcript += chunk.textContent;
}
console.log(transcript); // :)
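One fragile spot in scripts like this is cutting the JSON out of the script tag with `slice` or `split`. Below is a rough sketch of an alternative that scans for the matching closing brace while skipping over string contents. `extractJsonObject` is a hypothetical helper written for this comment, and the sample string is made up; it is not taken from a real YouTube page.

```typescript
// Extract the JSON object literal that follows `marker` in `source`,
// scanning for the matching closing brace and ignoring braces inside strings.
function extractJsonObject(source: string, marker: string): string | null {
  const markerIndex = source.indexOf(marker);
  if (markerIndex === -1) return null;
  const start = source.indexOf("{", markerIndex + marker.length);
  if (start === -1) return null;

  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < source.length; i++) {
    const ch = source[i];
    if (inString) {
      if (escaped) escaped = false; // character after a backslash is literal
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return source.slice(start, i + 1);
    }
  }
  return null; // braces never balanced
}

// Made-up sample: note the "};" inside a string value.
const sample =
  'var ytInitialPlayerResponse = {"a":{"b":"x};y"},"c":[1,2]};var other = 1;';
const json = extractJsonObject(sample, "var ytInitialPlayerResponse = ");
console.log(json ? JSON.parse(json).a.b : null); // x};y
```

Splitting the same sample on `};` would cut the object in the middle of the string value and `JSON.parse` would throw, which is the case this scanner is meant to survive.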
Appreciate the quick fix!
If anybody is looking at making a class for this, here is my currently working example for a quick patch. No promises it will keep working :)
const { parse } = require("node-html-parser");
const RE_YOUTUBE =
/(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=)|youtu\.be\/)([^"&?\/\s]{11})/i;
const USER_AGENT =
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)";
class YoutubeTranscriptError extends Error {
constructor(message) {
super(`[YoutubeTranscript] ${message}`);
}
}
/**
* Class to retrieve transcript if exist
*/
class YoutubeTranscript {
/**
* Fetch transcript from YTB Video
* @param videoId Video url or video identifier
* @param config Object with lang param (eg: en, es, hk, uk) format.
* Will just grab the first caption if it can find one, so no special lang caption support.
*/
static async fetchTranscript(videoId, config = {}) {
const identifier = this.retrieveVideoId(videoId);
const lang = config?.lang ?? "en";
try {
const transcriptUrl = await fetch(
`https://www.youtube.com/watch?v=${identifier}`,
{
headers: {
"User-Agent": USER_AGENT,
},
}
)
.then((res) => res.text())
.then((html) => parse(html))
.then((html) => this.#parseTranscriptEndpoint(html, lang));
if (!transcriptUrl)
throw new Error("Failed to locate a transcript for this video!");
// Result is hopefully some XML.
const transcriptXML = await fetch(transcriptUrl)
.then((res) => res.text())
.then((xml) => parse(xml));
let transcript = "";
const chunks = transcriptXML.getElementsByTagName("text");
for (const chunk of chunks) {
transcript += chunk.textContent;
}
return transcript;
} catch (e) {
throw new YoutubeTranscriptError(e);
}
}
static #parseTranscriptEndpoint(document, langCode = null) {
try {
// Get all script tags on document page
const scripts = document.getElementsByTagName("script");
// find the player data script.
const playerScript = scripts.find((script) =>
script.textContent.includes("var ytInitialPlayerResponse = {")
);
const dataString =
playerScript.textContent
?.split("var ytInitialPlayerResponse = ")?.[1] //get the start of the object {....
?.split("};")?.[0] + // chunk off any code after object closure.
"}"; // add back that curly brace we just cut.
const data = JSON.parse(dataString.trim()); // Attempt a JSON parse
const availableCaptions =
data?.captions?.playerCaptionsTracklistRenderer?.captionTracks || [];
// If languageCode was specified then search for its code, otherwise get the first.
let captionTrack = availableCaptions?.[0];
if (langCode)
captionTrack =
availableCaptions.find((track) =>
track.languageCode.includes(langCode)
) ?? availableCaptions?.[0];
return captionTrack?.baseUrl;
} catch (e) {
console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`);
return null;
}
}
/**
* Retrieve video id from url or string
* @param videoId video url or video id
*/
static retrieveVideoId(videoId) {
if (videoId.length === 11) {
return videoId;
}
const matchId = videoId.match(RE_YOUTUBE);
if (matchId && matchId.length) {
return matchId[1];
}
throw new YoutubeTranscriptError(
"Impossible to retrieve Youtube video ID."
);
}
}
module.exports = {
YoutubeTranscript,
YoutubeTranscriptError,
};
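A note on the language lookup: `includes` is a substring match, so if a regional track happens to come first in the list it wins over an exact match. Here is a small sketch of a stricter selection order (exact code, then regional variant, then first track). The `CaptionTrack` type and `pickCaptionTrack` are hypothetical names, and the sample tracks and URLs are invented for the example.

```typescript
// Only the two fields the class above actually reads.
type CaptionTrack = { languageCode: string; baseUrl: string };

// Prefer an exact language match, then a regional variant such as "en-GB",
// and fall back to the first available track like the class above does.
function pickCaptionTrack(
  tracks: CaptionTrack[],
  lang?: string
): CaptionTrack | undefined {
  if (!lang) return tracks[0];
  const wanted = lang.toLowerCase();
  const exact = tracks.find((t) => t.languageCode.toLowerCase() === wanted);
  const regional = tracks.find((t) =>
    t.languageCode.toLowerCase().startsWith(wanted + "-")
  );
  return exact ?? regional ?? tracks[0];
}

// Invented sample data; real baseUrl values are signed /timedtext URLs.
const tracks: CaptionTrack[] = [
  { languageCode: "en-GB", baseUrl: "https://example.invalid/en-GB" },
  { languageCode: "en", baseUrl: "https://example.invalid/en" },
  { languageCode: "es", baseUrl: "https://example.invalid/es" },
];

console.log(pickCaptionTrack(tracks, "en")?.languageCode); // en
```

With `includes`, the same call would return the `en-GB` track because it appears first.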
My code relied entirely on the TranscriptionResponse schema from the package, leading to the entire API breaking down.
I made a slight adjustment to @timothycarambat 's code (which, by the way, functions flawlessly 🫡) to ensure we maintain the same signature:
// Use the following snippet at the end of `fetchTranscript`: declare `const transcriptions = [];` before the loop and return it afterwards.
for (const chunk of chunks) {
const [offset, duration] = chunk.rawAttrs.split(" ");
const convertToMs = (text: string) =>
parseFloat(text.split("=")[1].replace(/"/g, "")) * 1000;
transcriptions.push({
text: chunk.text,
offset: convertToMs(offset),
duration: convertToMs(duration),
});
}
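Splitting `rawAttrs` on a space works as long as the two attributes always arrive in the same order with nothing else on the element. As a more defensive sketch, the attributes can be looked up by name. This assumes the XML uses `start` and `dur` attributes in seconds, which is what the snippet above implies but is not documented anywhere in this thread; `readMsAttr` is a hypothetical helper.

```typescript
// Read a numeric attribute (in seconds) from a raw attribute string and
// return it in whole milliseconds, or null if it is absent or not a number.
function readMsAttr(rawAttrs: string, name: string): number | null {
  const match = rawAttrs.match(new RegExp(`(?:^|\\s)${name}="([^"]*)"`));
  if (!match) return null;
  const seconds = parseFloat(match[1] ?? "");
  return Number.isNaN(seconds) ? null : Math.round(seconds * 1000);
}

// Attribute order no longer matters.
console.log(readMsAttr('dur="2.25" start="1.5"', "start")); // 1500
console.log(readMsAttr('dur="2.25" start="1.5"', "dur")); // 2250
```

Inside the loop this would replace the destructuring with `offset: readMsAttr(chunk.rawAttrs, "start")` and `duration: readMsAttr(chunk.rawAttrs, "dur")`, with whatever null handling suits the caller.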
is there a new published version?
if anyone wants a typescript version, here it is slightly cleaned up compared to above
// https://github.com/Kakulukian/youtube-transcript/issues/19
// If anybody is looking at making a class for this. Here is my currently working example for a quick patch. No promises it will keep working :)
import { parse } from "node-html-parser"
const RE_YOUTUBE =
/(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=)|youtu\.be\/)([^"&?\/\s]{11})/i
const USER_AGENT =
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)"
class YoutubeTranscriptError extends Error {
constructor(message: string) {
super(`[YoutubeTranscript] ${message}`)
}
}
type YtFetchConfig = {
lang?: string // Object with lang param (eg: en, es, hk, uk) format.
}
/**
* Class to retrieve transcript if exist
*/
class YoutubeGrabTool {
/**
* Fetch transcript from YTB Video
* @param videoId Video url or video identifier
* @param config Object with lang param (eg: en, es, hk, uk) format.
* Will just grab the first caption if it can find one, so no special lang caption support.
*/
static async fetchTranscript(videoId: string, config: YtFetchConfig = {}) {
const identifier = this.retrieveVideoId(videoId)
const lang = config?.lang ?? "en"
try {
const transcriptUrl = await fetch(
`https://www.youtube.com/watch?v=${identifier}`,
{
headers: {
"User-Agent": USER_AGENT,
},
}
)
.then((res) => res.text())
.then((html) => parse(html))
.then((html) => this.#parseTranscriptEndpoint(html, lang))
if (!transcriptUrl)
throw new Error("Failed to locate a transcript for this video!")
// Result is hopefully some XML.
const transcriptXML = await fetch(transcriptUrl)
.then((res) => res.text())
.then((xml) => parse(xml))
const chunks = transcriptXML.getElementsByTagName("text")
function convertToMs(text: string) {
const float = parseFloat(text.split("=")[1].replace(/"/g, "")) * 1000
return Math.round(float)
}
let transcriptions = []
for (const chunk of chunks) {
const [offset, duration] = chunk.rawAttrs.split(" ")
transcriptions.push({
text: chunk.text,
offset: convertToMs(offset),
duration: convertToMs(duration),
})
}
return transcriptions
} catch (e: any) {
throw new YoutubeTranscriptError(e)
}
}
static #parseTranscriptEndpoint(document: any, langCode?: string) {
try {
// Get all script tags on document page
const scripts = document.getElementsByTagName("script")
// find the player data script.
const playerScript = scripts.find((script: any) =>
script.textContent.includes("var ytInitialPlayerResponse = {")
)
const dataString =
playerScript.textContent
?.split("var ytInitialPlayerResponse = ")?.[1] //get the start of the object {....
?.split("};")?.[0] + // chunk off any code after object closure.
"}" // add back that curly brace we just cut.
const data = JSON.parse(dataString.trim()) // Attempt a JSON parse
const availableCaptions =
data?.captions?.playerCaptionsTracklistRenderer?.captionTracks || []
// If languageCode was specified then search for its code, otherwise get the first.
let captionTrack = availableCaptions?.[0]
if (langCode)
captionTrack =
availableCaptions.find((track: any) =>
track.languageCode.includes(langCode)
) ?? availableCaptions?.[0]
return captionTrack?.baseUrl
} catch (e: any) {
console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`)
return null
}
}
/**
* Retrieve video id from url or string
* @param videoId video url or video id
*/
static retrieveVideoId(videoId: string) {
if (videoId.length === 11) {
return videoId
}
const matchId = videoId.match(RE_YOUTUBE)
if (matchId && matchId.length) {
return matchId[1]
}
throw new YoutubeTranscriptError("Impossible to retrieve Youtube video ID.")
}
}
export { YoutubeGrabTool, YoutubeTranscriptError }
used like
const transcriptChunks = await YoutubeGrabTool.fetchTranscript(videoUrl)
changed the name so I can uninstall the other one.
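In case it helps anyone consuming the chunk array, here is a small sketch of turning it back into plain text or timestamped lines. It assumes each chunk has the `{ text, offset, duration }` shape returned above, with times in milliseconds; `TranscriptChunk`, `formatTimestamp` and `toPlainText` are hypothetical names, and the sample chunks are invented.

```typescript
// Shape of one entry as returned by the fetchTranscript version above.
type TranscriptChunk = { text: string; offset: number; duration: number };

// Format a millisecond offset as m:ss.
function formatTimestamp(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${seconds < 10 ? "0" + seconds : String(seconds)}`;
}

// Join all non-empty chunk texts into a single string.
function toPlainText(chunks: TranscriptChunk[]): string {
  return chunks
    .map((chunk) => chunk.text.trim())
    .filter((text) => text.length > 0)
    .join(" ");
}

// Invented sample data.
const sampleChunks: TranscriptChunk[] = [
  { text: " hello ", offset: 0, duration: 1200 },
  { text: "", offset: 1200, duration: 300 },
  { text: "world", offset: 65500, duration: 900 },
];

console.log(toPlainText(sampleChunks)); // hello world
console.log(formatTimestamp(65500)); // 1:05
```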
Hope we get a fix version asap. Thanks for your contributions!
I've made a TS version of the class @timothycarambat created. I've also added support for youtube shorts in the regex:
import { parse } from 'node-html-parser';
const USER_AGENT =
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)';
export class YoutubeTranscriptError extends Error {
constructor(message: string) {
super(`[YoutubeTranscript] ${message}`);
}
}
export class YoutubeTranscript {
/**
* Fetch transcript from YouTube Video
* @param videoId Video url or video identifier
* @param config Object with lang param (eg: en, es, hk, uk) format.
* Will just grab the first caption if it can find one, so no special lang caption support.
*/
static async fetchTranscript(videoId: string, config: { lang?: string } = {}) {
const identifier = this.retrieveVideoId(videoId);
const lang = config?.lang ?? 'en';
try {
const transcriptUrl = await fetch(`https://www.youtube.com/watch?v=${identifier}`, {
headers: {
'User-Agent': USER_AGENT,
},
})
.then((res) => res.text())
.then((html) => parse(html))
.then((html) => this.parseTranscriptEndpoint(html, lang));
if (!transcriptUrl) throw new Error('Failed to locate a transcript for this video!');
const transcriptXML = await fetch(transcriptUrl)
.then((res) => res.text())
.then((xml) => parse(xml));
let transcript = '';
const chunks = transcriptXML.getElementsByTagName('text');
for (const chunk of chunks) {
transcript += chunk.textContent + ' ';
}
return transcript.trim();
} catch (e) {
throw new YoutubeTranscriptError(e.message);
}
}
private static parseTranscriptEndpoint(document: any, langCode: string | null = null) {
try {
const scripts = document.getElementsByTagName('script');
const playerScript = scripts.find((script: any) =>
script.textContent.includes('var ytInitialPlayerResponse = {')
);
const dataString = playerScript.textContent?.split('var ytInitialPlayerResponse = ')?.[1]?.split('};')?.[0] + '}';
const data = JSON.parse(dataString.trim());
const availableCaptions = data?.captions?.playerCaptionsTracklistRenderer?.captionTracks || [];
let captionTrack = availableCaptions?.[0];
if (langCode) {
captionTrack =
availableCaptions.find((track: any) => track.languageCode.includes(langCode)) ?? availableCaptions?.[0];
}
return captionTrack?.baseUrl;
} catch (e) {
console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`);
return null;
}
}
/**
* Retrieve video id from url or string
* @param videoId video url or video id
*/
static retrieveVideoId(videoId: string) {
// A bare 11-character string is treated as a video id, as the doc comment promises.
if (videoId.length === 11) {
return videoId;
}
const regex =
/(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=|shorts\/)|youtu\.be\/)([^"&?\/\s]{11})/i;
const matchId = videoId.match(regex);
if (matchId && matchId.length) {
return matchId[1];
}
throw new YoutubeTranscriptError('Impossible to retrieve Youtube video ID.');
}
}
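For a quick check of which URL shapes the shorts-aware regex accepts, here is a standalone sketch. The regex is copied from the class above; `extractVideoId` is a hypothetical wrapper that returns null instead of throwing, and the video id is the one used earlier in this thread.

```typescript
// Regex copied from the class above (with the added `shorts\/` alternative).
const RE_YOUTUBE =
  /(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=|shorts\/)|youtu\.be\/)([^"&?\/\s]{11})/i;

// Return the 11-character id captured by the regex, or null when nothing matches.
function extractVideoId(input: string): string | null {
  const match = input.match(RE_YOUTUBE);
  return match ? (match[1] ?? null) : null;
}

const urls = [
  "https://www.youtube.com/watch?v=bZQun8Y4L2A",
  "https://www.youtube.com/watch?v=bZQun8Y4L2A&t=10s",
  "https://youtu.be/bZQun8Y4L2A",
  "https://www.youtube.com/embed/bZQun8Y4L2A",
  "https://www.youtube.com/shorts/bZQun8Y4L2A",
];

for (const url of urls) {
  console.log(url, "->", extractVideoId(url));
}
```

Note that the regex on its own does not match a bare id with no URL around it, so a caller that accepts plain ids needs a separate check for that.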
Thanks @AbbasPlusPlus. I have handled shorts URLs in #21, which also provides a test suite.
const RE_PATH = /v|e(?:mbed)?|shorts/;
// ...
export const getVideoId = (videoUrlOrId: string): string | null => {
if (!videoUrlOrId) {
return null
}
if (videoUrlOrId.length === ID_LENGTH) {
return videoUrlOrId;
}
try {
const url = new URL(videoUrlOrId);
const segments = url.pathname.split('/');
if (segments[1]?.length === ID_LENGTH) {
return segments[1];
}
return (
(RE_PATH.test(segments[1]) ? segments[2] : url.searchParams.get('v')) ||
null
);
} catch (err) {
return null;
}
};
@timothycarambat @sbbeez Thanks a LOT, you just saved my demo for a session I have this afternoon 🙏 ❤️
Thanks to all who've helped out. I wrapped the typescript up in a fork that implements the above as a package, with dist/ included so we could replace (for now) the existing package. https://github.com/SchoolAI/youtube-transcript
We include it as follows in our package.json:
"youtube-transcript": "github:schoolai/youtube-transcript#6455ee21aab22e631f0c290df21b9e34e10adc4f",
(Note that this is not directly API compatible, as the function above changed the API)
@canadaduane Could you include the changes from @sbbeez ? It makes code API compatible with the existing youtube-transcript package 🙂
Thanks @canadaduane I see you used the earlier PR, but I recommend merging the latter https://github.com/Kakulukian/youtube-transcript/pull/21 and rebuilding. It replaces cheerio with node-html-parser, implements broader support for all youtube URLs and provides a test suite.
@sinedied the public interface of the above linked PR preserves the former interface. YoutubeTranscript.retrieveVideoId has been promoted to a public property.
For those who are facing this issue, you can refer below solution: Originally posted by @sinedied in https://github.com/langchain-ai/langchainjs/issues/4994#issuecomment-2049952545 it worked for me.
For a temporary workaround until this is fixed upstream, you can:
npm i https://github.com/sinedied/youtube-transcript\#a10a073ac325b3b88018f321fa1bc5d62fa69b1c
This will use my fork, which uses a compatible drop-in code replacement from Kakulukian/youtube-transcript#19; all code credit goes to the folks there.
When the issue is fixed upstream, you can simply:
npm rm youtube-transcript
and it will return to the upstream version.
Thank you so much @sinedied .
where is fix?? i didn't understand anything!!
THX But How will we use this ??
Weren't the changes merged already? So you should be able to use the library.
The error is still present. Just installed.
you can install from this URL as someone documented above:
npm i https://github.com/sinedied/youtube-transcript\#a10a073ac325b3b88018f321fa1bc5d62fa69b1c
How are we going to install from there? We are using the download section in Obsidian?
@Kakulukian Why are you closing unresolved topic threads? People are trying to solve issues in your application. Meanwhile, you're not even addressing them!
@Medullitus what issue are you still running into. I just use the library directly. It works in my current project.
When I try to add a YT video URL and push the "Generate summary" button, it gives me an error: "Error: [YoutubeTranscript] TypeError: Cannot read properties of undefined (reading 'transcriptBodyRenderer')". So I can't use the plugin...
HELLOOOOO
FYI you seem to be confused here there is no "button" this is an NPM library to use to write your own code.
We are using the download section in Obsidian
this is not the repo for any obsidian plugin.
HELLOOOOO
chill out a bit, nobody is getting paid to solve your problem and your questions are so widely off base that it's clear you need to spend some time to gather a base level of information yourself and maybe find the right support channel for whatever tool you're using.
edit: my guess is there's some obsidian plugin that uses this library (this repo) and they need to update their code to use the updated version of this library. Perhaps the error message shown led you to come here by mistake. So you need to find the right support channel for that plugin and go and annoy them.
Hello. How are you? I'm very sorry, I came here from the link on the Youtube Summarizer's GitHub page. It's really an important plugin for Obsidian, but it's not working. If you understand these things, is it possible for you to take a look? What do I need to do? Thanks...
https://github.com/ozdemir08/youtube-video-summarizer/issues/14
Hello, I'm not sure if this project is still getting maintenance but still wanted to create an issue for this!
We are using v1.1.0 in our project and this morning suddenly we started getting this error :
I also tried to run the example code in a brand new project but still the same error, so I guess that eliminates any errors that might have happened on our side. Any help regarding the issue would be greatly appreciated!