benwiley4000 / youtube-vtt

▶️ Extract and save WebVTT closed caption tracks from YouTube videos
MIT License

ytplayer? (issues with live stream) #1

Open vgoklani opened 4 years ago

vgoklani commented 4 years ago

Hey there, thanks for releasing this!

I followed your instructions but got this error:

VM26770:4 Uncaught ReferenceError: ytplayer is not defined

Where does this get defined?

benwiley4000 commented 4 years ago

@vgoklani in which context are you running the script? You should be on the page of a YouTube video (e.g. https://www.youtube.com/watch?v=OraxqbUjpHw) and if you navigated to that video from another page, I would refresh the page to make sure the JavaScript globals refer to the video you chose (apparently the old globals stick around otherwise).

Next you should open the developer tools JavaScript console and from there I would just follow the instructions in the readme... paste in the script, run the function to save the file(s), and it should just work. I just tried it. Let me know if anything else seems unclear. And let me know if you think anything in the readme should be updated.
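
A quick way to confirm from the console that the global is present before pasting the script. This is only a sketch: `hasPlayerResponse` is a hypothetical helper name, and the property path `ytplayer.config.args.player_response` is taken from the live-stream function posted later in this thread; whether YouTube exposes it on a given page is not guaranteed.

```javascript
// Returns true if the given global object exposes the player response
// string that the script reads. Pass `window` in the browser console.
function hasPlayerResponse(root) {
  var player = root && root.ytplayer;
  return Boolean(
    player &&
      player.config &&
      player.config.args &&
      typeof player.config.args.player_response === 'string'
  );
}

// Usage in the console of a YouTube watch page:
// hasPlayerResponse(window)
```

If this returns false, refreshing the video page as described above is the first thing to try.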

Thanks!

vgoklani commented 4 years ago

Thanks for the response @benwiley4000 !

I wasn't able to get it to work for this video since it's a live stream.

https://www.youtube.com/watch?v=dp8PhLsUcFE

By any chance, do you know how to get the callback that gets called every time the captions get updated? I'm trying to process the real-time stream. Thanks!

benwiley4000 commented 4 years ago

Oh, I have no clue. If it updates the same info that we use to grab the captions I would try a loop using requestAnimationFrame or a setInterval. Otherwise I'm not sure off the top of my head. If you learn anything let me know! I'd love to improve this script to support all types of YouTube videos/streams.
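
A minimal sketch of that polling idea, assuming the caption data can be read synchronously by some function. `readCaptions` is a hypothetical placeholder (nothing in this thread confirms where YouTube keeps live caption state), and `createChangeDetector` is an illustrative name; only the change-detection loop is the point here.

```javascript
// Returns a poll() function that calls onChange(next, previous) whenever
// read() returns a value whose JSON serialization differs from last time.
// `read` is a hypothetical accessor supplied by the caller.
function createChangeDetector(read, onChange) {
  var lastSerialized;
  var lastValue;
  return function poll() {
    var value = read();
    var serialized = JSON.stringify(value);
    if (serialized !== lastSerialized) {
      var previous = lastValue;
      lastSerialized = serialized;
      lastValue = value;
      onChange(value, previous);
    }
  };
}

// Usage in a browser console (readCaptions is a placeholder):
// var poll = createChangeDetector(readCaptions, console.log);
// var timer = setInterval(poll, 500); // clearInterval(timer) to stop
```

Comparing serialized values keeps the callback quiet while nothing changes, at the cost of serializing the data on every tick.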

dirxiang commented 4 years ago

Hi,

Thank you for sharing this! It worked for me to download a file for the English captions. But is it possible to extract a file for another language? I was hoping to get a file with the captions in Chinese, which are auto-translated by YouTube. Is that possible for auto-translated captions? Thanks!

benwiley4000 commented 4 years ago

@dirxiang I believe that should work. Could you share the video URL?

dirxiang commented 4 years ago

The link is: https://www.youtube.com/watch?v=D4g8MmICJ8g&ab_channel=BCPSMagnetPrograms
Thanks!

benwiley4000 commented 4 years ago

@dirxiang ah, that's a new feature I don't think was available before. The script doesn't support this currently, but I just tested and it can be added. I'll open a new issue for this.

dirxiang commented 4 years ago

Wow, thanks so much!!!

benwiley4000 commented 4 years ago

Done (see other thread #2 ).

benwiley4000 commented 3 years ago

@vgoklani I finally found a solution for consuming captions from YouTube live streams as they become available. For now the API is quite different from the VTT download script for completed videos, but perhaps it can be adapted into the same tool.

Here is the usage. I'll paste the function below.

// starts consuming captions beginning now
var callback = console.log;
handleCaptionsStream(callback);

// starts consuming captions at a point in the past up until the present,
// and continues consuming captions as they become available
var callback = console.log;
var date = new Date();
// start 3 hours ago (YouTube seems to allow you to request
// up to a bit more than 7 days in the past if needed)
date.setHours(date.getHours() - 3);
handleCaptionsStream(callback, date);

Your callback will be triggered with the following data:

[Screenshot: the callback data logged to the console]

A few things to note about the response:

Here's the function to be pasted into the JS console:

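// Streams captions from a YouTube live stream and invokes `callback` with
// each parsed segment. `startTimestamp` (optional) is a Date or a Unix time
// in milliseconds; if omitted, consumption starts at the current segment.
function handleCaptionsStream(callback, startTimestamp) {
  // Relies on the `ytplayer` global of a YouTube watch page.
  var playerResponse = JSON.parse(ytplayer.config.args.player_response);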
  var captionsUrl = playerResponse.streamingData.adaptiveFormats.find(function (
    format
  ) {
    return format.mimeType.indexOf('text/') === 0;
  }).url;
  var domParser = new window.DOMParser();

  fetchCaptions().then(function (primaryInfo) {
    var beginningTimestamp =
      Date.now() - primaryInfo.streamProperties['Stream-Duration-Us'] / 1000;
    var startSequenceNumber = startTimestamp
      ? Math.round(
          ((startTimestamp - beginningTimestamp) * 1000) /
            primaryInfo.streamProperties['Target-Duration-Us']
        )
      : primaryInfo.streamProperties['Sequence-Number'];
    return fetchCaptionsUntilEnd(startSequenceNumber);
    function fetchCaptionsUntilEnd(sequenceNumber) {
      // Expected start time (Unix ms) of the segment being requested.
      var timestamp =
        beginningTimestamp +
        (primaryInfo.streamProperties['Target-Duration-Us'] * sequenceNumber) /
          1000;
      return (timestamp > Date.now()
        ? waitUntil(timestamp)
        : Promise.resolve()
      ).then(function () {
        return fetchCaptions(sequenceNumber).then(function (info) {
          callback(info);
          if (info.streamProperties['Stream-Finished'] === 'F') {
            return fetchCaptionsUntilEnd(
              info.streamProperties['Sequence-Number'] + 1
            );
          }
        });
      });
    }
  });

  function fetchCaptions(sequenceNumber) {
    return fetchTextUntilContentReturned(
      captionsUrl +
        (sequenceNumber === undefined ? '' : '&sq=' + sequenceNumber)
    ).then(function (text) {
      var streamPropertiesContent = text.slice(
        text.indexOf('Sequence-Number:'),
        text.indexOf('\r\n\r\n')
      );
      var streamProperties = {};
      streamPropertiesContent.split('\n').forEach(function (line) {
        var lineParts = line.trim().split(': ');
        var key = lineParts[0];
        var value = lineParts[1];
        streamProperties[key] = isNaN(value) ? value : Number(value);
      });
      var xmlIndex = text.indexOf('<?xml ');
      var xmlContent = xmlIndex !== -1 ? text.slice(xmlIndex) : null;
      var xmlTree =
        xmlContent && domParser.parseFromString(xmlContent, 'text/xml');
      var unixTimestampRelative =
        (streamProperties['Sequence-Number'] *
          streamProperties['Target-Duration-Us']) /
        1000;
      var captions =
        xmlTree &&
        Array.prototype.map
          .call(xmlTree.querySelectorAll('p'), function (p) {
            var textContent = p.textContent;
            if (textContent.trim()) {
              var t = Number(p.getAttribute('t'));
              var d = Number(p.getAttribute('d'));
              var start = t + unixTimestampRelative;
              var end = start + d;
              return {
                text: textContent,
                start: start,
                end: end
              };
            }
          })
          .filter(Boolean);
      var webVttContent =
        captions &&
        captions
          .map(function (caption) {
            return (
              formatTime(caption.start / 1000) +
              ' --> ' +
              formatTime(caption.end / 1000) +
              '\n' +
              caption.text +
              '\n'
            );
          })
          .concat('')
          .join('\n');
      return {
        streamProperties: streamProperties,
        xmlContent: xmlContent,
        xmlTree: xmlTree,
        unixTimestampRelative: unixTimestampRelative,
        captions: captions,
        webVttContent: webVttContent
      };
    });
  }

  // The endpoint sometimes returns an empty body, so retry until it returns
  // content. Note: retries happen immediately, with no delay or retry limit.
  function fetchTextUntilContentReturned(url) {
    return fetch(url)
      .then(function (res) {
        return res.text();
      })
      .then(function (text) {
        return text || fetchTextUntilContentReturned(url);
      });
  }

  function waitUntil(unixTime) {
    return new Promise(function (resolve) {
      setTimeout(resolve, Math.max(0, unixTime - Date.now()));
    });
  }

  function pad2(number) {
    // thanks https://www.electrictoolbox.com/pad-number-two-digits-javascript/
    return (number < 10 ? '0' : '') + number;
  }

  function pad3(number) {
    return number >= 100 ? number : '0' + pad2(number);
  }

  // time: seconds
  function formatTime(time) {
    var hours = 0;
    var minutes = 0;
    var seconds = 0;
    var milliseconds = 0;
    while (time >= 60 * 60) {
      hours++;
      time -= 60 * 60;
    }
    while (time >= 60) {
      minutes++;
      time -= 60;
    }
    while (time >= 1) {
      seconds++;
      time -= 1;
    }
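    // Round to whole milliseconds, capped at 999 so that a remainder such
    // as 0.9996 s cannot produce a four-digit "1000".
    milliseconds = Math.min(999, Math.round(time * 1000));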
    return (
      pad2(hours) +
      ':' +
      pad2(minutes) +
      ':' +
      pad2(seconds) +
      '.' +
      pad3(milliseconds)
    );
  }
}
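
The start-point arithmetic inside the function can be checked in isolation. The sketch below repeats the same calculation with made-up numbers: the 5-second segment length and the stream that began 10 hours earlier are assumptions for illustration, not values observed from YouTube, and `sequenceNumberFor` is a hypothetical helper name.

```javascript
// Maps a wall-clock time to a segment sequence number.
// Timestamps are in milliseconds; targetDurationUs is in microseconds.
function sequenceNumberFor(startTimestamp, beginningTimestamp, targetDurationUs) {
  var elapsedUs = (startTimestamp - beginningTimestamp) * 1000;
  return Math.round(elapsedUs / targetDurationUs);
}

var now = Date.UTC(2020, 9, 7, 12, 0, 0);
var beginning = now - 10 * 60 * 60 * 1000; // stream began 10 hours earlier
var threeHoursAgo = now - 3 * 60 * 60 * 1000;
var targetDurationUs = 5000000; // assumed 5-second segments

var seq = sequenceNumberFor(threeHoursAgo, beginning, targetDurationUs);
// 7 hours elapsed = 25,200 s; 25,200 s / 5 s per segment = 5040
console.log(seq);
```
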
benwiley4000 commented 3 years ago

Also note that the script on master should work for you if you just need captions for a live stream that has already completed.

benwiley4000 commented 3 years ago

I just updated the script above to include a webVttContent property which looks like this:

[Screenshot: example webVttContent output]
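
One possible way to turn the per-segment `webVttContent` strings into a single file is to collect them in a callback and prepend the `WEBVTT` header line that the format requires. This is a sketch rather than part of the script; `createVttCollector` is a hypothetical name, and it assumes each chunk already ends with a blank line, as the function above produces.

```javascript
// Collects webVttContent chunks from successive callback invocations and
// produces one WebVTT document.
function createVttCollector() {
  var chunks = [];
  return {
    // Pass this as the callback to handleCaptionsStream.
    handle: function (info) {
      // Segments without captions have a null webVttContent; skip them.
      if (info && info.webVttContent) {
        chunks.push(info.webVttContent);
      }
    },
    // Returns the full document collected so far.
    toString: function () {
      return 'WEBVTT\n\n' + chunks.join('');
    }
  };
}

// Usage in the browser console:
// var collector = createVttCollector();
// handleCaptionsStream(collector.handle);
// later: new Blob([collector.toString()], { type: 'text/vtt' })
```
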
benwiley4000 commented 3 years ago

Hm, it seems some of the math might be wrong now that I look at the timestamps in the VTT content. I can try to give this another look later.

benwiley4000 commented 3 years ago

The script now produces correct timestamps in the WebVTT content. There may still be a problem with overlapping captions (maybe they should be combined when their time ranges overlap), but the output should work with the HTML video element.

[Screenshot: WebVTT output with corrected timestamps]
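
If combining overlapping captions turns out to be desirable, one possible approach is sketched below. It is not part of the script: `mergeOverlappingCaptions` is a hypothetical helper, it assumes caption objects of the shape `{ text, start, end }` with times in milliseconds (as the function above returns), and joining the texts with a space is an arbitrary choice.

```javascript
// Merges captions whose time ranges overlap into a single caption.
// The input array is not modified.
function mergeOverlappingCaptions(captions) {
  var sorted = captions.slice().sort(function (a, b) {
    return a.start - b.start;
  });
  var merged = [];
  sorted.forEach(function (caption) {
    var last = merged[merged.length - 1];
    if (last && caption.start < last.end) {
      // Overlap: extend the previous caption and append the text.
      last.end = Math.max(last.end, caption.end);
      last.text = last.text + ' ' + caption.text;
    } else {
      merged.push({
        text: caption.text,
        start: caption.start,
        end: caption.end
      });
    }
  });
  return merged;
}
```

Captions that merely touch (one ends exactly when the next starts) are kept separate, since they do not overlap.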