Use the CC file instead of the transcript

reaper-sid commented 1 year ago

YouTube recently decided to merge multiple lines of the CC into each single line of the transcript. This makes youtube2Anki much less useful. I found that the CC file can be pulled as XML. You can find the links to the various CC files in the HTML of the video page, below a section that looks like `"captions":{"playerCaptionsTracklistRenderer":{"captionTracks":`.

After replacing \u0026 with &, the URLs look like this:

https://www.youtube.com/api/timedtext?v=[video_id]&caps=asr&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=[expire_code]&sparams=ip,ipbits,expire,v,caps,xoaf&signature=[signature_code]&key=yt8&lang=en
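A minimal sketch of the extraction described above, assuming the page HTML is already available as a string. The function name `extractCaptionTracks` and the bracket-matching approach are illustrative only, not how the extension does it, and the page structure is undocumented and may change.

```javascript
// Sketch: pull the caption track list out of a watch page's HTML.
// The "captionTracks" marker comes from the page source described above;
// the function name and the bracket-matching approach are hypothetical.
function extractCaptionTracks(html) {
  const marker = '"captionTracks":'
  const start = html.indexOf(marker)
  if (start === -1) return []

  // Walk forward from the opening "[" to its matching "]",
  // ignoring brackets that appear inside JSON strings.
  const open = html.indexOf('[', start + marker.length)
  if (open === -1) return []
  let depth = 0
  let inString = false
  for (let i = open; i < html.length; i++) {
    const ch = html[i]
    if (inString) {
      if (ch === '\\') i++ // skip the escaped character
      else if (ch === '"') inString = false
    } else if (ch === '"') {
      inString = true
    } else if (ch === '[') {
      depth++
    } else if (ch === ']') {
      depth--
      if (depth === 0) {
        // JSON.parse turns the \u0026 escapes back into "&".
        return JSON.parse(html.slice(open, i + 1))
      }
    }
  }
  return []
}
```

Parsing the array as JSON, rather than replacing `\u0026` by hand, decodes every escaped character in the URLs in one step.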

Would it be possible to rewrite to use the CC file from those links instead of the transcript for a more granular set of data and timing?

Originally posted by @tube-CC in https://github.com/dobladov/youtube2Anki/discussions/40

dobladov commented 1 year ago

> YouTube recently decided to merge multiple lines of the CC into each single line of the transcript. This makes youtube2Anki much less useful.

Can you provide an example of this? As far as I know, YouTube does not decide how people upload their subtitles. It may be that whoever created the CC wrote them with multiple lines at once.

> Would it be possible to rewrite to use the CC file from those links instead of the transcript for a more granular set of data and timing?

How can I get the signature_code? Can you put an example link that returns a valid XML?

For now, I would say this is a huge refactor of the code, and I would prefer not to do it: it looks like an undocumented API that we could lose access to at any moment, while the current solution of parsing the HTML can be adapted easily if something changes.

It would be interesting if we could implement the XML approach and keep the HTML parsing as a fallback, but I would like to see how much work this would be.

reaper-sid commented 1 year ago

A working link can be parsed from any YouTube watch page that has CC. I can't provide a working link here because the signature_code and the expire_code change each time the page is refreshed. It is more work for sure, but I have basically stopped using this extension because the transcripts are so bad now. Transcripts don't break at the end of sentences or even at a change of speaker. I don't know why YouTube made this change, but it has made my language study harder.

dobladov commented 1 year ago

Please share a link to one of the videos you mention that has no breaks at the end of sentences. I will try to get the signature_code and expire_code and compare them with what the extension obtains. Thanks.

reaper-sid commented 1 year ago

https://www.youtube.com/watch?v=V_v5Gcjgv3U If you click on the transcript at 4:29, for example, you can see what I mean. Two people are talking. The CC gives six lines:

<text start="269.76" dur="0.6">Dr. Bai.</text>
<text start="270.88" dur="0.68">Why are you still here?</text>
<text start="271.92" dur="0.6">Time for meeting.</text>
<text start="273" dur="0.88">Only with patience</text>
<text start="273.96" dur="0.72">can you enjoy some</text>
<text start="274.68" dur="1.16">good hand grounding coffee.</text>

The transcript gives one:

4:29 4:39 Dr. Bai. Why are you still here? Time for meeting. Only with patience can you enjoy some good hand grounding coffee. ... 269 280
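To make the difference in granularity concrete, here is a rough sketch that turns `<text>` elements like the six above into one cue object per line. The function name `parseTimedText` is hypothetical, and the regular expression only assumes the simple shape shown in this thread; inside the extension, the browser's `DOMParser` would likely be more robust, and HTML entities in the text are left undecoded here.

```javascript
// Sketch: convert timedtext XML into cue objects with start, dur and text.
// Assumes each caption is a flat <text start="..." dur="...">...</text> element.
function parseTimedText(xml) {
  const cues = []
  const pattern = /<text\b([^>]*)>([\s\S]*?)<\/text>/g
  for (const match of xml.matchAll(pattern)) {
    const attrs = match[1]
    // A missing attribute yields undefined, which Number() turns into NaN.
    const start = Number((/\bstart="([^"]*)"/.exec(attrs) || [])[1])
    const dur = Number((/\bdur="([^"]*)"/.exec(attrs) || [])[1])
    if (Number.isNaN(start)) continue // skip elements without a usable start time
    cues.push({
      start,
      dur: Number.isNaN(dur) ? 0 : dur,
      text: match[2].trim(),
    })
  }
  return cues
}
```

Run over the six lines quoted above, this gives six separately timed cues where the transcript gives a single entry.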

I might add that I use the timing to embed the videos directly into Anki. Not sure everybody is using it that way.
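As an illustration of why per-line timing matters for embedding, the sketch below builds a YouTube embed URL that plays only one cue. It assumes the standard `start` and `end` embed parameters, which take whole seconds; the function name `embedUrl` is hypothetical, and this is not necessarily how the poster embeds videos in Anki.

```javascript
// Sketch: build an embed URL limited to a single caption cue.
// start and dur are in seconds, as in the CC XML; the embed player
// expects whole seconds, so round outward to avoid clipping speech.
function embedUrl(videoId, start, dur) {
  const from = Math.floor(start)
  const to = Math.ceil(start + dur)
  return `https://www.youtube.com/embed/${encodeURIComponent(videoId)}?start=${from}&end=${to}`
}
```

With the merged transcript line, the same card would span roughly 269 to 280 seconds rather than a single sentence.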

dobladov commented 1 year ago

I see what you mean now. I guess they did it to avoid a long list in the transcript.

Your proposal makes a lot of sense. I will check how to implement it when I have some free time.

reaper-sid commented 1 year ago

Cool! Let me know if you need any assistance, for example with testing.

dobladov commented 1 year ago

I managed to get the data into the extension. The refactor for handling the data will take me a while, but it seems well worth it because users no longer have to open the transcript manually.

reaper-sid commented 1 year ago

Wow, I'm amazed that you were able to do that so quickly! Are you going to give the user the ability to select which CC language they want to use within the extension?

dobladov commented 1 year ago

> Are you going to give the user the ability to select which CC language they want to use within the extension?

Yes, I got all the caption links. The user will have to select a language; this way there's only one URL to request the XML from.
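A small sketch of that selection step, assuming each caption track entry carries `languageCode`, `baseUrl`, an optional `name.simpleText` label, and `kind: 'asr'` for auto-generated captions. These field names come from observing the undocumented page data and may change; the helper names are hypothetical.

```javascript
// Sketch: list the available caption languages and resolve one to its URL.
// Field names (languageCode, baseUrl, name.simpleText, kind) are assumptions
// about the undocumented captionTracks entries.
function listLanguages(tracks) {
  return tracks.map((track) => ({
    code: track.languageCode,
    label: (track.name && track.name.simpleText) || track.languageCode,
    auto: track.kind === 'asr', // auto-generated (speech recognition) captions
  }))
}

function urlForLanguage(tracks, code) {
  const track = tracks.find((t) => t.languageCode === code)
  return track ? track.baseUrl : null
}
```

The popup could render `listLanguages(tracks)` as the choice list and request only the URL returned by `urlForLanguage` for the selected code.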

dobladov commented 1 year ago

@tube-CC I was able to make a beta with the functionality you asked for. It takes the captions you mentioned from the script with the ytInitialData, but there's a big problem: since that data is not reloaded after navigating to another video, this information can only be loaded once at the beginning. If you have any idea how to get updated caption information, let me know so I can finish the feature and add it to the next release.

https://user-images.githubusercontent.com/1938043/213730280-60e6cbe0-a042-4cad-a8c4-6bc042501d10.mov

If you want to test it, you can use the package below. To install it, go to chrome://extensions/ -> Developer Mode -> Load unpacked -> the folder of the unpacked extension.

youtube2anki-1.4.0.zip

reaper-sid commented 1 year ago

Testing the beta using this playlist: https://www.youtube.com/playlist?list=PL6xVgUZ4UP2O_6Y4pRSVmVwT34NRJHJz0

1. When I visit the first episode, the extension shows a language-selection popup, which then takes me to the CC content. Clicking "Delete saved cards" takes me back to the language list.
2. If I then visit another episode and click the extension, I get the "Transcript not found" popup. If I have the transcript window open, it pulls that content into the extension rather than the CC content.
3. If I follow step 1, then visit another episode and click the browser's "Reload this page" button, the extension shows the language-selection popup again, which then takes me to the CC content. Reloading the page appears to reload the extension.

dobladov commented 1 year ago

That is the issue I have: when you select another episode, the initial data from which I take the captions is no longer valid, so I fall back to the previous system of getting the data from the view. I need a way to refresh this data. Thanks for checking.

reaper-sid commented 1 year ago

Can all of the pulling and parsing be done at the time the extension button is clicked rather than when the page/extension is loaded? I suppose it would seem slower to the end user, but at least it would cause the data to reload on click. Or could the extension reload on page load even if the previous page was on youtube.com?
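One possible reading of this suggestion, sketched under assumptions: on each click, compare the video id in the tab's URL with the id the cached caption data belongs to, and fetch fresh data only when they differ. `createTrackLoader` and the injected `fetchTracks` function are hypothetical; how fresh caption data would actually be obtained is exactly the open question in this thread.

```javascript
// Sketch: detect stale caption data after in-page navigation.
// fetchTracks is a hypothetical async function supplied by the caller
// that returns the caption tracks for a given video id.
function videoIdFromUrl(url) {
  return new URL(url).searchParams.get('v')
}

function isStale(cachedVideoId, tabUrl) {
  return cachedVideoId !== videoIdFromUrl(tabUrl)
}

function createTrackLoader(fetchTracks) {
  let cached = { videoId: null, tracks: [] }
  return async function loadTracks(tabUrl) {
    if (isStale(cached.videoId, tabUrl)) {
      const videoId = videoIdFromUrl(tabUrl)
      cached = { videoId, tracks: await fetchTracks(videoId) }
    }
    return cached.tracks
  }
}
```

The loader would be called with the active tab's URL each time the extension button is clicked, so the slower path only runs after the user has moved to a different video.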

reaper-sid commented 1 year ago

So, I got it to work. Starting on line 72 in popup.js:

`// Try to get the captions from the UI
const { subtitles } = await chrome.tabs.sendMessage(id, { type: 'getSubtitles', title, storageId })
if (subtitles) {
  mainState.subtitles = subtitles
  mainState.view = 'list'
  return
} else {
  // No subtitles came back, so force the whole extension to reload (a hack)
  chrome.runtime.reload()
}`

You can see I added the `else` branch, which forces the extension to reload. This is a hack and probably has side effects, but you get the idea I'm going for. You know your code and can probably come up with a better implementation.

dobladov commented 1 year ago

> Can all of the pulling and parsing be done at the time the extension button is clicked rather than when the page/extension is loaded? I suppose it would seem slower to the end user, but at least it would cause the data to reload on click.

This is what it does already. As you pointed out, either the extension reloads the page or I find a way to get the data that YouTube queries when another video is loaded. I got access to the yt object, but since it's an undocumented API I'm not sure how to get the captions from it.

I considered the reload, but I would like to keep it as a last resort solution.

dobladov / youtube2Anki

Use the CC file instead of the transcript #42