Benjamin-Loison / yt-dlp

A youtube-dl fork with additional features and fixes
https://discord.gg/H5MNcFW63r
The Unlicense

Unable to download some video transcripts #1

Closed Benjamin-Loison closed 8 months ago

Benjamin-Loison commented 8 months ago

Provide a description that is worded well enough to be understood

https://stackoverflow.com/q/77716353 https://www.youtube.com/watch?v=1X7SZzJwNcU

https://superuser.com/a/927532 does not seem to solve this issue: downloading automatically generated captions does not work, and listing captions does not list any either. https://stackoverflow.com/a/70013529 does not seem to solve this issue either (tried both possibilities).

curl 'https://www.youtube.com/api/timedtext?v=1X7SZzJwNcU&ei=IhCLZYbqL8W5vdIP7KeYgAE&caps=asr&opi=112496729&xoaf=5&ip=0.0.0.0&ipbits=0&expire=1703637651&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=E8A613410F88CCC5BA375A3F864DB16A2CE24DC4.7B762771B28EDAA109B1467890F38780A437FE0A&key=yt8&kind=asr&lang=ar'

is the most minimized request that still returns the captions.

Here is an example for another video:

curl 'https://www.youtube.com/api/timedtext?v=BwOjb_ZJMUA&ei=vm-LZZ6uONCDp-oP95qzsAE&caps=asr&opi=112496729&xoaf=5&ip=0.0.0.0&ipbits=0&expire=1703662126&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=12E554F229E2B3EA797BA00EF0D6D086AD61E93D.7B4B0C7975EEB3D35DD4789888AEDA224A32EFB2&key=yt8&kind=asr&lang=ar'

Both up-to-date yt-dlp and youtube-dl suffer from this issue.

https://stackoverflow.com/a/69992807

Let us wait for the first curl request to expire.

After expiration, it does not return the expected output.

Here is the new working request:

curl 'https://www.youtube.com/api/timedtext?v=1X7SZzJwNcU&ei=1niLZZSrJZ64hcIP6I6Q-A4&caps=asr&opi=112496729&xoaf=5&ip=0.0.0.0&ipbits=0&expire=1703664454&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=D5D7A907D983AC0621F2858C4378B77BA367640F.C704BAD519ECACC0A745D8D4F76F203B32AEFE59&key=yt8&kind=asr&lang=ar'

ei, expire and signature differ. expire is about 7.5 hours later.
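
The comparison above can be reproduced with a short script. This is only a sketch using the Python standard library; the helper names query_params and differing_params are made up for illustration, and the two URLs are the expired and the new request quoted above.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url):
    """Return the query string of `url` as a flat {name: value} dict."""
    return {name: values[0] for name, values in parse_qs(urlsplit(url).query).items()}


def differing_params(url_a, url_b):
    """Return {name: (value_in_a, value_in_b)} for every parameter that differs."""
    a, b = query_params(url_a), query_params(url_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# First (now expired) request and the new working request from this issue.
OLD_URL = "https://www.youtube.com/api/timedtext?v=1X7SZzJwNcU&ei=IhCLZYbqL8W5vdIP7KeYgAE&caps=asr&opi=112496729&xoaf=5&ip=0.0.0.0&ipbits=0&expire=1703637651&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=E8A613410F88CCC5BA375A3F864DB16A2CE24DC4.7B762771B28EDAA109B1467890F38780A437FE0A&key=yt8&kind=asr&lang=ar"
NEW_URL = "https://www.youtube.com/api/timedtext?v=1X7SZzJwNcU&ei=1niLZZSrJZ64hcIP6I6Q-A4&caps=asr&opi=112496729&xoaf=5&ip=0.0.0.0&ipbits=0&expire=1703664454&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=D5D7A907D983AC0621F2858C4378B77BA367640F.C704BAD519ECACC0A745D8D4F76F203B32AEFE59&key=yt8&kind=asr&lang=ar"

diff = differing_params(OLD_URL, NEW_URL)
print("Differing parameters:", ", ".join(diff))

# `expire` looks like a Unix timestamp in seconds, so the gap can be computed directly.
old_expire, new_expire = (int(value) for value in diff["expire"])
print("expire gap in hours:", round((new_expire - old_expire) / 3600, 2))
```

Since expire and ei are listed in sparams, they are presumably covered by signature, which would explain why an expired request cannot simply be patched with a later expire value.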

Provide verbose output that clearly demonstrates the problem

Complete Verbose Output

No response

kylefoley76 commented 8 months ago

Thanks for your help. I posted the question over at Stack Overflow yesterday under a different name. In this post, here however that API now has a bug listed here. I then used your Python code listed here. It returns a result without an error but I'm assuming that I'm supposed to then use that info to construct a curl command but that step is not obvious to me. The curl commands listed above such as

curl 'https://www.youtube.com/api/timedtext?v=BwOjb_ZJMUA&ei=vm-LZZ6uONCDp-oP95qzsAE&caps=asr&opi=112496729&xoaf=5&ip=0.0.0.0&ipbits=0&expire=1703662126&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=12E554F229E2B3EA797BA00EF0D6D086AD61E93D.7B4B0C7975EEB3D35DD4789888AEDA224A32EFB2&key=yt8&kind=asr&lang=ar'

does work for me, but I cannot figure out how to reconstruct my own curl command to get another video's transcript.

Benjamin-Loison commented 8 months ago

@kylefoley76

Could you clarify what you mean by "It" in:

It returns a result without an error but I'm assuming that I'm supposed to then use that info to construct a curl command but that step is not obvious to me.

?

Note that, as I mentioned in this Stack Overflow answer, I also:

cannot figure out how to reconstruct my own curl command to get another video's transcript

Benjamin-Loison commented 8 months ago

If you want to download transcripts by proceeding manually for each video, I recommend getting your own cURL request from the Network tab of your browser's developer tools: open https://www.youtube.com/watch?v=VIDEO_ID and look for the XHR timedtext request.

Note that you can minimize the request using minimizeCURL.py.
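
To illustrate what minimizing a request means here, below is a minimal sketch of a greedy approach: drop one query parameter at a time and keep the removal only if the request still works. This is not taken from minimizeCURL.py, whose internals are not described in this thread; minimize_query and the still_works callback are hypothetical names. In real use, still_works would send the request and check that the captions are still returned; here it is a stand-in so the sketch runs offline.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def minimize_query(url, still_works):
    """Greedily remove query parameters while `still_works(url)` stays true.

    `still_works` is a caller-supplied predicate (hypothetical): it receives a
    candidate URL and returns True if that URL still yields the wanted result.
    """
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    index = 0
    while index < len(params):
        # Try the URL without the parameter at `index`.
        candidate = params[:index] + params[index + 1:]
        candidate_url = urlunsplit(parts._replace(query=urlencode(candidate)))
        if still_works(candidate_url):
            params = candidate  # removal is safe, retry at the same index
        else:
            index += 1  # parameter is required, move on
    return urlunsplit(parts._replace(query=urlencode(params)))


def fake_still_works(url):
    """Offline stand-in: pretend only `v` and `lang` are required."""
    names = {name for name, _ in parse_qsl(urlsplit(url).query)}
    return {"v", "lang"} <= names


minimized = minimize_query(
    "https://example.com/api?v=abc&foo=1&lang=ar&bar=2", fake_still_works
)
print(minimized)
```

For the signed timedtext URLs above, few parameters can be expected to be removable, since most of them are listed in sparams.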

kylefoley76 commented 8 months ago

The youtube-transcript-api bug has been fixed, so that basically did what I wanted. But thanks for your help anyway.

Benjamin-Loison commented 8 months ago

Indeed, thank you for letting me know.

Personal note:

_extract_captions_json is the key function, as it extracts:

{
    "captions": {
        "playerCaptionsTracklistRenderer": {
            "captionTracks": [
                {
                    "baseUrl": "https://www.youtube.com/api/timedtext?v=1X7SZzJwNcU&ei=6peNZbmrL8H4xN8Py7yc2AY&caps=asr&opi=112496729&xoaf=5&hl=fr&ip=0.0.0.0&ipbits=0&expire=1703803482&sparams=ip,ipbits,expire,v,ei,caps,opi,xoaf&signature=2C3A2E1474B1DEC7A7E839AA96F6B5DE8DA85F40.DB5FA10429DF9DA9C0DB407588B46911D764245F&key=yt8&kind=asr&lang=ar",
                    "name": {
                        "simpleText": "Arabe (g\u00e9n\u00e9r\u00e9s automatiquement)"
                    },
                    "vssId": "a.ar",
                    "languageCode": "ar",
                    "kind": "asr",
                    "rtl": true,
                    "isTranslatable": true,
                    "trackName": ""
                }
            ],
            "audioTracks": [
                {
                    "captionTrackIndices": [
                        0
                    ]
                }
            ],
            "defaultAudioTrackIndex": 0
        }
    }
}
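
Given a structure like the one above, the signed caption URLs can be read straight out of captionTracks. The following is only a sketch of that lookup, not the code of _extract_captions_json itself; list_caption_tracks is a hypothetical helper, and the sample dict is a trimmed copy of the JSON above.

```python
from urllib.parse import urlsplit, parse_qs


def list_caption_tracks(player_data):
    """Return one {language, auto_generated, url} dict per caption track."""
    renderer = player_data.get("captions", {}).get(
        "playerCaptionsTracklistRenderer", {}
    )
    tracks = []
    for track in renderer.get("captionTracks", []):
        tracks.append(
            {
                "language": track["languageCode"],
                # In the JSON above, kind == "asr" marks automatically generated captions.
                "auto_generated": track.get("kind") == "asr",
                "url": track["baseUrl"],
            }
        )
    return tracks


# Trimmed copy of the structure shown above.
sample = {
    "captions": {
        "playerCaptionsTracklistRenderer": {
            "captionTracks": [
                {
                    "baseUrl": "https://www.youtube.com/api/timedtext?v=1X7SZzJwNcU&ei=6peNZbmrL8H4xN8Py7yc2AY&caps=asr&opi=112496729&xoaf=5&hl=fr&ip=0.0.0.0&ipbits=0&expire=1703803482&sparams=ip,ipbits,expire,v,ei,caps,opi,xoaf&signature=2C3A2E1474B1DEC7A7E839AA96F6B5DE8DA85F40.DB5FA10429DF9DA9C0DB407588B46911D764245F&key=yt8&kind=asr&lang=ar",
                    "vssId": "a.ar",
                    "languageCode": "ar",
                    "kind": "asr",
                }
            ]
        }
    }
}

tracks = list_caption_tracks(sample)
for track in tracks:
    expire = parse_qs(urlsplit(track["url"]).query)["expire"][0]
    print(track["language"], track["auto_generated"], "expires at", expire)
```

The baseUrl already carries ei, expire and signature, so nothing has to be reconstructed by hand: it only has to be fetched before it expires.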

I probably missed this signature because the search progress in the Network tab of the Firefox developer tools is unclear.