_extract_akamai_formats weird problem

nixxo commented 3 years ago

Checklist

[ ] I'm reporting a broken site support issue
[x] I've verified that I'm running youtube-dlc version 2020.10.31
[x] I've checked that all provided URLs are alive and playable in a browser
[x] I've checked that all URLs and arguments with special characters are properly quoted or escaped
[x] I've searched the bugtracker for similar bug reports including closed ones
[x] I've read bugs section in FAQ

Verbose log

Traceback (most recent call last):
  File "C:\Users\Utente\scoop\apps\python\current\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Utente\scoop\apps\python\current\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\__main__.py", line 19, in <module>
    youtube_dlc.main()
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\__init__.py", line 488, in main
    _real_main(argv)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\__init__.py", line 478, in _real_main
    retcode = ydl.download(all_urls)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 2130, in download
    res = self.extract_info(
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 841, in extract_info
    return self.__extract_info(url, ie, download, extra_info, process, info_dict)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 849, in wrapper
    return func(self, *args, **kwargs)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\YoutubeDL.py", line 870, in __extract_info
    ie_result = ie.extract(url)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\extractor\common.py", line 534, in extract
    ie_result = self._real_extract(url)
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\extractor\gedi.py", line 38, in _real_extract
    formats = self._extract_akamai_formats(
  File "C:\Users\Utente\Progetti\yt-dlc\youtube_dlc\extractor\common.py", line 2645, in _extract_akamai_formats
    http_url = re.sub(
  File "C:\Users\Utente\scoop\apps\python\current\lib\re.py", line 210, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Utente\scoop\apps\python\current\lib\re.py", line 327, in _subx
    template = _compile_repl(template, pattern)
  File "C:\Users\Utente\scoop\apps\python\current\lib\re.py", line 318, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "C:\Users\Utente\scoop\apps\python\current\lib\sre_parse.py", line 1036, in parse_template
    addgroup(int(this[1:]), len(this) - 1)
  File "C:\Users\Utente\scoop\apps\python\current\lib\sre_parse.py", line 980, in addgroup
    raise s.error("invalid group reference %d" % index, pos)
re.error: invalid group reference 11 at position 29

Description

I'm experimenting a bit with an extractor and I'm trying to use _extract_akamai_formats from common.py. It basically takes the HLS manifest URL and rebuilds the direct HTTP URL of the MP4 file.

But some m3u8 manifests seem to cause a problem in the re.sub call that I cannot understand.

The line that causes the problem is this one:

http_url = re.sub(REPL_REGEX, protocol + r'://%s/\1%s\3' % (http_host, qualities[i]), f['url'])

But if I reproduce each step of that line separately, the code runs without problems:

reg = re.search(REPL_REGEX, f['url'])
g1 = reg.group(1)
g3 = reg.group(3)
http_url = protocol + '://%s/%s%s%s' % (http_host, g1, qualities[i], g3)

Reading the traceback, it seems to me that the problem is in the regex library. Can somebody explain it to me?

pukkandan commented 3 years ago

Please give an example with the URL and the values of the variables http_host and qualities[i].

Without any additional info, my guess is that one of the variables contains a \ somewhere, which the regex engine is interpreting as a group reference.

nixxo commented 3 years ago

Sorry for the lack of info, here's more.

A manifest URL that works is

https://videodemand-vh.akamaihd.net/i/encoded/2020/11/22/1606032590423_uomo-ucciso-da-uno-squale-in-australia_,web_low,web_med,web_high,web_hd,.mp4.csmil/index_0_av.m3u8?null=0

and with REPL_REGEX it "extracts" the tuple

#0 tuple(3)
    [0] => str(72) "encoded/2020/11/22/1606032590423_uomo-ucciso-da-uno-squale-in-australia_"
    [1] => str(31) "web_low,web_med,web_high,web_hd"
    [2] => str(4) ".mp4"

generating the mp4 direct url

http://videoplatform.sky.it/encoded/2020/11/22/1606032590423_uomo-ucciso-da-uno-squale-in-australia_web_low.mp4

The manifest that causes the problem, instead, is

https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0

which yields the tuple

#0 tuple(3)
    [0] => str(54) "repubblicatv/file/2020/09/22/731397/731397-video-rrtv-"
    [1] => str(36) "650,200,400,1200,1800,2500,3500,4500"
    [2] => str(29) "-s200922_iacoboni_salvini.mp4"

and the resulting mp4 url is wrong: http://media.gedidigital.it/J00-s200922_iacoboni_salvini.mp4

But, like I said, if I do the same operation one step at a time, it works. Only the "condensed" form causes problems.

nixxo commented 3 years ago

OK, figured out the problem.

The replacement is r'://%s/\1%s\3' % (http_host, qualities[i]), but if qualities[i] is a number, that's a problem: the digits get attached to the \1, which becomes \1<number>, and that breaks the replacement template.
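For anyone who wants to see the ambiguity in isolation, here is a minimal sketch using only the standard re module. The pattern and input string are made up for illustration (they stand in for REPL_REGEX and f['url']), and build_template is a hypothetical helper that mirrors the shape of the replacement above.

```python
import re

# Hypothetical three-group pattern and input, standing in for
# REPL_REGEX and f['url'].
PATTERN = r'(a)(b)(c)'
TEXT = 'abc'


def build_template(quality):
    # Same shape as the replacement in the issue:
    # group 1, then the interpolated quality, then group 3.
    return r'\1%s\3' % quality


# A non-numeric quality is harmless: \1 is followed by a letter.
ok = re.sub(PATTERN, build_template('web_low'), TEXT)

# '1200' turns the template into \11200\3. Python reads \112 as an
# octal escape (chr(0o112) == 'J'), so group 1 is silently lost.
wrong = re.sub(PATTERN, build_template('1200'), TEXT)

# '1800' turns it into \11800\3. 8 is not an octal digit, so \11 is
# read as a reference to group 11, which the pattern does not have.
try:
    re.sub(PATTERN, build_template('1800'), TEXT)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(ok, wrong, error_message)
```

The 'J00' result appears to be the same effect as the wrong URL quoted earlier (index_3 corresponds to quality 1200), and the 1800 case appears to match the traceback. The step-by-step version works because the values returned by reg.group() are inserted with plain string formatting and never pass through the replacement-template parser.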

nixxo commented 3 years ago

ok, solution found: https://github.com/ytdl-org/youtube-dl/commit/193422e12a98ebcc49a215cf3667c7fce593f25c#commitcomment-44741426

Instead of \1, use \g<1>, which keeps the group number separate from any digits that follow it.
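A sketch of the fixed replacement applied to the problematic manifest from this thread. Note the assumptions: the REPL_REGEX below is a guess at a pattern that reproduces the tuples shown earlier (the real one in common.py may differ), and http_host and protocol are taken from the wrong URL quoted above. Whether the resulting URLs are actually served by the host is not checked here.

```python
import re

# Assumption: a pattern of this shape reproduces the tuples shown
# earlier in the thread; the real REPL_REGEX may differ.
REPL_REGEX = r'https?://[^/]+/i/([^,]+),([^/]+),([^/]+)\.csmil/.+'

manifest_url = (
    'https://gediusod-vh.akamaihd.net/i/repubblicatv/file/2020/09/22/731397/'
    '731397-video-rrtv-,650,200,400,1200,1800,2500,3500,4500,'
    '-s200922_iacoboni_salvini.mp4.csmil/index_3_av.m3u8?null=0'
)
protocol = 'http'
http_host = 'media.gedidigital.it'

# Group 2 holds the comma-separated list of qualities.
qualities = re.search(REPL_REGEX, manifest_url).group(2).split(',')

# \g<1> and \g<3> are explicitly delimited, so digits that follow
# them are always treated as literal text.
http_urls = [
    re.sub(
        REPL_REGEX,
        protocol + r'://%s/\g<1>%s\g<3>' % (http_host, quality),
        manifest_url)
    for quality in qualities
]

for url in http_urls:
    print(url)
```

With this form, numeric qualities such as 1200 or 1800 no longer merge with the group reference, and every entry keeps the full path from group 1.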
