Benjamin-Loison / YouTube-operational-API

YouTube operational API works when YouTube Data API v3 fails.

Adding `executable = '/usr/bin/bash'` to `subprocess.check_output` may be necessary to support `curl` `--data-raw $'...'` #293

Open Benjamin-Loison opened 3 months ago

Benjamin-Loison commented 3 months ago

https://github.com/Benjamin-Loison/YouTube-operational-API/blob/d61488fbe0becf6d2a6ebc97761e5b87a8facd3f/tools/minimizeCURL.py#L43

See the Unix Stack Exchange answer 115614.

http://wiki.bash-hackers.org/syntax/quoting is a mostly empty page (even when looking at its source, and switching to https:// does not help). The ANSI-C quoting syntax is documented in the GNU Bash manual: https://www.gnu.org/software/bash/manual/html_node/ANSI_002dC-Quoting.html
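As a sanity check of the proposed fix, here is a minimal sketch (not repository code) comparing the default `/bin/sh` interpretation with an explicit bash, assuming `/usr/bin/bash` exists on the system:

```python
import subprocess

# `$'a\nb'` is ANSI-C quoting: bash turns `\n` into a real newline, while a
# strictly POSIX `/bin/sh` (such as dash) keeps a literal `$` and backslash.
COMMAND = "printf '[%s]' $'a\\nb'"

print(subprocess.check_output(COMMAND, shell = True).decode())
print(subprocess.check_output(COMMAND, shell = True, executable = '/usr/bin/bash').decode())
```

On a system where `/bin/sh` is dash, the first call is expected to print `[$a\nb]` literally, while the second prints a real line break between `[a` and `b]`.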

Benjamin-Loison commented 3 months ago

It seems that the algorithm currently rebuilds the command as `--data-raw '$...'`, which is an issue in the context of Benjamin_Loison/OneDrive/issues/6.
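A minimal reproduction of this rebuilding behaviour with `shlex` (which the minimizer uses to split and join the command), assuming a plain posix-mode round trip:

```python
import shlex

# The posix-mode split drops the ANSI-C quotes and keeps a literal `$`, so
# re-joining quotes the whole token, which moves the `$` inside the quotes.
arguments = shlex.split("curl --data-raw $'a\\r\\nb'")
print(arguments)              # ['curl', '--data-raw', '$a\\r\\nb']
print(shlex.join(arguments))  # curl --data-raw '$a\r\nb'
```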

Benjamin-Loison commented 3 months ago

Also, the code before blob/main/tools/minimizeCURL.py#L151-L228 has to be considered, as it rebuilds the command incorrectly.

Benjamin-Loison commented 3 months ago

A shameful fix, probably introducing a security flaw, is:

command = command.replace(" --data-raw '$", " --data-raw $'")

But where should it go? I guess in `isCommandStillFine`.
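A hedged sketch of one possible placement: a hypothetical `fixAnsiCQuoting` helper applied right before execution, so the stored command keeps whatever quoting the minimizer produced and the workaround lives in a single place:

```python
import subprocess

def fixAnsiCQuoting(command):
    # Hypothetical helper: undo the minimizer having moved the `$` inside the quotes.
    return command.replace(" --data-raw '$", " --data-raw $'")

def executeCommand(command):
    # Same `subprocess.check_output` call as in `minimizeCURL.py`, plus the
    # `executable` argument discussed in this issue.
    return subprocess.check_output(fixAnsiCQuoting(command), shell = True,
                                   stderr = subprocess.DEVNULL, executable = '/usr/bin/bash')
```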

Benjamin-Loison commented 2 months ago

Related to #171.

Benjamin-Loison commented 2 months ago

Alternatively, it seems that `--data-raw $'...'` can be replaced with `--data-raw "..."`.

Maybe plain `'...'` does not interpret some characters like `\r` and `\n`. It seems that `"..."` does not either, even when escaping them and `"`.

Even escaping `\` does not make `"..."` work for:

URL:

```
-----BEGIN PGP MESSAGE-----
hF4DTQa9Wom5MBgSAQdA1OEw/bqOs0qI8sDf/mCyaHXumnfef2o9xpB9zMvZKHIw
adXxpGpfchCvld/9+2gr9w+T2mvKcv3IRt6sJEOPSC4lsnxIDxKXEByRu5jBn+FP
0qQBWg3M5tUM1m1LT7G8SW+x7nG5Rl0ksfRfzUoQXY/MShLuoOTheSR3Nw33217Y
FOVIbAybZ8uY5dPJVka+aOZ0LNSw4i6QVEn5rmbju3qANxE5LTw0146HzGjaaVCz
89RkG7i3Fum9FfYw/AaBPYSekj8RvDXJP4lmXnQcudKtp8pGIvfvstytxLHvU7Gd
84qi3eiVWyqoiw1oy7ghg4/+3n5Tag==
=H5Cv
-----END PGP MESSAGE-----
```

So, exceptionally, I minimized it by hand.
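For what it is worth, a small illustration (not repository code) of why the `"..."` replacement fails: inside bash double quotes, `\r` and `\n` are not escape sequences, so the body keeps literal backslashes instead of the CR LF bytes the multipart payload needs:

```python
import subprocess

for quoted in ("$'a\\r\\nb'", '"a\\r\\nb"'):
    # `od -c` shows whether bash produced real CR LF bytes or literal backslashes.
    output = subprocess.check_output(f'printf %s {quoted} | od -c', shell = True,
                                     executable = '/usr/bin/bash')
    print(quoted, '->', output.decode())
```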

Benjamin-Loison commented 2 months ago
```bash
echo $'-----------------------------XXXXXXXXXXXXXXXXXXXXXXXXXXXX\r\nContent-Disposition: form-data; name="no_individu"\r\n\r\nXXXXXX\r\n-----------------------------XXXXXXXXXXXXXXXXXXXXXXXXXXXX\r\nContent-Disposition: form-data; name="acti"\r\n\r\nXX\r\n-----------------------------XXXXXXXXXXXXXXXXXXXXXXXXXXXX--\r\n'
```

Output:

```
-----------------------------XXXXXXXXXXXXXXXXXXXXXXXXXXXX
Content-Disposition: form-data; name="no_individu"

XXXXXX
-----------------------------XXXXXXXXXXXXXXXXXXXXXXXXXXXX
Content-Disposition: form-data; name="acti"

XX
-----------------------------XXXXXXXXXXXXXXXXXXXXXXXXXXXX--
```

I am unable to reproduce these newlines without `$`.

Otherwise, maybe we could rely on a command-line converter.

The Unix Stack Exchange answer 48122 may help, as well as its comments.

Benjamin-Loison commented 2 months ago

Related to Webscrap_any_website/issues/29.

Benjamin-Loison commented 2 months ago
```python
import shlex

command = "curl --data-raw $'\''"
# command = "curl --data-raw $'a'"
# works fine

shlex.split(command)
```

ValueError: No closing quotation:

```
Traceback (most recent call last):
  File "", line 6, in
    shlex.split(command)
  File "/usr/lib/python3.12/shlex.py", line 313, in split
    return list(lex)
  File "/usr/lib/python3.12/shlex.py", line 300, in __next__
    token = self.get_token()
  File "/usr/lib/python3.12/shlex.py", line 109, in get_token
    raw = self.read_token()
  File "/usr/lib/python3.12/shlex.py", line 191, in read_token
    raise ValueError("No closing quotation")
ValueError: No closing quotation
```
Benjamin-Loison commented 2 months ago
help(shlex.split)
Output:

```
Help on function split in module shlex:

split(s, comments=False, posix=True)
    Split the string *s* using shell-like syntax.
```
Benjamin-Loison commented 2 months ago

https://docs.python.org/3.12/library/shlex.html#shlex.split

Benjamin-Loison commented 2 months ago

Removing `$` does not help.

```python
import shlex

command = "curl --data-raw $'\''"

print(shlex.split(command, posix = False))
```

Output:

```
['curl', '--data-raw', "$'''"]
```
Benjamin-Loison commented 2 months ago
help(shlex.join)
Output:

```
Help on function join in module shlex:

join(split_command)
    Return a shell-escaped string from *split_command*.
```

https://docs.python.org/3.12/library/shlex.html#shlex.join

Benjamin-Loison commented 2 months ago
```python
import shlex

command = "curl --data-raw $'\''"

commandSplitted = shlex.split(command, posix = False)
print(shlex.join(commandSplitted))
```

Output:

```
curl --data-raw '$'"'"''"'"''"'"''
```
Benjamin-Loison commented 2 months ago
print(' '.join(commandSplitted))
curl --data-raw $'''
Benjamin-Loison commented 2 months ago
command = "curl --data-raw $'\''"
print(command)
curl --data-raw $'''
Benjamin-Loison commented 2 months ago
```python
import shlex

command = "curl --data-raw $'\\''"
# Equivalent to above `command`.
with open('curl.sh') as f:
    command = f.read()
print(command)

commandSplitted = shlex.split(command, posix = False)
print(shlex.join(commandSplitted))
print(' '.join(commandSplitted))
```

Output:

```
curl --data-raw $'\''
curl --data-raw '$'"'"'\'"'"''"'"''
curl --data-raw $'\''
```
Benjamin-Loison commented 2 months ago

Using `' '.join(...)` seems to require managing the quoting of arguments on our own.
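A sketch of what managing the quoting ourselves could look like with the `posix = False` split: untouched arguments keep their original quoting (including `$'...'`), and only arguments whose content was rebuilt get re-quoted. The URL and the helper name are illustrative:

```python
import shlex

def requoteModifiedArgument(value):
    # An argument whose content we rebuilt (URL, cookies, raw data...) has lost
    # its original quoting, so quote it ourselves before re-joining.
    return shlex.quote(value)

arguments = shlex.split("curl 'https://example.com/?a=1&b=2' --data-raw $'x\\r\\ny'", posix = False)
arguments[1] = requoteModifiedArgument('https://example.com/?a=1')  # e.g. after removing `b=2`
print(' '.join(arguments))  # curl 'https://example.com/?a=1' --data-raw $'x\r\ny'
```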

Benjamin-Loison commented 2 months ago
Diff:

```diff
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 2b5a721..65dac18 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -31,7 +31,7 @@ wantedOutput = sys.argv[2].encode('utf-8')
 removeHeaders = True
 removeUrlParameters = True
 removeCookies = True
-removeRawData = True
+removeRawData = False
 
 # Pay attention to provide a command giving plaintext output, so might required to remove `Accept-Encoding` HTTPS header.
 with open(curlCommandFilePath) as f:
@@ -40,7 +40,7 @@ with open(curlCommandFilePath) as f:
 def executeCommand(command):
     # `stderr = subprocess.DEVNULL` is used to get rid of curl progress.
     # Could also add `-s` curl argument.
-    result = subprocess.check_output(command, shell = True, stderr = subprocess.DEVNULL)
+    result = subprocess.check_output(command, shell = True, stderr = subprocess.DEVNULL, executable = '/usr/bin/bash')
     return result
 
 def isCommandStillFine(command):
@@ -62,41 +62,50 @@ if not isCommandStillFine(command):
     print('The wanted output isn\'t contained in the result of the original curl command!')
     exit(1)
 
+def splitCommand(command):
+    return shlex.split(command, posix = False)
+
+def joinSplittedCommand(spittedCommand):
+    return ' '.join(spittedCommand)
+    return shlex.join(spittedCommand)
+
 if removeHeaders:
     print('Removing headers')
     # Should try to minimize the number of requests done, by testing half of parameters at each request.
     while True:
         changedSomething = False
-        arguments = shlex.split(command)
+        arguments = splitCommand(command)
         for argumentsIndex in range(len(arguments) - 1):
             argument, nextArgument = arguments[argumentsIndex : argumentsIndex + 2]
             if argument == '-H':
                 previousCommand = command
                 del arguments[argumentsIndex : argumentsIndex + 2]
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
                 if isCommandStillFine(command):
                     printThatCommandIsStillFine(command)
                     changedSomething = True
                     break
                 else:
                     command = previousCommand
-                    arguments = shlex.split(command)
+                    arguments = splitCommand(command)
         if not changedSomething:
             break
 
 if removeUrlParameters:
     print('Removing URL parameters')
-    arguments = shlex.split(command)
+    arguments = splitCommand(command)
     for argumentsIndex, argument in enumerate(arguments):
-        if argument.startswith('http'):
+        if argument.startswith("'http"):
             urlIndex = argumentsIndex
+            #arguments[urlIndex] = arguments[urlIndex][1:-1]
             break
     url = arguments[urlIndex]
     while True:
         changedSomething = False
+        url = url[1:-1]
         urlParsed = urlparse(url)
         query = parse_qs(urlParsed.query)
         for key in list(query):
@@ -104,8 +113,8 @@ if removeUrlParameters:
             del query[key]
             # Make a function with below code.
             url = urlParsed._replace(query = '&'.join([f'{quote_plus(parameter)}={quote_plus(query[parameter][0])}' for parameter in query])).geturl()
-            arguments[urlIndex] = url
-            command = shlex.join(arguments)
+            arguments[urlIndex] = shlex.quote(url)
+            command = joinSplittedCommand(arguments)
             if isCommandStillFine(command):
                 printThatCommandIsStillFine(command)
                 changedSomething = True
@@ -113,8 +122,8 @@ if removeUrlParameters:
             else:
                 query = previousQuery
                 url = urlParsed._replace(query = '&'.join([f'{quote_plus(parameter)}={quote_plus(query[parameter][0])}' for parameter in query])).geturl()
-                arguments[urlIndex] = url
-                command = shlex.join(arguments)
+                arguments[urlIndex] = shlex.quote(url)
+                command = joinSplittedCommand(arguments)
         if not changedSomething:
             break
 
@@ -125,7 +134,7 @@ if removeCookies:
     COOKIES_PREFIX_LEN = len(COOKIES_PREFIX)
 
     cookiesIndex = None
-    arguments = shlex.split(command)
+    arguments = splitCommand(command)
     for argumentsIndex, argument in enumerate(arguments):
         # For Chromium support:
         if argument[:COOKIES_PREFIX_LEN].title() == COOKIES_PREFIX:
@@ -142,7 +151,7 @@ if removeCookies:
             cookiesParsedCopy = cookiesParsed[:]
             del cookiesParsedCopy[cookiesParsedIndex]
             arguments[cookiesIndex] = COOKIES_PREFIX + '; '.join(cookiesParsedCopy)
-            command = shlex.join(arguments)
+            command = joinSplittedCommand(arguments)
             if isCommandStillFine(command):
                 printThatCommandIsStillFine(command)
                 changedSomething = True
@@ -150,7 +159,7 @@ if removeCookies:
                 break
             else:
                 arguments[cookiesIndex] = COOKIES_PREFIX + '; '.join(cookiesParsed)
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
         if not changedSomething:
             break
 
@@ -159,7 +168,7 @@ if removeRawData:
 
     rawDataIndex = None
     isJson = False
-    arguments = shlex.split(command)
+    arguments = splitCommand(command)
     for argumentsIndex, argument in enumerate(arguments):
         if argumentsIndex > 0 and arguments[argumentsIndex - 1] == '--data-raw':
             rawDataIndex = argumentsIndex
@@ -182,7 +191,7 @@ if removeRawData:
                 rawDataPartsCopy = copy.deepcopy(rawDataParts)
                 del rawDataPartsCopy[rawDataPartsIndex]
                 arguments[rawDataIndex] = '&'.join(rawDataPartsCopy)
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
                 if isCommandStillFine(command):
                     printThatCommandIsStillFine(command)
                     changedSomething = True
@@ -190,7 +199,7 @@ if removeRawData:
                     break
                 else:
                     arguments[rawDataIndex] = '&'.join(rawDataParts)
-                    command = shlex.join(arguments)
+                    command = joinSplittedCommand(arguments)
             if not changedSomething:
                 break
     # JSON recursive case.
@@ -229,7 +238,7 @@ if removeRawData:
                 del entry[lastPathPart]
             # Test if the removed entry was necessary.
             arguments[rawDataIndex] = json.dumps(rawDataParsedCopy)
-            command = shlex.join(arguments)
+            command = joinSplittedCommand(arguments)
             # (1) If it was unnecessary, then reconsider paths excluding possible children paths of this unnecessary entry, ensuring optimized complexity it seems.
             if isCommandStillFine(command):
                 printThatCommandIsStillFine(command)
@@ -239,7 +248,7 @@ if removeRawData:
             # If it was necessary, we consider possible children paths of this necessary entry and other paths.
             else:
                 arguments[rawDataIndex] = json.dumps(rawDataParsed)
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
         # If a loop iteration considering all paths, does not change anything, then the request cannot be minimized further.
         if not changedSomething:
             break
```
Benjamin-Loison commented 2 months ago

For a big file, I modified the above diff with:

Output:

```diff
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 2b5a721..6e281ea 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -31,7 +31,7 @@ wantedOutput = sys.argv[2].encode('utf-8')
 removeHeaders = True
 removeUrlParameters = True
 removeCookies = True
-removeRawData = True
+removeRawData = False
 
 # Pay attention to provide a command giving plaintext output, so might required to remove `Accept-Encoding` HTTPS header.
 with open(curlCommandFilePath) as f:
@@ -40,7 +40,12 @@ with open(curlCommandFilePath) as f:
 def executeCommand(command):
     # `stderr = subprocess.DEVNULL` is used to get rid of curl progress.
     # Could also add `-s` curl argument.
-    result = subprocess.check_output(command, shell = True, stderr = subprocess.DEVNULL)
+    INTERMEDIARY_CURL_FILE_PATH = 'intermediary_curl.sh'
+    with open(INTERMEDIARY_CURL_FILE_PATH, 'w') as f:
+        f.write(command)
+    result = subprocess.check_output(f'bash {INTERMEDIARY_CURL_FILE_PATH}', shell = True, stderr = subprocess.DEVNULL)
+    #print(result)
+    #exit(1)
     return result
 
 def isCommandStillFine(command):
@@ -56,6 +61,13 @@ def printThatCommandIsStillFine(command):
 # For Chromium support:
 command = command.replace(' \\\n ', '')
 
+def splitCommand(command):
+    return shlex.split(command, posix = False)
+
+def joinSplittedCommand(spittedCommand):
+    return ' '.join(spittedCommand)
+    return shlex.join(spittedCommand)
+
 print(f'Initial command length: {getCommandLengthFormatted(command)}.')
 # To verify that the user provided the correct `wantedOutput` to keep during the minimization.
 if not isCommandStillFine(command):
@@ -68,35 +80,37 @@ if removeHeaders:
     # Should try to minimize the number of requests done, by testing half of parameters at each request.
     while True:
         changedSomething = False
-        arguments = shlex.split(command)
+        arguments = splitCommand(command)
         for argumentsIndex in range(len(arguments) - 1):
             argument, nextArgument = arguments[argumentsIndex : argumentsIndex + 2]
             if argument == '-H':
                 previousCommand = command
                 del arguments[argumentsIndex : argumentsIndex + 2]
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
                 if isCommandStillFine(command):
                     printThatCommandIsStillFine(command)
                     changedSomething = True
                     break
                 else:
                     command = previousCommand
-                    arguments = shlex.split(command)
+                    arguments = splitCommand(command)
         if not changedSomething:
             break
 
 if removeUrlParameters:
     print('Removing URL parameters')
-    arguments = shlex.split(command)
+    arguments = splitCommand(command)
     for argumentsIndex, argument in enumerate(arguments):
-        if argument.startswith('http'):
+        if argument.startswith("'http"):
             urlIndex = argumentsIndex
+            #arguments[urlIndex] = arguments[urlIndex][1:-1]
             break
     url = arguments[urlIndex]
     while True:
         changedSomething = False
+        url = url[1:-1]
         urlParsed = urlparse(url)
         query = parse_qs(urlParsed.query)
         for key in list(query):
@@ -104,8 +118,8 @@ if removeUrlParameters:
             del query[key]
             # Make a function with below code.
             url = urlParsed._replace(query = '&'.join([f'{quote_plus(parameter)}={quote_plus(query[parameter][0])}' for parameter in query])).geturl()
-            arguments[urlIndex] = url
-            command = shlex.join(arguments)
+            arguments[urlIndex] = shlex.quote(url)
+            command = joinSplittedCommand(arguments)
             if isCommandStillFine(command):
                 printThatCommandIsStillFine(command)
                 changedSomething = True
@@ -113,8 +127,8 @@ if removeUrlParameters:
             else:
                 query = previousQuery
                 url = urlParsed._replace(query = '&'.join([f'{quote_plus(parameter)}={quote_plus(query[parameter][0])}' for parameter in query])).geturl()
-                arguments[urlIndex] = url
-                command = shlex.join(arguments)
+                arguments[urlIndex] = shlex.quote(url)
+                command = joinSplittedCommand(arguments)
         if not changedSomething:
             break
 
@@ -125,7 +139,7 @@ if removeCookies:
     COOKIES_PREFIX_LEN = len(COOKIES_PREFIX)
 
     cookiesIndex = None
-    arguments = shlex.split(command)
+    arguments = splitCommand(command)
     for argumentsIndex, argument in enumerate(arguments):
         # For Chromium support:
         if argument[:COOKIES_PREFIX_LEN].title() == COOKIES_PREFIX:
@@ -142,7 +156,7 @@ if removeCookies:
             cookiesParsedCopy = cookiesParsed[:]
             del cookiesParsedCopy[cookiesParsedIndex]
             arguments[cookiesIndex] = COOKIES_PREFIX + '; '.join(cookiesParsedCopy)
-            command = shlex.join(arguments)
+            command = joinSplittedCommand(arguments)
             if isCommandStillFine(command):
                 printThatCommandIsStillFine(command)
                 changedSomething = True
@@ -150,7 +164,7 @@ if removeCookies:
                 break
             else:
                 arguments[cookiesIndex] = COOKIES_PREFIX + '; '.join(cookiesParsed)
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
         if not changedSomething:
             break
 
@@ -159,7 +173,7 @@ if removeRawData:
 
     rawDataIndex = None
     isJson = False
-    arguments = shlex.split(command)
+    arguments = splitCommand(command)
     for argumentsIndex, argument in enumerate(arguments):
         if argumentsIndex > 0 and arguments[argumentsIndex - 1] == '--data-raw':
             rawDataIndex = argumentsIndex
@@ -182,7 +196,7 @@ if removeRawData:
                 rawDataPartsCopy = copy.deepcopy(rawDataParts)
                 del rawDataPartsCopy[rawDataPartsIndex]
                 arguments[rawDataIndex] = '&'.join(rawDataPartsCopy)
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
                 if isCommandStillFine(command):
                     printThatCommandIsStillFine(command)
                     changedSomething = True
@@ -190,7 +204,7 @@ if removeRawData:
                     break
                 else:
                     arguments[rawDataIndex] = '&'.join(rawDataParts)
-                    command = shlex.join(arguments)
+                    command = joinSplittedCommand(arguments)
             if not changedSomething:
                 break
    # JSON recursive case.
@@ -229,7 +243,7 @@ if removeRawData:
                 del entry[lastPathPart]
             # Test if the removed entry was necessary.
             arguments[rawDataIndex] = json.dumps(rawDataParsedCopy)
-            command = shlex.join(arguments)
+            command = joinSplittedCommand(arguments)
             # (1) If it was unnecessary, then reconsider paths excluding possible children paths of this unnecessary entry, ensuring optimized complexity it seems.
             if isCommandStillFine(command):
                 printThatCommandIsStillFine(command)
@@ -239,7 +253,7 @@ if removeRawData:
             # If it was necessary, we consider possible children paths of this necessary entry and other paths.
             else:
                 arguments[rawDataIndex] = json.dumps(rawDataParsed)
-                command = shlex.join(arguments)
+                command = joinSplittedCommand(arguments)
         # If a loop iteration considering all paths, does not change anything, then the request cannot be minimized further.
         if not changedSomething:
             break
```
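The same idea in isolation, assuming a throwaway `intermediary_curl.sh` in the working directory is acceptable: letting bash read the command from a script file avoids the `/bin/sh` default of `shell = True` and should also sidestep the kernel's per-argument length limit for very long commands:

```python
import subprocess

def executeCommand(command):
    # Write the possibly very long curl command to a throwaway script and let
    # bash itself parse the `$'...'` quoting from the file.
    INTERMEDIARY_CURL_FILE_PATH = 'intermediary_curl.sh'
    with open(INTERMEDIARY_CURL_FILE_PATH, 'w') as f:
        f.write(command)
    return subprocess.check_output(['bash', INTERMEDIARY_CURL_FILE_PATH],
                                   stderr = subprocess.DEVNULL)
```

Passing `['bash', script]` as a list instead of `shell = True` skips the extra `/bin/sh -c` layer; the diff above keeps `shell = True`, which works too.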
Benjamin-Loison commented 1 month ago

Copy as {PowerShell, Fetch} does not seem to help.

On Windows:

Copy as cURL (Windows) does not use `$'`, but it does not work as wanted on Linux, according to echoing it to a file and diffing it with Copy as cURL (POSIX). Copy as cURL (POSIX) uses `$'`.

What about Chromium on Linux and Windows? See Online_authentication_API/issues/78#issuecomment-2405882.

Benjamin-Loison commented 1 month ago

DuckDuckGo and Google searches for "command POSIX to Windows converter" do not seem to return relevant results.

Benjamin-Loison commented 1 month ago

Note that `echo $"a\nb"` does not work, while `echo $'a\nb'` does.

Benjamin-Loison commented 1 month ago

Maybe the latest Python version of `shlex` supports this feature.

In fact, splitting `echo $'a\nb'` and `echo 'a\nb'` should return something like `['echo', 'a\nb']`. It is unclear how we should keep the `$'...'` information.
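One hedged way to not have to keep the `$'...'` marker at all: after a `posix = False` split, decode ANSI-C quoted tokens into their literal bytes (Python's `unicode_escape` handles `\n`, `\r`, `\t`, `\xHH`, ..., which only approximates bash's rules) and re-quote with `shlex.quote` when rebuilding the command:

```python
import codecs
import shlex

def decodeAnsiCQuotedToken(token):
    # Approximation of bash ANSI-C quoting: turn a `$'...'` token into its literal value.
    if token.startswith("$'") and token.endswith("'"):
        return codecs.decode(token[2:-1], 'unicode_escape')
    return token

arguments = [decodeAnsiCQuotedToken(argument)
             for argument in shlex.split("echo $'a\\r\\nb'", posix = False)]
print(arguments)  # ['echo', 'a\r\nb'], with real CR LF bytes
print(' '.join(shlex.quote(argument) for argument in arguments))
```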

Benjamin-Loison commented 1 month ago

Maybe we can somehow avoid needing `$'...'` by expanding the newlines (but that does not seem to apply to other usages of `$'...'`), or by providing the raw data through a file, which is possible as far as I remember.
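For the file route: `--data-raw` itself never interprets `@`, but `--data-binary @file` sends a file verbatim, so a sketch could write the decoded body (with its real CR LF bytes) to a temporary file and drop the `$'...'` argument entirely. The body, URL and file handling here are illustrative:

```python
import subprocess
import tempfile

rawData = 'field=value\r\nother=thing\r\n'  # decoded body with real CR LF bytes

with tempfile.NamedTemporaryFile('w', newline = '', suffix = '.txt', delete = False) as f:
    f.write(rawData)
    rawDataFilePath = f.name

# `--data-binary @file` posts the file content as-is, so the shell no longer has
# to carry the CR LF bytes inside a quoted argument.
subprocess.check_output(['curl', '--data-binary', f'@{rawDataFilePath}', 'https://example.com/'])
```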

Benjamin-Loison commented 1 month ago
COMMAND = "echo $'a\\nb'"
print(COMMAND)
print(shlex.split(COMMAND))
echo $'a\nb'
['echo', '$a\\nb']
COMMAND = "echo '$a\\nb'"
print(COMMAND)
print(shlex.split(COMMAND))
echo '$a\nb'
['echo', '$a\\nb']
echo '$a\nb'
$a\nb
COMMAND = "echo $'a\\nb'"
print(COMMAND)
print(shlex.split(COMMAND, posix = False))
echo $'a\nb'
['echo', "$'a\\nb'"]
COMMAND = "echo '$a\\nb'"
print(COMMAND)
print(shlex.split(COMMAND, posix = False))
echo '$a\nb'
['echo', "'$a\\nb'"]
Benjamin-Loison commented 1 month ago

In the case of Online authentication API, this issue may be due to my complex passwords; using passwords without special characters, if possible, may help. Otherwise, use special characters that do not seem to require the kind of escaping that `$'...'` provides.