Benjamin-Loison / YouTube-operational-API

YouTube operational API works when YouTube Data API v3 fails.

Make a tool automatically minimizing HTTPS requests while keeping a given output #171

Open Benjamin-Loison opened 1 year ago

Benjamin-Loison commented 1 year ago

Maybe the user could choose a preferred HTTPS request format, for instance between Firefox and Chromium. This algorithm doesn't seem compatible with Chromium; it would be interesting to add support for it, as Chromium has some features that Firefox doesn't have. Furthermore, if someone provides me a cURL request in a different format than I expect, copied from Chromium for instance, then this tool is quite useless as-is; this case has already happened. The Chromium request in question seems to have expired.

In theory no translation between formats is necessary, just global support for both.

Cf. tools/simplifyCURL.py.

Benjamin-Loison commented 1 year ago

The following doesn't correctly remove the --compressed curl parameter, as the command may then output compressed data that cannot be compared with the wanted output parameter (tested with https://www.youtube.com/feed/subscriptions):

@@ -41,17 +45,24 @@ while True:
     arguments = shlex.split(command)
     for argumentsIndex in range(len(arguments) - 1):
         argument, nextArgument = arguments[argumentsIndex : argumentsIndex + 2]
-        if argument == '-H':
-            previousCommand = command
-            del arguments[argumentsIndex : argumentsIndex + 2]
-            command = shlex.join(arguments)
-            if isCommandStillFine(command):
-                print(len(command), 'still fine')
-                changedSomething = True
-                break
-            else:
-                command = previousCommand
-                arguments = shlex.split(command)
+        interested = [['-H', 1], ['--compressed', 0]]
+        removedSomething = False
+        for interestedArgument, offset in interested:
+            if argument == interestedArgument:
+                previousCommand = command
+                del arguments[argumentsIndex : argumentsIndex + 1 + offset]
+                command = shlex.join(arguments)
+                if isCommandStillFine(command):
+                    print(len(command), 'still fine')
+                    changedSomething = True
+                    removedSomething = True
+                    break
+                else:
+                    command = previousCommand
+                    arguments = shlex.split(command)
+        if removedSomething:
+            # Also exit the arguments loop, as `break` above only exits the loop over `interested`.
+            break
     if not changedSomething:
         break
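
A likely complement (my assumption, not tested here): browser-copied commands usually embed an explicit Accept-Encoding header, so the server keeps answering compressed even once --compressed is gone; removing that header first should make dropping --compressed safe. A minimal sketch:

import shlex

# A sketch (assumption): remove any explicit `Accept-Encoding` header so that
# the server answers in plaintext, which makes `--compressed` removable
# without breaking the output comparison.
def dropAcceptEncoding(command):
    arguments = shlex.split(command)
    argumentsIndex = 0
    while argumentsIndex < len(arguments) - 1:
        if arguments[argumentsIndex] == '-H' and arguments[argumentsIndex + 1].lower().startswith('accept-encoding'):
            del arguments[argumentsIndex : argumentsIndex + 2]
        else:
            argumentsIndex += 1
    return shlex.join(arguments)
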
Benjamin-Loison commented 1 year ago

Can use the following to add a pre-request that resets the website state.

diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 352c813..db2a8cc 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -28,6 +28,9 @@ removeUrlParameters = True
 removeCookies = True
 removeRawData = True

+with open('resetCurlCommand.txt') as f:
+    resetCommand = f.read()
+
 # Pay attention to provide a command giving plaintext output, so it might be required to remove the `Accept-Encoding` HTTPS header.
 with open(curlCommandFilePath) as f:
     command = f.read()
@@ -39,6 +42,7 @@ def executeCommand(command):
     return result

 def isCommandStillFine(command):
+    executeCommand(resetCommand)
     result = executeCommand(command)
     return wantedOutput in result
     '''data = json.loads(result)
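
For instance, when minimizing a request that removes a video from the Watch Later playlist, resetCurlCommand.txt could hold the curl command that puts the video back, so that every trial starts from the same website state. The pre-request could also be made optional; a minimal sketch of how isCommandStillFine from the diff above could be adapted, assuming the same file name:

import os

# A sketch (assumption): only run the reset pre-request when a
# `resetCurlCommand.txt` file actually exists.
resetCommand = None
if os.path.isfile('resetCurlCommand.txt'):
    with open('resetCurlCommand.txt') as f:
        resetCommand = f.read()

def isCommandStillFine(command):
    if resetCommand is not None:
        executeCommand(resetCommand)
    result = executeCommand(command)
    return wantedOutput in result
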
Benjamin-Loison commented 1 year ago

Got 64 still fine repeated 33 times when working with https://www.youtube.com/playlist?list=WL. This issue happens when we end up with -H 'Cookie: ' in minimizedCurl.txt after removing all cookies. Unable to reproduce for cookies. Same issue with https://secure2.ldlc.com/fr-fr/Orders/PartialCompletedOrderContent?orderId for --data-raw.

There also seems to be a length inconsistency issue, as I get the following when, on a YouTube video page, minimizing the request that removes the video from the Watch Later playlist:

4907
Removing raw data
5047 still fine
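
My guess (not verified): the printed lengths compare the original hand-formatted command with commands rebuilt by shlex.join, which re-quotes every argument with shlex.quote and can therefore produce a different, sometimes greater, length. A minimal demonstration:

import shlex

# `shlex.quote` turns an embedded single quote into the longer '"'"'
# construction, so the rebuilt command can be longer than the original.
arguments = ['curl', 'https://example.com', '--data-raw', "it's raw data"]
command = shlex.join(arguments)
print(command)  # curl https://example.com --data-raw 'it'"'"'s raw data'
print(len(command))  # Longer than the double-quoted form "it's raw data".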
Benjamin-Loison commented 1 year ago

Could still simplify some cookies, as some websites such as YouTube put multiple values within a cookie, for instance: PREF=f6=40000001&f7=140&hl=en&tz=Europe.Paris&f5=20000&f4=4000000&autoplay=true.
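
A minimal sketch of such a simplification, assuming sub-values are separated by & and with isStillFine a hypothetical callback that re-runs the whole curl command with the rewritten cookie value:

# Try to drop each `&`-separated sub-value of a multi-value cookie such as
# `PREF`, keeping only the sub-values required for the wanted output.
def simplifyCookieValue(cookieValue, isStillFine):
    parts = cookieValue.split('&')
    index = 0
    while index < len(parts):
        candidate = parts[:index] + parts[index + 1:]
        if isStillFine('&'.join(candidate)):
            # This sub-value was unnecessary.
            parts = candidate
        else:
            index += 1
    return '&'.join(parts)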

Benjamin-Loison commented 1 year ago

Automatically verify the command that we are about to declare as minimized, as it may not actually be minimal: the command may have stopped working due to restrictions on the number of requests. This is the case with https://www.sncf-connect.com/bff/api/v1/itineraries for instance. Could also specify an exit response, such that if we encounter it, the algorithm stops.

Can for instance use:

diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index dc21ee6..bdd19de 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -17,11 +17,14 @@ from urllib.parse import urlparse, parse_qs, quote_plus

 # Could precise the input file and possibly remove the output one as the minimized requests start to be short.
 if len(sys.argv) < 3:
-    print('Usage: ./minimizeCURL curlCommand.txt "Wanted output"')
+    print('Usage: ./minimizeCURL curlCommand.txt "Wanted output" <Unwanted output>')
     exit(1)

 curlCommandFilePath = sys.argv[1]
 wantedOutput = sys.argv[2]
+unwantedOutput = None
+if len(sys.argv) > 3:
+    unwantedOutput = sys.argv[3]

 # The purpose of these parameters is to reduce requests done when developing this script:
 removeHeaders = True
@@ -41,6 +44,9 @@ def executeCommand(command):

 def isCommandStillFine(command):
     result = executeCommand(command)
+    if unwantedOutput is not None and unwantedOutput in result:
+        print(f'The unwanted output is contained in the result of the command: {command}!')
+        exit(1)
     return wantedOutput in result

 print(len(command))
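
For instance, assuming the rate-limited endpoint answers with a page containing the hypothetical marker Too Many Requests, one could run:

./minimizeCURL curlCommand.txt 'Wanted output' 'Too Many Requests'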
Benjamin-Loison commented 1 year ago

Replacing true by false when the boolean entry can't be removed makes the entry longer and could modify its behavior, so let's not do that.

Benjamin-Loison commented 9 months ago

If no wanted output is specified, just requiring the response to be identical to the initial request's response could be interesting, as in some cases (notably for other websites) it is unclear what output to focus on.

Benjamin-Loison commented 9 months ago
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 6d1dec3..3f75dea 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -36,6 +36,7 @@ with open(curlCommandFilePath) as f:
 def executeCommand(command):
     # `stderr = subprocess.DEVNULL` is used to get rid of curl progress.
     # Could also add `-s` curl argument.
+    input('Ready to send request?')
     result = subprocess.check_output(f'{command}', shell = True, stderr = subprocess.DEVNULL).decode('utf-8')
     return result

The above prompt allows manually interacting with the website between requests, to make them return a given output.

I designed it for the inbox Mark all as read feature, but it returns the following anyway:

{
    "UnreadMessagesCount": 0
}

Could otherwise let the human operator specify whether the wanted behavior was observed, but that would rely heavily on the operator, and the script currently may not minimize such interactions.

Benjamin-Loison commented 8 months ago

To add regex support while staying backward compatible (as long as the wanted output contains no regex metacharacters; otherwise it would need re.escape):

diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index a572567..4280f0e 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -17,6 +17,7 @@ import subprocess
 import json
 import copy
 import sys
+import re
 from urllib.parse import urlparse, parse_qs, quote_plus

 # Could precise the input file and possibly remove the output one as the minimized requests start to be short.
@@ -45,7 +46,7 @@ def executeCommand(command):

 def isCommandStillFine(command):
     result = executeCommand(command)
-    return wantedOutput in result
+    return re.search(wantedOutput, result) is not None

 print(len(command))
 # To verify that the user provided the correct `wantedOutput` to keep during the minimization.
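
A quick standalone check that re.search keeps matching plain substrings while adding support for alternation: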
import re

wantedOutput = 'good'
#wantedOutput = 'good|better'

resultState = 'good'
#resultState = 'better'
#resultState = 'bad'
result = f'a {resultState} result'

print(f'\'{wantedOutput}\' in \'{result}\'')
print(wantedOutput in result)

print(f'search \'{wantedOutput}\' in \'{result}\'')
print(re.search(wantedOutput, result) is not None)
Benjamin-Loison commented 8 months ago

Could maybe minimize the URL path as well. For instance the Carrefour loyalty challenges, cf. Improve_websites_thanks_to_open_source/issues/417:

https://b2c-api.challengefid.b2c.untienots.com/api/v2/core/customers/CENSORED/campaigns/175/challenges

Benjamin-Loison commented 8 months ago

What about URLs that require Selenium due to Cloudflare?

import requests
from lxml import html

url = 'https://www.researchgate.net/profile/Chiara_Albisani/publication/352171921_Checking_PRNU_Usability_on_Modern_Devices/links/61cc5407b6b5667157b22ded/Checking-PRNU-Usability-on-Modern-Devices.pdf?origin=publicationDetail&_sg%5B0%5D=nDOwA7YEmKF9WaeOFPip9QacTaWMZE_dVTnos-xYJ9s8HGUvzglhWluJPxoHq9TiaoJLAvAetAgmeNKx-wvU8A.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_sg%5B1%5D=P40TtMRMXR5y8PhDtDwlWd4lbaBgQ_AwTo8rKHMW_8eu5hXOju_PDtHb5iPLJvA1hovOr_H7PtlG6kYJMXceQr3PWnCmrl1tQs_DSf26T6Ly.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_iepl=&_rtd=eyJjb250ZW50SW50ZW50IjoibWFpbkl0ZW0ifQ%3D%3D&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIiwicG9zaXRpb24iOiJwYWdlSGVhZGVyIn19'

text = requests.get(url).text
tree = html.fromstring(text)
cloudFlareBlocked = tree.xpath('//title')[0].text_content() == 'Just a moment...'
print(cloudFlareBlocked) # `True`

Well, in my case it is a PDF, hence page_source does not contain every page (presumably because pdf.js renders them lazily), so it is not a question of timing.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
#options.add_argument('-headless')
browser = webdriver.Firefox(options=options)

browser.get(url)
print('expose a sort of diagonal correlation patter' in browser.page_source) # `False`
browser.close()

I was trying to play with options.set_preference('pdfjs.disabled', True), but then browser.get(url) was not terminating even though the download had finished.

In fact, copy-pasting the cURL command from Firefox as usual does the job.

Removing URL parameters is problematic because of URL decoding, for instance for:

curl 'https://www.researchgate.net/profile/Chiara_Albisani/publication/352171921_Checking_PRNU_Usability_on_Modern_Devices/links/61cc5407b6b5667157b22ded/Checking-PRNU-Usability-on-Modern-Devices.pdf?origin=publicationDetail&_sg%5B0%5D=nDOwA7YEmKF9WaeOFPip9QacTaWMZE_dVTnos-xYJ9s8HGUvzglhWluJPxoHq9TiaoJLAvAetAgmeNKx-wvU8A.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_sg%5B1%5D=P40TtMRMXR5y8PhDtDwlWd4lbaBgQ_AwTo8rKHMW_8eu5hXOju_PDtHb5iPLJvA1hovOr_H7PtlG6kYJMXceQr3PWnCmrl1tQs_DSf26T6Ly.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_iepl=&_rtd=eyJjb250ZW50SW50ZW50IjoibWFpbkl0ZW0ifQ%3D%3D&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIiwicG9zaXRpb24iOiJwYWdlSGVhZGVyIn19' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0' -H 'Sec-Fetch-Site: cross-site'
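
A possible direction (an assumption, not the current implementation): drop query parameters without ever decoding them, by splitting the raw query string on & instead of round-tripping through parse_qs and quote_plus, so that percent-encoded values such as _sg%5B0%5D stay byte-identical:

from urllib.parse import urlsplit, urlunsplit

# Remove the query parameter at `parameterIndex` while keeping every other
# parameter exactly as encoded in the original URL.
def removeQueryParameter(url, parameterIndex):
    parts = urlsplit(url)
    parameters = parts.query.split('&')
    del parameters[parameterIndex]
    return urlunsplit(parts._replace(query = '&'.join(parameters)))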
Benjamin-Loison commented 7 months ago

Just parsing the JavaScript statically would be better in case the ytInitialData script location changes. Otherwise we could loop over all scripts.

from lxml import html

with open('videos.html') as f:
    content = f.read()

tree = html.fromstring(content)
for scriptIndex, script in enumerate(tree.xpath('//script')):
    # Matching `ytInitialData = ` would be more precise, but the goal is to ease human work, not to replace it completely.
    if 'ytInitialData' in script.text_content():
        print(scriptIndex)
36
38
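
Hence hardcoding script index 36 for now: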
diff --git a/tools/getJSONPathFromKey.py b/tools/getJSONPathFromKey.py
index e7493d0..dd73dad 100755
--- a/tools/getJSONPathFromKey.py
+++ b/tools/getJSONPathFromKey.py
@@ -26,6 +26,7 @@ As there are potentially multiple JavaScript variable names you can provide as t

 import sys
 import json
+from lxml import html

 def treatKey(obj, path, key):
     objKey = obj[key]
@@ -76,9 +77,12 @@ with open(filePath) as f:
 if not isJSON:
     with open(filePath) as f:
         content = f.read()
-    # Should use a HTML and JavaScript parser instead of proceeding that way.
+
+    # Should use a JavaScript parser instead of proceeding that way.
     # Same comment concerning `getJSONStringFromHTMLScriptPrefix`, note that both parsing methods should be identical.
-    newContent = content.split(ytVariableName + ' = ')[1].split(';<')[0]
+    tree = html.fromstring(content)
+    scriptContent = tree.xpath('//script')[36].text_content()
+    newContent = scriptContent.split(ytVariableName + ' = ')[1][:-1]
     with open(filePath, 'w') as f:
         f.write(newContent)

Same output with a fresh download of videos.html from the identical URL. What about other YouTube pages? Got 35 and 37 with the second page of https://www.youtube.com/@Squeezie/videos

Benjamin-Loison commented 7 months ago

Make the condition be that the content is identical, not that a part is contained in the result. If we want to hardcode the wanted result:

diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 67ce9cd..507921a 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -25,7 +25,9 @@ if len(sys.argv) < 3:
     exit(1)

 curlCommandFilePath = sys.argv[1]
-wantedOutput = sys.argv[2].encode('utf-8')
+wantedOutputFilePath = sys.argv[2]
+with open(wantedOutputFilePath, 'rb') as f:
+    wantedOutput = f.read()

 # The purpose of these parameters is to reduce requests done when developing this script:
 removeHeaders = True
@@ -45,7 +47,7 @@ def executeCommand(command):

 def isCommandStillFine(command):
     result = executeCommand(command)
-    return wantedOutput in result
+    return wantedOutput == result

 print(len(command))
 # To verify that the user provided the correct `wantedOutput` to keep during the minimization.

but an easier approach consists in just considering the initial command result, as the command is already hardcoded:

diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 67ce9cd..2828d7e 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -20,12 +20,11 @@ import sys
 from urllib.parse import urlparse, parse_qs, quote_plus

 # Could precise the input file and possibly remove the output one as the minimized requests start to be short.
-if len(sys.argv) < 3:
-    print('Usage: ./minimizeCURL curlCommand.txt "Wanted output"')
+if len(sys.argv) < 2:
+    print('Usage: ./minimizeCURL curlCommand.txt')
     exit(1)

 curlCommandFilePath = sys.argv[1]
-wantedOutput = sys.argv[2].encode('utf-8')

 # The purpose of these parameters is to reduce requests done when developing this script:
 removeHeaders = True
@@ -43,16 +42,13 @@ def executeCommand(command):
     result = subprocess.check_output(command, shell = True, stderr = subprocess.DEVNULL)
     return result

+wantedOutput = executeCommand(command)
+
 def isCommandStillFine(command):
     result = executeCommand(command)
-    return wantedOutput in result
+    return wantedOutput == result

 print(len(command))
-# To verify that the user provided the correct `wantedOutput` to keep during the minimization.
-if not isCommandStillFine(command):
-    print('The wanted output isn\'t contained in the result of the original curl command!')
-    exit(1)
-
 if removeHeaders:
     print('Removing headers')

Related to Improve_websites_thanks_to_open_source/issues/353#issuecomment-1725724.

Benjamin-Loison commented 4 months ago

This tool does not remove URL anchors (fragments).
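
A one-line fix sketch using the standard library (an assumption, as fragments are never sent to the server anyway):

from urllib.parse import urldefrag

# Strip the `#...` fragment: it is resolved client-side, so removing it
# cannot change the server's response.
url = urldefrag('https://example.com/page?x=1#section').url
print(url)  # https://example.com/page?x=1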