Benjamin-Loison opened 1 year ago
The following doesn't correctly remove the `--compressed` curl parameter, as the command may then output compressed data that can't be compared with the wanted output (tested with https://www.youtube.com/feed/subscriptions):
@@ -41,17 +45,21 @@ while True:
     arguments = shlex.split(command)
     for argumentsIndex in range(len(arguments) - 1):
         argument, nextArgument = arguments[argumentsIndex : argumentsIndex + 2]
-        if argument == '-H':
-            previousCommand = command
-            del arguments[argumentsIndex : argumentsIndex + 2]
-            command = shlex.join(arguments)
-            if isCommandStillFine(command):
-                print(len(command), 'still fine')
-                changedSomething = True
-                break
-            else:
-                command = previousCommand
-                arguments = shlex.split(command)
+        interested = [['-H', 1], ['--compressed', 0]]
+        for interestedArgument, offset in interested:
+            if argument == interestedArgument:
+                previousCommand = command
+                del arguments[argumentsIndex : argumentsIndex + 1 + offset]
+                command = shlex.join(arguments)
+                if isCommandStillFine(command):
+                    print(len(command), 'still fine')
+                    changedSomething = True
+                    break
+                else:
+                    command = previousCommand
+                    arguments = shlex.split(command)
+        if changedSomething:
+            break
     if not changedSomething:
         break
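If we want `--compressed` to stay removable even when the server then returns compressed bytes, the comparison itself could decompress them first. A minimal sketch, assuming gzip encoding (this helper is an assumption, not part of minimizeCURL.py):

import gzip

# Sketch: make the wanted-output check tolerate a gzip response once
# `--compressed` has been removed; gzip streams start with bytes 0x1f 0x8b.
def decodeResult(result: bytes) -> bytes:
    if result[:2] == b'\x1f\x8b':
        return gzip.decompress(result)
    return result

print(decodeResult(gzip.compress(b'wanted output')))  # b'wanted output'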
Can use the following to add a pre-request that resets the website state.
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 352c813..db2a8cc 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -28,6 +28,9 @@ removeUrlParameters = True
 removeCookies = True
 removeRawData = True
+with open('resetCurlCommand.txt') as f:
+    resetCommand = f.read()
+
 # Pay attention to provide a command giving plaintext output, so might required to remove `Accept-Encoding` HTTPS header.
 with open(curlCommandFilePath) as f:
     command = f.read()
@@ -39,6 +42,7 @@ def executeCommand(command):
     return result
 
 def isCommandStillFine(command):
+    executeCommand(resetCommand)
     result = executeCommand(command)
     return wantedOutput in result
     '''data = json.loads(result)
Got `64 still fine` repeated 33 times when working with https://www.youtube.com/playlist?list=WL.
This issue happens when we end up with `-H 'Cookie: '` in minimizedCurl.txt after removing all cookies. Unable to reproduce for cookies. Same issue with https://secure2.ldlc.com/fr-fr/Orders/PartialCompletedOrderContent?orderId for --raw-data.
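A minimal sketch, as an assumption of a possible fix (not current minimizeCURL.py behavior), of dropping a header whose value became empty:

import shlex

arguments = shlex.split("curl 'https://example.com' -H 'Cookie: '")
# Iterate backwards so deletions don't shift the indexes still to visit.
for index in range(len(arguments) - 2, -1, -1):
    if arguments[index] == '-H' and index + 1 < len(arguments):
        headerValue = arguments[index + 1].split(':', 1)[-1]
        if headerValue.strip() == '':
            del arguments[index : index + 2]
print(shlex.join(arguments))  # curl https://example.com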
There also seems to be a length inconsistency issue, as I get the following when removing a YouTube video from the Watch Later playlist:
4907
Removing raw data
5047 still fine
Could still simplify some cookies, as some websites such as YouTube put multiple values within a single cookie, for instance: PREF=f6=40000001&f7=140&hl=en&tz=Europe.Paris&f5=20000&f4=4000000&autoplay=true.
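A minimal sketch of greedily minimizing such sub-values, assuming a hypothetical isStillFine check that rebuilds and runs the curl command with the candidate PREF value:

from urllib.parse import parse_qsl, urlencode

cookieValue = 'f6=40000001&f7=140&hl=en&tz=Europe.Paris&f5=20000&f4=4000000&autoplay=true'
entries = dict(parse_qsl(cookieValue))

def isStillFine(candidateValue):
    # Hypothetical: rebuild the curl command with `PREF=<candidateValue>` and
    # verify that the wanted output is still returned.
    return True

for key in list(entries):
    candidate = {k: v for k, v in entries.items() if k != key}
    if isStillFine(urlencode(candidate)):
        del entries[key]
print(urlencode(entries))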
Automatically verify the command that we are about to declare as minimized, as it may not actually be minimized if the command no longer works due to request-count restrictions. It's the case with https://www.sncf-connect.com/bff/api/v1/itineraries for instance. Could also specify an exit response such that, if we face it, the algorithm stops.
Can for instance use:
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index dc21ee6..bdd19de 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -17,11 +17,14 @@ from urllib.parse import urlparse, parse_qs, quote_plus
 
 # Could precise the input file and possibly remove the output one as the minimized requests start to be short.
 if len(sys.argv) < 3:
-    print('Usage: ./minimizeCURL curlCommand.txt "Wanted output"')
+    print('Usage: ./minimizeCURL curlCommand.txt "Wanted output" <Unwanted output>')
     exit(1)
 curlCommandFilePath = sys.argv[1]
 wantedOutput = sys.argv[2]
+unwantedOutput = None
+if len(sys.argv) > 3:
+    unwantedOutput = sys.argv[3]
 
 # The purpose of these parameters is to reduce requests done when developing this script:
 removeHeaders = True
@@ -41,6 +44,9 @@ def executeCommand(command):
 
 def isCommandStillFine(command):
     result = executeCommand(command)
+    if unwantedOutput is not None and unwantedOutput in result:
+        print(f'The unwanted output is contained in the result of the command: {command}!')
+        exit(1)
     return wantedOutput in result
 
 print(len(command))
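With this patch the unwanted output becomes an optional third argument; a hypothetical invocation (both strings are only illustrative) could be:

./minimizeCURL curlCommand.txt '"itineraries"' 'Too Many Requests'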
Replacing `true` by `false`, when the boolean entry can't be removed, makes the entry longer and possibly modifies its behavior, so let's not do that.
If no wanted output is specified, just requiring the response to be identical to the initial request's response could be interesting, as in some cases (notably for other websites) it is unclear what to focus on.
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 6d1dec3..3f75dea 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -36,6 +36,7 @@ with open(curlCommandFilePath) as f:
 
 def executeCommand(command):
     # `stderr = subprocess.DEVNULL` is used to get rid of curl progress.
     # Could also add `-s` curl argument.
+    input('Ready to send request?')
     result = subprocess.check_output(f'{command}', shell = True, stderr = subprocess.DEVNULL).decode('utf-8')
     return result
This patch allows manually interacting with the website to make requests return a given output. I designed it for the inbox Mark all as read, but it returns the following anyway:
{
"UnreadMessagesCount": 0
}
Could otherwise let the operator specify whether the wanted behavior was observed, but that would heavily rely on the human operator, and maybe the script currently does not minimize such interactions.
To add regex support while staying backward compatible:
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index a572567..4280f0e 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -17,6 +17,7 @@ import subprocess
 import json
 import copy
 import sys
+import re
 from urllib.parse import urlparse, parse_qs, quote_plus
 
 # Could precise the input file and possibly remove the output one as the minimized requests start to be short.
@@ -45,7 +46,7 @@ def executeCommand(command):
 
 def isCommandStillFine(command):
     result = executeCommand(command)
-    return wantedOutput in result
+    return re.search(wantedOutput, result) is not None
 
 print(len(command))
 # To verify that the user provided the correct `wantedOutput` to keep during the minimization.
import re
wantedOutput = 'good'
#wantedOutput = 'good|better'
resultState = 'good'
#resultState = 'better'
#resultState = 'bad'
result = f'a {resultState} result'
print(f'\'{wantedOutput}\' in \'{result}\'')
print(wantedOutput in result)
print(f'search \'{wantedOutput}\' in \'{result}\'')
print(re.search(wantedOutput, result) is not None)
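Note that backward compatibility only holds for wanted outputs without regex metacharacters; for instance an unbalanced parenthesis would raise re.error. A sketch of this caveat, with re.escape restoring literal matching:

import re

wantedOutput = 'orderId=123)'
result = 'some orderId=123) in the page'
# re.search(wantedOutput, result) would raise re.error (unbalanced parenthesis).
print(re.search(re.escape(wantedOutput), result) is not None)  # True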
Could maybe minimize the URL as well, for instance for the Carrefour loyalty challenges, cf. Improve_websites_thanks_to_open_source/issues/417.
What about URLs that require Selenium due to Cloudflare?
import requests
from lxml import html
url = 'https://www.researchgate.net/profile/Chiara_Albisani/publication/352171921_Checking_PRNU_Usability_on_Modern_Devices/links/61cc5407b6b5667157b22ded/Checking-PRNU-Usability-on-Modern-Devices.pdf?origin=publicationDetail&_sg%5B0%5D=nDOwA7YEmKF9WaeOFPip9QacTaWMZE_dVTnos-xYJ9s8HGUvzglhWluJPxoHq9TiaoJLAvAetAgmeNKx-wvU8A.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_sg%5B1%5D=P40TtMRMXR5y8PhDtDwlWd4lbaBgQ_AwTo8rKHMW_8eu5hXOju_PDtHb5iPLJvA1hovOr_H7PtlG6kYJMXceQr3PWnCmrl1tQs_DSf26T6Ly.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_iepl=&_rtd=eyJjb250ZW50SW50ZW50IjoibWFpbkl0ZW0ifQ%3D%3D&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIiwicG9zaXRpb24iOiJwYWdlSGVhZGVyIn19'
text = requests.get(url).text
tree = html.fromstring(text)
cloudFlareBlocked = tree.xpath('//title')[0].text_content() == 'Just a moment...'
print(cloudFlareBlocked) # `True`
Well, in my case it is a PDF, hence page_source does not contain every page, so it is not a question of timing.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
#options.add_argument('-headless')
browser = webdriver.Firefox(options=options)
browser.get(url)
print('expose a sort of diagonal correlation patter' in browser.page_source) # `False`
browser.close()
I was trying to play with options.set_preference('pdfjs.disabled', True), but then browser.get(url) was not terminating even though the download had finished.
In fact, copy-pasting the cURL command from Firefox as usual does the job.
Removing URL parameters is problematic due to URL decoding for:
curl 'https://www.researchgate.net/profile/Chiara_Albisani/publication/352171921_Checking_PRNU_Usability_on_Modern_Devices/links/61cc5407b6b5667157b22ded/Checking-PRNU-Usability-on-Modern-Devices.pdf?origin=publicationDetail&_sg%5B0%5D=nDOwA7YEmKF9WaeOFPip9QacTaWMZE_dVTnos-xYJ9s8HGUvzglhWluJPxoHq9TiaoJLAvAetAgmeNKx-wvU8A.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_sg%5B1%5D=P40TtMRMXR5y8PhDtDwlWd4lbaBgQ_AwTo8rKHMW_8eu5hXOju_PDtHb5iPLJvA1hovOr_H7PtlG6kYJMXceQr3PWnCmrl1tQs_DSf26T6Ly.qQu8Kcis3LskK6M3UTyX9v7UjNkoVNJA7wK4vnFDBzxoTv13mn7Pw9tZrszf2f-yQuIDmuyNSGrglgTRsgDd1g&_iepl=&_rtd=eyJjb250ZW50SW50ZW50IjoibWFpbkl0ZW0ifQ%3D%3D&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InB1YmxpY2F0aW9uIiwicGFnZSI6InB1YmxpY2F0aW9uIiwicG9zaXRpb24iOiJwYWdlSGVhZGVyIn19' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0' -H 'Sec-Fetch-Site: cross-site'
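A minimal sketch of one way the decode/re-encode round trip can differ from the original query string, assuming the minimizer relies on urllib.parse as its imports suggest: parse_qs silently drops blank values such as _iepl=.

from urllib.parse import parse_qs

query = 'origin=publicationDetail&_iepl=&_rtd=eyJjb250ZW50SW50ZW50IjoibWFpbkl0ZW0ifQ%3D%3D'
print(parse_qs(query))
# {'origin': ['publicationDetail'], '_rtd': ['eyJjb250ZW50SW50ZW50IjoibWFpbkl0ZW0ifQ==']}
# `_iepl=` disappeared and `%3D%3D` was decoded to `==`, so naively re-encoding
# these values is not guaranteed to rebuild the original URL byte for byte;
# parse_qs(query, keep_blank_values = True) at least keeps `_iepl`.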
Just parsing the JavaScript itself would be better in case the ytInitialData script location changes. Otherwise we could loop over all scripts.
from lxml import html

with open('videos.html') as f:
    content = f.read()
tree = html.fromstring(content)
for scriptIndex, script in enumerate(tree.xpath('//script')):
    # Matching `ytInitialData = ` would be more precise, but the looser match
    # eases human work without replacing it completely.
    if 'ytInitialData' in script.text_content():
        print(scriptIndex)
36
38
diff --git a/tools/getJSONPathFromKey.py b/tools/getJSONPathFromKey.py
index e7493d0..dd73dad 100755
--- a/tools/getJSONPathFromKey.py
+++ b/tools/getJSONPathFromKey.py
@@ -26,6 +26,7 @@ As there are potentially multiple JavaScript variable names you can provide as t
 import sys
 import json
+from lxml import html
 
 def treatKey(obj, path, key):
     objKey = obj[key]
@@ -76,9 +77,12 @@ with open(filePath) as f:
 if not isJSON:
     with open(filePath) as f:
         content = f.read()
-    # Should use a HTML and JavaScript parser instead of proceeding that way.
+
+    # Should use a JavaScript parser instead of proceeding that way.
     # Same comment concerning `getJSONStringFromHTMLScriptPrefix`, note that both parsing methods should be identical.
-    newContent = content.split(ytVariableName + ' = ')[1].split(';<')[0]
+    tree = html.fromstring(content)
+    scriptContent = tree.xpath('//script')[36].text_content()
+    newContent = scriptContent.split(ytVariableName + ' = ')[1][:-1]
 
 with open(filePath, 'w') as f:
     f.write(newContent)
Same output with a new videos.html from the identical URL. What about other YouTube pages? Got 35 and 37 with the second page of https://www.youtube.com/@Squeezie/videos.
Make the condition be that the content is identical, not that a part of it is contained in the result. If we want to hardcode the wanted result:
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 67ce9cd..507921a 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -25,7 +25,9 @@ if len(sys.argv) < 3:
     exit(1)
 curlCommandFilePath = sys.argv[1]
-wantedOutput = sys.argv[2].encode('utf-8')
+wantedOutputFilePath = sys.argv[2]
+with open(wantedOutputFilePath, 'rb') as f:
+    wantedOutput = f.read()
 
 # The purpose of these parameters is to reduce requests done when developing this script:
 removeHeaders = True
@@ -45,7 +47,7 @@ def executeCommand(command):
 
 def isCommandStillFine(command):
     result = executeCommand(command)
-    return wantedOutput in result
+    return wantedOutput == result
 
 print(len(command))
 # To verify that the user provided the correct `wantedOutput` to keep during the minimization.
but an easier approach is to just consider the initial command's result, as the command is already hardcoded:
diff --git a/tools/minimizeCURL.py b/tools/minimizeCURL.py
index 67ce9cd..2828d7e 100755
--- a/tools/minimizeCURL.py
+++ b/tools/minimizeCURL.py
@@ -20,12 +20,11 @@ import sys
 from urllib.parse import urlparse, parse_qs, quote_plus
 
 # Could precise the input file and possibly remove the output one as the minimized requests start to be short.
-if len(sys.argv) < 3:
-    print('Usage: ./minimizeCURL curlCommand.txt "Wanted output"')
+if len(sys.argv) < 2:
+    print('Usage: ./minimizeCURL curlCommand.txt')
     exit(1)
 curlCommandFilePath = sys.argv[1]
-wantedOutput = sys.argv[2].encode('utf-8')
 
 # The purpose of these parameters is to reduce requests done when developing this script:
 removeHeaders = True
@@ -43,16 +42,13 @@ def executeCommand(command):
     result = subprocess.check_output(command, shell = True, stderr = subprocess.DEVNULL)
     return result
 
+wantedOutput = executeCommand(command)
+
 def isCommandStillFine(command):
     result = executeCommand(command)
-    return wantedOutput in result
+    return wantedOutput == result
 
 print(len(command))
-# To verify that the user provided the correct `wantedOutput` to keep during the minimization.
-if not isCommandStillFine(command):
-    print('The wanted output isn\'t contained in the result of the original curl command!')
-    exit(1)
-
 if removeHeaders:
     print('Removing headers')
Related to Improve_websites_thanks_to_open_source/issues/353#issuecomment-1725724.
This tool does not remove anchors (URL fragments).
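A minimal sketch of stripping them with urllib.parse.urldefrag, as a possible addition (not current behavior):

from urllib.parse import urldefrag

print(urldefrag('https://example.com/page?x=1#section').url)
# https://example.com/page?x=1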
Maybe we could choose a preferred HTTPS request format, for instance between Firefox and Chromium. This algorithm doesn't seem compatible with Chromium; it would be interesting to add support for it, as Chromium has some features that Firefox doesn't have. Furthermore, if someone provides me a cURL request different from what I expect because it was copied from Chromium, then this tool is quite useless as is; this case has happened. The concerned Chromium request seems to have expired.
In theory no translation is necessary, just global support. Cf. tools/simplifyCURL.py.