Standalone (batch- and command-line) and Gradle-plugin html sanity checker - detects missing images, dead links and cross-references, duplicate link targets (anchors) and the like.
Amazon seems to respond differently to unknown URLs depending on various request parameters.
Currently I run into test errors with the test case BrokenHttpLinksCheckerSpec: "bad amazon link is identified as problem".
It seems to work in GitHub Actions but fails on my local machine, both when running the single test from the IDE (IntelliJ) and in a full gradlew test run.
I tracked it down to the following behaviour:
When executed locally, Amazon returns a status 200 together with a captcha page. The test case expects a 503 status code, which HSC would then report as a finding.
When executed in GitHub Actions it seems to work as expected, returning a 503 (unfortunately we do not yet log the results, so I cannot confirm this).
Locally I could further change Amazon's behaviour by setting the User-Agent header of the request.
This can even be reproduced with curl:
With curl's default User-Agent (curl/8.4.0 in my case), curl -X HEAD -v https://www.amazon.com/dp/4242424242 returns a 503 (the same holds true for GET requests).
With the default HSC User-Agent header "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0", curl -H "User-Agent: Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0" -X GET -v https://www.amazon.com/dp/4242424242 returns a status 200 and a captcha request.
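The same comparison could be run from the JVM. The following is only a sketch, assuming plain java.net.HttpURLConnection; the class name UserAgentProbe and its helper methods are made up for illustration and are not part of HSC. The status codes it prints depend on Amazon and on the network it runs from, so none are shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for reproducing the curl experiment; not HSC code.
final class UserAgentProbe {

    // The User-Agent value this issue reports as HSC's current default.
    static final String HSC_DEFAULT_UA =
            "Mozilla/5.0 (X11; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0";

    // Builds a request with the given method and User-Agent; nothing is sent yet.
    static HttpURLConnection prepare(String url, String method, String userAgent) {
        try {
            HttpURLConnection con =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            con.setRequestMethod(method);
            con.setRequestProperty("User-Agent", userAgent);
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sends the prepared request and returns the HTTP status code.
    static int status(HttpURLConnection con) {
        try {
            return con.getResponseCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) {
        String url = "https://www.amazon.com/dp/4242424242";
        // Same two requests as the curl calls: only method and User-Agent differ.
        System.out.println("curl-like UA:   "
                + status(prepare(url, "HEAD", "curl/8.4.0")));
        System.out.println("HSC default UA: "
                + status(prepare(url, "GET", HSC_DEFAULT_UA)));
    }
}
```

Note that HttpURLConnection follows redirects by default, so the reported code is that of the final response and may not match curl's output in every case.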
Cf. bug-316.zip
Perhaps this is similar to the behaviour we see in #219?
I suggest setting the User-Agent header to something HSC-specific (e.g., hsc/version).
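One way the suggested header value could be built is sketched below. This is an assumption about a possible implementation, not existing HSC code: the class HscUserAgent, its methods, and the fallback to a bare "hsc" token when no version is known are all hypothetical, and the version string in main is a placeholder.

```java
import java.net.URLConnection;

// Hypothetical sketch of an HSC-specific User-Agent; names are illustrative only.
final class HscUserAgent {

    // Product token taken from the "hsc/version" suggestion above.
    static final String PRODUCT = "hsc";

    // Returns "hsc/<version>", or just "hsc" if no version is available (assumed fallback).
    static String forVersion(String version) {
        if (version == null || version.trim().isEmpty()) {
            return PRODUCT;
        }
        return PRODUCT + "/" + version.trim();
    }

    // Sets the header on a connection that has not been opened yet.
    static void applyTo(URLConnection con, String version) {
        con.setRequestProperty("User-Agent", forVersion(version));
    }

    public static void main(String[] args) {
        // "1.2.3" is a placeholder, not an actual HSC release number.
        System.out.println(forVersion("1.2.3"));
    }
}
```

Whether Amazon answers such a value with a 503, as it does for curl's default User-Agent, would still need to be verified locally and in GitHub Actions.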