Raza403 closed 5 months ago
Per Raza: one way to improve this is to have the test case report the HTTP error code. Did it fail to connect, or did the server return a 404 or another 4xx error?
Also, the root cause may be that our IP address is getting blocked because of frequent checking, so we want to make sure that our cache is working.
Another thing to do: when the test case starts, report the state of the cache, like this:
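A minimal sketch of how the checker could tell "failed to connect" apart from "got a 4xx". It assumes the check keeps shelling out to curl. `describeCurlFailure` and `checkUrl` are hypothetical names, not existing code in the repo. The exit codes are the ones documented in the curl man page, and `--write-out "%{http_code}"` prints the final status code. `/dev/null` assumes a Unix-like runner.

```javascript
const { spawnSync } = require("node:child_process");

// curl exit codes that mean no HTTP response was received (see the curl man page).
const CONNECT_ERRORS = {
  6: "could not resolve host",
  7: "failed to connect",
  28: "timed out",
  35: "TLS handshake failed",
};

// Hypothetical helper: turn curl's exit code and status code into a short reason.
function describeCurlFailure(exitCode, httpCode) {
  if (exitCode === 0) return null; // link is fine
  if (exitCode === 22) return `HTTP ${httpCode}`; // --fail turns status >= 400 into exit code 22
  if (CONNECT_ERRORS[exitCode]) return CONNECT_ERRORS[exitCode];
  return `curl exit code ${exitCode}`;
}

// Hypothetical wrapper around the current curl invocation.
function checkUrl(url, timeoutSeconds = 5) {
  const result = spawnSync(
    "curl",
    [
      "--head", "--silent", "--fail", "--location",
      "--max-time", String(timeoutSeconds),
      "--output", "/dev/null", // discard headers so stdout holds only the status code
      "--write-out", "%{http_code}",
      url,
    ],
    { encoding: "utf8" }
  );
  if (result.error) return `could not run curl: ${result.error.message}`;
  return describeCurlFailure(result.status, (result.stdout || "").trim());
}
```

The returned reason could then be appended to the existing report message, for example `external link is broken: ${url} (${reason})`.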
Loaded link checker cache: 321 items, newest 2024-03-04, oldest 2023-12-13
For example, that could go here: https://github.com/fulldecent/mtssites/actions/runs/8297750824/job/22709517969#step:8:14
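A rough sketch of how that startup line could be produced. `formatCacheSummary` is a hypothetical helper, and it assumes each cache row has a `time` column stored as Unix seconds. The thread does not confirm the unit, so adjust if it is milliseconds.

```javascript
// Hypothetical helper: build the one-line cache report shown above.
// ASSUMPTION: row.time is a Unix timestamp in seconds.
function formatCacheSummary(rows) {
  if (rows.length === 0) return "Loaded link checker cache: 0 items";

  let newest = -Infinity;
  let oldest = Infinity;
  for (const row of rows) {
    if (row.time > newest) newest = row.time;
    if (row.time < oldest) oldest = row.time;
  }

  // Format as a UTC calendar date (YYYY-MM-DD).
  const toDate = (seconds) => new Date(seconds * 1000).toISOString().slice(0, 10);
  return `Loaded link checker cache: ${rows.length} items, newest ${toDate(newest)}, oldest ${toDate(oldest)}`;
}
```

If the database handle is better-sqlite3-style (also an assumption), the rows could come from something like `db.prepare("SELECT time FROM urls").all()`, and the result could be logged once when the rule is set up.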
It is possible that the IP address used by GitHub Actions is blocked, but I tried this from a new IP on a new personal laptop and got the same error. One problem is that most of these websites block plain curl requests, so I upgraded the curl request to make it look more like a real browser. It now works on 80% of the sites that were previously giving false broken link errors.
Here is the updated curl command I used:
const result = execSync(`curl --head --silent --fail --max-time ${TIMEOUT_SECONDS} --location \
--user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.999 Safari/537.36" \
--header "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" \
"${url}"`);
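One possible variation, only a sketch: pass the same flags as an argument array through `execFileSync`, so no shell is involved and a URL containing quotes or `$` cannot break the command. `buildCurlArgs` and `isReachable` are hypothetical names. The user agent and Accept header are copied from the command above.

```javascript
const { execFileSync } = require("node:child_process");

const USER_AGENT =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.999 Safari/537.36";
const ACCEPT =
  "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";

// Hypothetical helper: same flags as the shell command, one array element per argument.
function buildCurlArgs(url, timeoutSeconds) {
  return [
    "--head", "--silent", "--fail", "--location",
    "--max-time", String(timeoutSeconds),
    "--user-agent", USER_AGENT,
    "--header", `Accept: ${ACCEPT}`,
    url,
  ];
}

// Hypothetical wrapper: true when curl exits with 0, false on any failure.
function isReachable(url, timeoutSeconds = 5) {
  try {
    execFileSync("curl", buildCurlArgs(url, timeoutSeconds), { stdio: "ignore" });
    return true;
  } catch {
    return false;
  }
}
```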
Here is the curl command we are using right now:
curl --head --silent --fail --max-time 5 --location "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3116747/"
Nice, those changes are good to use.
Deployed the update on all the sites.
Cool.
Can you please also capture this as an issue: test and confirm that the link cache is working. This can go in the upstream repo.
How would I do that? One way I can think of is to change the message to say that the external link is broken and that the result came from the cache, in the following code. That way we could be sure that the link is cached.
const row = this.db.prepare("SELECT found, time FROM urls WHERE url = ?").get(url);
if (row) {
  // Link is bad, from recent cache
  if (row.found === 0) {
    this.report({
      node: element,
      message: `external link is broken: ${url}`,
    });
  }
  return;
}
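A small sketch of that idea: build the message in one place and mark it when the result came from the cache. `brokenLinkMessage` is a hypothetical helper, and it assumes `row.time` is a Unix timestamp in seconds, which the thread does not confirm.

```javascript
// Hypothetical helper: same message as today, plus a marker for cached results.
// ASSUMPTION: cachedRow.time is a Unix timestamp in seconds.
function brokenLinkMessage(url, cachedRow) {
  if (!cachedRow) return `external link is broken: ${url}`;
  const checkedOn = new Date(cachedRow.time * 1000).toISOString().slice(0, 10);
  return `external link is broken: ${url} (cached result from ${checkedOn})`;
}
```

In the snippet above, the cached branch would pass `message: brokenLinkMessage(url, row)`, while a fresh check would call `brokenLinkMessage(url)`. Seeing the "cached result" suffix in a build log would then confirm that the cache is being read.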
Some links are not broken but are reported as broken by our external link checker. For example, https://twitter.com/aclstraining is not broken but is always reported as broken. Here is the build.