fulldecent / github-pages-template

An opinionated starting point and build system for awesome, collaboratively-edited HTML websites
https://fulldecent.github.io/github-pages-template/

External links falsely reported as broken. #62

Closed Raza403 closed 5 months ago

Raza403 commented 8 months ago

Some links are not broken but are reported as broken by our external link checker. For example, the link https://twitter.com/aclstraining works, yet it is always reported as broken. Here is the build.

fulldecent commented 5 months ago

Per Raza: a way to improve this is to have the test case report the HTTP error code. Did it fail to connect or is there a 404/40x error?
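One way this could look, as a minimal sketch rather than the checker's actual code: ask curl to print the HTTP status with `--write-out "%{http_code}"` and combine it with curl's exit status. The helper names `describeFailure` and `checkLink` are hypothetical, and the `/dev/null` output path assumes a Unix-like runner.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: map curl's exit status and HTTP code to a readable reason.
// Exit statuses used here are documented by curl: 6 (could not resolve host),
// 7 (failed to connect), 22 (HTTP error >= 400 with --fail), 28 (timeout).
function describeFailure(exitStatus, httpCode) {
  if (exitStatus === 0) return null; // link is reachable
  if (exitStatus === 6) return "could not resolve host";
  if (exitStatus === 7) return "failed to connect";
  if (exitStatus === 28) return "timed out";
  if (exitStatus === 22) return `HTTP ${httpCode}`;
  return `curl exit status ${exitStatus}`;
}

// Hypothetical wrapper: runs curl without a shell and returns null on success
// or a short failure reason that can be appended to the report message.
function checkLink(url, timeoutSeconds = 5) {
  const result = spawnSync(
    "curl",
    [
      "--head", "--silent", "--fail", "--location",
      "--max-time", String(timeoutSeconds),
      "--output", "/dev/null",        // discard headers (assumes a Unix-like system)
      "--write-out", "%{http_code}",  // print only the final HTTP status to stdout
      url,
    ],
    { encoding: "utf8" }
  );
  return describeFailure(result.status, (result.stdout || "").trim());
}
```

The reason string could then be added to the report, for example `external link is broken (HTTP 404): ${url}`, so a blocked request and an unreachable host are easy to tell apart.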

fulldecent commented 5 months ago

Also, the root cause may be that our IP address is getting blocked because of frequent checking. We want to make sure that our cache is working.

Another thing to do: when the test case starts, have it report on the state of the cache:

Loaded link checker cache: 321 items, newest 2024-03-04, oldest 2023-12-13

For example that can go here https://github.com/fulldecent/mtssites/actions/runs/8297750824/job/22709517969#step:8:14
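A rough sketch of how that startup line could be produced. It assumes the `time` column of the `urls` table holds Unix timestamps in seconds, which the thread does not confirm, and `summarizeCache` is a hypothetical name.

```javascript
// Hypothetical helper: build the startup message from the cached timestamps.
// Assumption: each timestamp is a Unix time in seconds.
function summarizeCache(timestamps) {
  if (timestamps.length === 0) {
    return "Loaded link checker cache: 0 items";
  }
  let newest = timestamps[0];
  let oldest = timestamps[0];
  for (const t of timestamps) {
    if (t > newest) newest = t;
    if (t < oldest) oldest = t;
  }
  // Format as YYYY-MM-DD in UTC.
  const toDate = (seconds) => new Date(seconds * 1000).toISOString().slice(0, 10);
  return `Loaded link checker cache: ${timestamps.length} items, ` +
    `newest ${toDate(newest)}, oldest ${toDate(oldest)}`;
}
```

If the database handle behaves like better-sqlite3 (an assumption based on the `prepare(...).get(url)` call used by the checker), the input could come from something like `db.prepare("SELECT time FROM urls").all().map((row) => row.time)`.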

Raza403 commented 5 months ago

It is possible that the IP address used by GitHub Actions is blocked, but I tried this from a new IP address on a new personal laptop and got the same error. One problem is that most of these websites block curl requests, so I updated the curl request to look more like a real browser, and it worked on 80% of the sites that were previously giving false broken-link errors. Here is the updated curl command I used:

const result = execSync(`curl --head --silent --fail --max-time ${TIMEOUT_SECONDS} --location \
      --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.999 Safari/537.36" \
      --header "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" \
      "${url}"`);
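As a possible variation on the same command, sketched under the assumption that the URL comes from untrusted page content: passing the arguments as an array to `execFileSync` avoids having a shell interpret characters inside the URL. The names `buildCurlArgs` and `isLinkReachable` are hypothetical.

```javascript
const { execFileSync } = require("node:child_process");

// Same browser-like headers as in the command above.
const USER_AGENT =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.9999.999 Safari/537.36";
const ACCEPT_HEADER =
  "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";

// Hypothetical helper: build the argument list, with the URL as a single argument.
function buildCurlArgs(url, timeoutSeconds) {
  return [
    "--head", "--silent", "--fail",
    "--max-time", String(timeoutSeconds),
    "--location",
    "--user-agent", USER_AGENT,
    "--header", ACCEPT_HEADER,
    url,
  ];
}

// Hypothetical wrapper: execFileSync throws when curl exits non-zero
// (or cannot be started), which is treated here as "not reachable".
function isLinkReachable(url, timeoutSeconds = 5) {
  try {
    execFileSync("curl", buildCurlArgs(url, timeoutSeconds), { stdio: "ignore" });
    return true;
  } catch (error) {
    return false;
  }
}
```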

Here is the curl command we are using right now.

curl --head --silent --fail --max-time 5 --location "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3116747/"

fulldecent commented 5 months ago

Nice, good to use those changes

Raza403 commented 5 months ago

Deployed the update on all the sites.

fulldecent commented 5 months ago

Cool.

Can you please also capture this as an issue: test and confirm that the link cache is working. This can go in the upstream repo.

Raza403 commented 5 months ago

How should I do this? One way I can think of is to change the message in the following code to "external link is broken and saved in the cache". That way we could be sure that the link is cached.

const row = this.db.prepare("SELECT found, time FROM urls WHERE url = ?").get(url);
if (row) {
  // Link is bad, from recent cache
  if (row.found === 0) {
    this.report({
      node: element,
      message: `external link is broken: ${url}`,
    });
  }
  return;
}
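A small sketch of how the cache behaviour could be confirmed in a test. This is not the checker's real code: an in-memory `Map` stands in for the `urls` table, and `CachedLinkChecker`, `fetchStatus`, and `brokenLinkMessage` are hypothetical names. The idea is to count real lookups and to mark messages that come from the cache.

```javascript
// Hypothetical stand-in for the checker: a Map plays the role of the `urls` table.
class CachedLinkChecker {
  constructor(fetchStatus) {
    this.cache = new Map();         // url -> { found, time }
    this.fetchStatus = fetchStatus; // (url) => boolean, assumed to do the real check
    this.networkChecks = 0;         // how many times the real check ran
  }

  check(url) {
    const row = this.cache.get(url);
    if (row) {
      // Cache hit: no network request is made.
      return { found: row.found, fromCache: true };
    }
    this.networkChecks += 1;
    const found = this.fetchStatus(url) ? 1 : 0;
    this.cache.set(url, { found, time: Math.floor(Date.now() / 1000) });
    return { found, fromCache: false };
  }
}

// Hypothetical message builder that says whether the result came from the cache.
function brokenLinkMessage(url, fromCache) {
  return fromCache
    ? `external link is broken (cached result): ${url}`
    : `external link is broken: ${url}`;
}

// Usage: checking the same URL twice should trigger only one real lookup.
const checker = new CachedLinkChecker(() => false);
const first = checker.check("https://example.com/missing");
const second = checker.check("https://example.com/missing");
console.log(brokenLinkMessage("https://example.com/missing", second.fromCache));
console.log(`real lookups: ${checker.networkChecks}`);
```

Counting lookups gives a direct check that the cache is used, while the changed message makes cached results visible in the build log.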