cf-convention / cf-convention.github.io

sources for website cf-conventions.org
cf-convention.github.io
Creative Commons Zero v1.0 Universal

Broken web links #318

Closed: github-actions[bot] closed this issue 5 months ago

github-actions[bot] commented 1 year ago

This issue was opened automatically but led to a discussion by humans

Errors were reported while checking the availability of links: Issues found in 6 inputs. Find details below.

[faq.md]:
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2009/047768.html | Failed: Network error: Not Found
⧖ [TIMEOUT] http://coastwatch.pfeg.noaa.gov/erddap/convert/units.html | Timeout
✗ [ERR] http://kitt.llnl.gov/trac/wiki/SatelliteData | Failed: Network error: dns error: no record found for name: kitt.llnl.gov.coi3uxiffnlergb4vem53tdisf.gx.internal.cloudapp.net. type: AAAA class: IN
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2012/055875.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/053657.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052705.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/048064.html | Failed: Network error: Not Found
✗ [404] https://mailman.cgd.ucar.edu/pipermail/cf-metadata/2008/052334.html | Failed: Network error: Not Found

[standard_name_rules.md]: ⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[discussion.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=inactive&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[software.md]: ✗ [ERR] http://wps-web1.ceda.ac.uk/submit/form?proc_id=CFChecker | Failed: Network error: dns error: no record found for name: wps-web1.ceda.ac.uk.coi3uxiffnlergb4vem53tdisf.gx.internal.cloudapp.net. type: AAAA class: IN

[vocabularies.md]:
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=inactive&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout
⧖ [TIMEOUT] http://cfeditor.ceda.ac.uk/proposals/1?status=active&namefilter=&proposerfilter=&descfilter=&filter+and+display=filter | Timeout

[constitution.md]: ✗ [ERR] file:///github/workspace/(https:/github.com/cf-convention/cf-conventions/blob/master/CODE_OF_CONDUCT.md) | Failed: Cannot find file

🔍 350 Total ✅ 335 OK 🚫 9 Errors (HTTP:9|Timeouts:6)

erget commented 1 year ago

Starting to follow this up a bit. Some questions:

I see there are 2 options of interest, exclude and timeout. The default timeout is 20 s, which is already fairly long, and I find it reasonable considering what I believe to be normal human user behaviour. For the moment I'll try excluding the domains in question. Then I think that, beyond the question of what stance we adopt toward those domains, the only remaining "real" issue would be the link in constitution.md.
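For reference, both options mentioned above can be set in the link checker's configuration file. A minimal sketch, assuming the lychee checker used elsewhere in this thread (the excluded domains are illustrative, taken from the report above):

```toml
# Illustrative lychee configuration sketch, not the repo's actual file.
# Maximum time to wait for a response, in seconds (lychee's default is 20).
timeout = 20

# URL patterns to skip entirely; persistent offenders from the weekly report.
exclude = [
    "https://mailman.cgd.ucar.edu/pipermail/cf-metadata",
    "cfeditor.ceda.ac.uk",
]
```

Exclusion silences the noise but does not fix the underlying links, which is why the issue needs to stay open until they are repointed.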

JonathanGregory commented 1 year ago

Dear Daniel @erget

Thanks for following this up. Yes, we imported the mailman archive. Our copy is linked on the discussion page on the same line, and in front of, the UCAR original. We can fix those missing links by pointing to the appropriate place in our copy, but unfortunately this will take a bit of work because our copy is not grouped into years.

I don't know why the link to the CEDA cf-checker isn't working, but @RosalynHatcher probably could advise.

I don't know why the CEDA editor isn't working. It has worked in the past, although it's always rather slow to answer that query. Maybe Alison @japamment could comment?

Best wishes for 2023

Jonathan

JonathanGregory commented 1 year ago

I agree that the weekly repetition of the same broken links has been a nuisance and thanks for stopping it. But we mustn't forget to fix them! We should keep this issue open until it's done.

erget commented 1 year ago

@JonathanGregory I agree in principle. I've now fixed what I think is everything in #320 . What this does:

So I propose we merge that PR, close this issue plus the other one related to link checking, and open a new one to address the longer-term issues of migrating the mailman archive and figuring out what's up with the domains that are timing out. What do you think?

erget commented 1 year ago

Never mind... Something's not right: the job isn't failing when I expect it to. I'll need some help to finalise this. Will request that in the requisite PR.

JonathanGregory commented 1 year ago

I've closed https://github.com/cf-convention/cf-convention.github.io/issues/330 because it was the same errors as this one.

JonathanGregory commented 1 year ago

In issue 345, which is the most recent output of the cron job, @DocOtak wrote

@JonathanGregory @erget Do we want to disable the cron tasks for this?

I've been closing the new ones (like 345) every Monday morning, as a human cron daemon. I don't mind doing that, but equally I don't think it helps to have a new one every week until we've fixed the missing links identified in this edition of the report.

erget commented 1 year ago

@JonathanGregory @DocOtak I agree, actually we could disable this until we get it fixed - we've made progress on it but slowly ;) @DocOtak do you have the rights to disable the cron task, and could it be executed manually in that case?

JonathanGregory commented 1 year ago

As we have probably fixed all the recurrent broken links on the website, we don't need to disable the link-checker, as discussed in this issue. I will therefore close this issue, and we will see what the link-checker has to say when it next executes.

JonathanGregory commented 6 months ago

I'm reopening this because Antonio @cofinoa has reinitiated work on it in https://github.com/cf-convention/cf-convention.github.io/pull/320. Thanks, Antonio and others. It'll be good not to have to close an issue every Monday morning. :smile: Also, I'm closing https://github.com/cf-convention/cf-convention.github.io/issues/447, which deals with the same issue.

cofinoa commented 6 months ago

@erget @JonathanGregory and @sadielbartholomew I have come back to the link checker errors and problems. Sorry if I've missed some related issue that's already open.

I will merge #320, but that doesn't fix all of the missing/timed-out/vanished (broken) links.

I'm preparing a PR to fix/improve the link checker and to silence some permanently broken links and old documents with invalid UTF-8 encoding, on which the link checker fails with an error.

JonathanGregory commented 6 months ago

Thanks for working on this, @cofinoa.

JonathanGregory commented 6 months ago

For reference I am copying here various comments from https://github.com/cf-convention/cf-convention.github.io/pull/486

@cofinoa https://github.com/cf-convention/cf-convention.github.io/pull/486#issuecomment-2075566036

Dear @cf-convention/info-mgmt team, this PR relates to a long-standing issue with the link checker, see: #318 #320

This is a first step towards fixing the issues with the link checker when PRs are made.

The action is triggered when PRs are opened/re-opened and:

  1. first checks that all **/*.md files in the repo have no broken links; if any are broken, the action fails. The resulting output can be seen in the summary of the action run, including when "everything" is "green". From https://github.com/cf-convention/cf-convention.github.io/actions/runs/8821086583?pr=486 :

    Summary

    Status Count
    🔍 Total 593
    ✅ Successful 495
    ⏳ Timeouts 0
    🔀 Redirected 0
    👻 Excluded 98
    ❓ Unknown 0
    🚫 Errors 0
  2. If the .md files are OK, the action checks that building the site with Jekyll works and, if so, it uploads the artifact.

Please take a moment to review and let me know if this fits. If so, I will continue with the PR to incorporate a link check of the built site on a regular basis (i.e. a cron job every Monday), or we can just merge this PR and open a new one for that.

PS: annotating the PR with a comment containing the link-check report is a challenge due to security restrictions on PRs from forks. If PRs come from the same repo (not forks), then PR and issue commenting is possible.

PS2: checking links to GitHub may run into GitHub's HTTP request rate limits.

@larsbarring https://github.com/cf-convention/cf-convention.github.io/pull/486#issuecomment-2075610871

Hi Antonio,

  • the change at vocabularies.md has been tricky because the md file uses HTML tags.

I was recently adding some minor changes to this file, and noticed that there is actually very little markdown and a lot of repetitive html links. I thought that it might be possible to generate this file dynamically during the build process: something like a small [python] script looking through the relevant ../Data/ directories for which versions exist, and then assembling the file from that plus md text fragments, either read from file(s) or stored within the script. Could this be something to look further into (I'm afraid it's beyond my skill set)?
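The idea above could be sketched roughly as follows. This is a hypothetical illustration, not the repo's actual build code; the function name, the version-directory naming pattern, and the URL prefix are all assumptions:

```python
import re
from pathlib import Path

def version_links(data_dir, url_prefix):
    """Find version-numbered subdirectories (e.g. '27' or 'v84') under
    data_dir and return one Markdown link line per version, newest first.

    Hypothetical sketch: the real Data/ layout may differ.
    """
    versions = [
        child.name
        for child in Path(data_dir).iterdir()
        if child.is_dir() and re.fullmatch(r"v?\d+(\.\d+)*", child.name)
    ]
    # Sort numerically on the digits (not lexically), newest version first.
    versions.sort(key=lambda v: [int(p) for p in v.lstrip("v").split(".")],
                  reverse=True)
    return [f"- [Version {v}]({url_prefix}/{v}/)" for v in versions]
```

A Jekyll-native alternative would be a Liquid loop over `site.static_files`, but a standalone script like this is easier to test outside the build.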

@cofinoa https://github.com/cf-convention/cf-convention.github.io/pull/486#issuecomment-2076073653

@larsbarring I have made a new PR at #487 with your suggestion to refactor vocabularies.md

JonathanGregory commented 6 months ago

Thanks for https://github.com/cf-convention/cf-convention.github.io/pull/487, Antonio. I don't fully understand this. Is this problem with the link-checker caused by a link to a markdown page from HTML, which is itself wrapped up as a markdown page? That seems rather convoluted. If the whole page is put in markdown instead, does that resolve it?

JonathanGregory commented 6 months ago

From @cofinoa https://github.com/cf-convention/cf-convention.github.io/pull/487

@larsbarring, as you suggested at PR #486, I have refactored vocabularies.md to actual Markdown instead of HTML, and I have also generated some of the lists automatically.

Jekyll is quite limited at managing data and/or strings; a Jekyll plugin would be needed to improve that, but my Ruby skills are also quite limited.

From @larsbarring https://github.com/cf-convention/cf-convention.github.io/pull/487#issuecomment-2076711737

The lists of links will be automatically generated from the existing version subdirectories under the ../Data/ directory. I think this will simplify the workflow for publishing a new version of the Standard Name Table and the Area Type Table (the versions of the Standardized Regions List are hardcoded?).

Hence pinging @japamment, @efisher008

cofinoa commented 6 months ago

@JonathanGregory

Is this problem with the link-checker caused by a link to a markdown page from HTML, which is itself wrapped up as a markdown page?

It's a problem of linking to a .HTML page which is built from a .MD page. Because the content is HTML, we cannot link to the .MD page, as we do in other .MD pages with MD content.

I have rewritten the HTML content of vocabulary.md to MD content in commit f7f145afd4f236539e67e2c661ffce38594f285a, along with commit 468decec7eac95081bb02b61f3ce3d3447bbe942, which fixes links to .HTML pages that are really .MD pages.

cofinoa commented 6 months ago

@larsbarring

The lists of links will be automatically generated from the existing version subdirectories under the ../Data/ directory. I think this will simplify the workflow for publishing a new version of the Standard Name Table and the Area Type Table (the versions of the Standardized Regions List are hardcoded?).

I have closed PR #487 because the automatic generation of links to ./Data content has some pitfalls that need to be overcome by refactoring some directories in Data/. For example, the directory with v84 of standard-names also has a current directory, which I think is a BUG that needs to be fixed.

cofinoa commented 6 months ago

@JonathanGregory

As we have probably fixed all the recurrent broken links on the website, we don't need to disable the link-checker, as discussed in this issue. I will therefore close this issue, and we will see what the link-checker has to say when it next executes.

There are some temporary issues with some links; for example, I have had to exclude https://mmisw.org/ont in the lychee.toml configuration file (see 65da6b29fc8e01a8435e73b1df174b71cd34f786).

larsbarring commented 6 months ago

@cofinoa OK, I agree that having a closer look at the directory structure under ./Data is worthwhile. I came to the same conclusion when working on the old versions of the standard name table. It also has bearing on this discussion. So let's come back to this idea in a while.

JonathanGregory commented 5 months ago

There was no new broken-links report this morning, which I am very pleased to see. Thanks for suppressing it, @cofinoa! Shall we close this issue, or is it still a work in progress?

cofinoa commented 5 months ago

@JonathanGregory, it's still in progress.

If it's OK, I would like to merge PR #486, which is an intermediate step towards solving this issue.

JonathanGregory commented 5 months ago

That's fine. Let's leave it open then. Thanks.

cofinoa commented 5 months ago

I have created 2 workflows/actions:

  1. check_jekyll_build.yml: an action with 2 main jobs, triggered when a PR is created: A. check links in the Markdown files (./**/*.md); B. check that Jekyll can build the website.
  2. check_links_cron.yml: the other action, which runs on Mondays and also has 2 main jobs: C. check that Jekyll can build the website; D. check links on the site built in job C, and if that fails, a new issue is opened: #490

The exclusion rules are in .lychee/config.toml, which is used by both actions (1 and 2), but we can create separate ones for each action if needed.
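For readers unfamiliar with the mechanics, a weekly cron trigger of this kind looks roughly as follows in GitHub Actions. This is an illustrative sketch under stated assumptions (the lycheeverse/lychee-action step and the config path mirror what this thread describes; the real check_links_cron.yml may differ), not the actual workflow contents:

```yaml
# Illustrative sketch of a Monday link-check workflow, not the repo's file.
name: Weekly link check
on:
  schedule:
    - cron: "0 6 * * 1"   # every Monday at 06:00 UTC
  workflow_dispatch:       # also allows manual runs, as asked earlier in this thread
jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run lychee
        uses: lycheeverse/lychee-action@v1
        with:
          args: --config .lychee/config.toml './**/*.md'
```

The `workflow_dispatch` trigger answers the earlier question about manual execution: a workflow with it can be started from the Actions tab without waiting for the schedule.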

Currently, I have excluded the following URLs:

exclude = [
    # Data/cf-standard-names/
    "http://glossary.ametsoc.org/wiki",                   
    "https://www.unidata.ucar.edu/software/udunits/udunits-current/doc/udunits",
    "https://www.unidata.ucar.edu/software/udunits/udunits-2.2.28/udunits2.html", 
    "https://www.sciencedirect.com/science/article/pii/0967063793901018",
    "https://www.ipcc.ch/ipccreports/tar/wg1/273.htm",
    "http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata",
    "http://gcmd.nasa.gov/Resources/valids",
    #
    "cfeditor.ceda.ac.uk",                                # standard_name_rule, vocabularies, discussion
    "https://mailman.cgd.ucar.edu/pipermail/cf-metadata", # discussion, governance
    "http://mmisw.org/ont", # faq (TIMEOUT)
    "https://mmisw.org/ont", # faq (TIMEOUT)
    "http://www.cgd.ucar.edu/cms/eaton/cf-metadata/clivar_article.pdf", # Data/cf-documents/cf-governance/cf2_whitepaper_final.html
    "http://www.cgd.ucar.edu/cms/eaton/cf-metadata/CF-current.html", # Data/cf-documents/requirements-recommendations
    "https://www.usbr.gov/lc/socal/reports/SMappend_C.pdf", # Data/area-type-table/**/build/area-type-table.html
    "https://cf-trac.llnl.gov/trac/",                     # 2018-Workshop, 2019-Workshop
    "http://mailman.cgd.ucar.edu/pipermail/cf-metadata",  # 2019-Workshop
    "https://www.wonder.me",                              # 2021-Workshop
    "https://figshare.com/account/articles/24633939",     # 2023-Workshop
    "https://figshare.com/account/articles/24633894",     # 2023-Workshop
]

Some of the excluded URLs are spurious broken links, which are only temporarily broken.

Others are permanently broken, and we need to decide what to do about them [1].

Also, I have excluded some paths from checking, mainly because they contain documents with invalid encoding or many broken relative links (i.e. Trac tickets):

exclude_path = [
    "_site/Data/cf-standard-names/docs/guidelines.html",
    "_site/Data/cf-conventions/",
    "_site/Data/Trac-tickets/",
]

regards

[1] For example, for https://www.ipcc.ch/ipccreports/tar/wg1/273.htm

we could link to a capture from the Wayback Machine: https://web.archive.org/web/20181104000136/http://www.ipcc.ch/ipccreports/tar/wg1/273.htm

cofinoa commented 5 months ago

@JonathanGregory

I have improved the weekly cron workflow for the link checker (check_links_cron.yml): now, if an issue with the labels defect, link-checker, report, automated issue and status open exists, the action will insert a new comment into it. If no issue is open with those labels, a new issue is opened.

You can see a sample at issue #493

JonathanGregory commented 5 months ago

That's a very useful improvement. Thanks, Antonio.

cofinoa commented 5 months ago

@JonathanGregory et al., the issue #493 with the broken-link report has been updated, and a new comment has been added to the issue for today's checker cron job: https://github.com/cf-convention/cf-convention.github.io/issues/493#issuecomment-2106970778

I have re-run the checker manually; "new" errors appear, and others disappear. The issue has been updated with the report for this "manual" check: https://github.com/cf-convention/cf-convention.github.io/issues/493#issuecomment-2107398673

IMO, there are 2 pending actions that we need to discuss:

  1. How should we proceed with these spurious check errors? How long must an error persist, and/or how many times must it appear, before we take action on it, i.e. exclude it?
  2. There are already some links and exclusions that need review along the same lines as the previous point.

It might be useful to add this to the agenda for the next meeting of the Information Management Team @cf-convention/info-mgmt

cofinoa commented 5 months ago

I'm closing this to continue discussion at https://github.com/orgs/cf-convention/discussions/320