department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html
282 stars 200 forks source link

Link checking may be missing broken internal links #28139

Closed swirtSJW closed 1 year ago

swirtSJW commented 3 years ago

The link checker that runs on the content-build has a method "isExternal" that is supposed to prevent scanning of external links.
https://github.com/department-of-veterans-affairs/content-build/blob/048df01a6b55f6dab42d43d332f3b8280d2e6ed7/src/site/stages/build/plugins/modify-dom/check-broken-links/helpers/isBrokenLink.js#L14 The problem is the name is misleading. It is not exempting external links, it is exempting links with schemes(protcols).

The method should be aligned with its name to return TRUE for any link that is pointing to an actual external site.

Pattern Return
no scheme FALSE
domain = va.gov FALSE
domain = www.va.gov FALSE
other subdomains of va.gov ~?~ edited 8/5/22: FALSE
all other domains TRUE

The risk in leaving it as it is, is that some links that are internal are actually not being tested and as a result are not being raised if they are broken. Example: A link made to '/not/a/page/' would be checked, but a link to 'https://va.gov/not/a/page` would not be checked. Copy and pasting from the actual website or older Teamsite content into the CMS can easily result in the latter pattern.

kevwalsh commented 3 years ago

I wonder how many broken links are being missed. Before any fix gets merged, it would be great to get a list of any that have been missed. A number of teams may need to set time aside to jump in: Public Websites, Facilities, VAMC upgrade, Decision Tools (Forms), and CMS team.

swirtSJW commented 2 years ago

This was piped into some request process here https://github.com/department-of-veterans-affairs/va.gov-team/issues/32056

wesrowe commented 2 years ago

@timwright12 or anyone seeing this,

We (Public Websites team) are trying to get a better handle on when Form PDF links get broken, so that we can respond sooner for Veterans. @swirtSJW pointed me to this issue. Fixing this would help us utilize some broken-link checking that is already built, but which is currently not checking some Form URLs.

I would appreciate word on whether this can get picked up soon for implementation.

timwright12 commented 2 years ago

attn @rmessina1010 @laineymajor

we’ll need to get some clarity on ownership of this. Last we tried I think we were unable to get any consistency in replicating issue as it seems to randomly pop up

rmessina1010 commented 2 years ago

@swirtSJW , @wesrowe

I think I have a possible solution, but I got a questions about the desired behavior of isExternal Other subdomains …”?” Would mean? Do you want a non boolean value, if so truthy/falsy, null, undefined? Are you asking the us what we think should be returned?

swirtSJW commented 2 years ago

@rmessina1010 I think the question mark I put in there just meant that it had not been decided whether we would treat a subdomain of va.gov as an internal link.

Now that we consider it more and have a case where links to pfd forms on https://vba.va.gov/bldh/blah.pdf are broken and we never get alerted is a bad veteran experience. So I think any link to *.va.gov ought to be checked , so they should be treated as internal (return FALSE).

swirtSJW commented 2 years ago

I just edited the table in the description to reflect ^^

timwright12 commented 2 years ago

Yeah, there are a ton of *.va.gov URLs that aren't public (e.g. http://jenkins.vfs.va.gov/). Is there any possibility of getting this as a pre-publish feature in Drupal? Only asking because it actually takes a ton of processing to do it in build.

To actually check if a link is broken we'd have to make an Ajax call to the URL, which will really slow the build down and we might hit CORS issues going to a subdomain (which would also happen in a Drupal-check).

swirtSJW commented 2 years ago

We do check as publish, We check all external links being saved. But that only catches on edit... not on if the link breaks because somebody moves something outside of the CMS.

Example:

I see what you are saying though that right now this check of links is just in code, not an actual CURL. We could set up something in the CMS to do a daily curl of external links. So maybe we don't solve the subdomain problem with this issue. It is still worth closing the hole with the other cases though.

swirtSJW commented 2 years ago

Yeah, there are a ton of *.va.gov URLs that aren't public (e.g. http://jenkins.vfs.va.gov/)

If they aren't public they should not be appearing in the CMS content. In either case of the check, they would slip through.

timwright12 commented 2 years ago

We could set up something in the CMS to do a daily curl of external links.

Good call there; we can probably do this in a github action CRON as well. This is all in a "modify DOM" step in the build that takes forever, so anything we can do to slim that down would be awesome

swirtSJW commented 2 years ago

We have something coming up that actually creates an entity for each external link so that not only will we have a deduped list of external links, but also the ability to fix it once and have it update everywhere in the CMS. SO checking them could move to that system. (not ready yet but coming)

rmessina1010 commented 2 years ago

@swirtSJW, ok. Just in case it's still relevant, I was asking because I was thinking of a solution changing the logic behind the isExternal variable to exclude: 'va.gov' and ALL possible subdomains + any link with no scheme https://github.com/department-of-veterans-affairs/content-build/pull/1246

rmessina1010 commented 1 year ago

@swirtSJW was this issue officially resolved?

rmessina1010 commented 1 year ago

Expired. closing.

swirtSJW commented 10 months ago

I'm a little confused. This got closed without the related code in the PR being merged https://github.com/department-of-veterans-affairs/content-build/pull/1246