cynkra / cynkrablog

Source of the cynkra blog
https://cynkra.com/blog

html-validate #26

Open maelle opened 1 year ago

maelle commented 1 year ago

https://github.com/quarto-dev/quarto-cli/discussions/7489

maelle commented 11 months ago

I fear the validator package is something that has to be run behind Apache: https://askubuntu.com/questions/471523/install-wc3-markup-validator-locally

maelle commented 11 months ago

Based on that, it seems that using that package in a workflow would require changing some configuration files.

One would then need to serve both the website under scrutiny and the validator, send the link to the website under scrutiny to the validator, and parse the results, which would be an HTML file.

Or maybe, if one serves the validator, there's an API.

I hope to find some better docs somewhere.

@krlmlr

maelle commented 11 months ago

https://validator.w3.org/docs/users.html#Installing

maelle commented 11 months ago

apparently any instance would have the API https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-GET
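
A minimal sketch of what a GET check could look like from R, assuming the `doc` and `out` query parameters described on the wiki page linked above. The function name `build_get_check()` is made up for illustration, and the request is only built here, not sent.

```r
library(httr2)

# Hypothetical helper: build a GET request asking a Nu validator instance
# to fetch and check a publicly reachable page.
# `doc` is the URL of the page to check, `out = "json"` asks for JSON output.
build_get_check <- function(page_url,
                            validator = "https://validator.w3.org/nu/") {
  request(validator) |>
    req_url_query(doc = page_url, out = "json")
}

req <- build_get_check("https://blog.cynkra.com/")

# To actually send it (needs network access):
# messages <- req |> req_perform() |> resp_body_json() |> _$messages
```

This only works for pages that are already online, so it would check the deployed site rather than a fresh local render.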

maelle commented 11 months ago

I was hoping to find a ready-made action but didn't find one.

maelle commented 11 months ago

found https://www.npmjs.com/package/html-validator by chance (was working on some other invalid HTML :joy: )

maelle commented 11 months ago

but it would use the API

pat-s commented 11 months ago

What's wrong with https://github.com/cynkra/cynkraweb/blob/01998ff7e0574cb23ff4ca5f8bf27da6922d3e34/.github/workflows/s3-push.yaml#L110-L112?

krlmlr commented 11 months ago

The Quarto team doesn't agree that this validator is an authority, but they follow the w3c one.

https://github.com/quarto-dev/quarto-cli/discussions/7489

If the w3c validator is difficult to operate, we could also validate once with the w3c validator, and then come up with exclusions for our validator that lead to a green build.
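
A rough sketch of the exclusion idea, assuming the validator messages arrive as a list of entries with `type` and `message` fields (the shape of the validator's JSON output). The patterns and the helper name `drop_excluded()` are hypothetical placeholders, not an agreed list.

```r
# Hypothetical exclusion patterns: messages we would accept as known noise.
exclusions <- c(
  "^Duplicate ID",
  "attribute is unnecessary for JavaScript resources"
)

# Keep only the messages that match none of the exclusion patterns.
drop_excluded <- function(messages, patterns) {
  keep <- vapply(messages, function(m) {
    !any(vapply(patterns, grepl, logical(1), x = m$message))
  }, logical(1))
  messages[keep]
}

# The build would then fail only if an error survives the filter.
has_errors <- function(messages) {
  any(vapply(messages, function(m) identical(m$type, "error"), logical(1)))
}
```

Usage would be something like `has_errors(drop_excluded(messages, exclusions))`, with the exclusion list kept in the repository so that changes to it are reviewed.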

To recap, why I think validation is important: I've heard that search engines treat well-formatted websites better than crappy ones. Happy to revisit this stance if it's irrelevant or wrong.

maelle commented 9 months ago

A first step would be to identify which pages are modified so as not to send the whole site to the API. :thinking:

Probably not just a Git thing because a page's metadata might have changed (so different for Git) without it being worth sending it to the API.

Maybe a sitemap thing. Download the current sitemap, get the new one, send the new pages to the API.

krlmlr commented 9 months ago

To me, detecting changes is independent, and could also be postponed?

maelle commented 9 months ago

We need to know which pages to send to the API.

maelle commented 9 months ago

Current script. Something is wrong with how I send the document, as it's not properly detected.

current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()

new_sitemap <-  xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/?out=json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset"="utf-8"
    ) |>
    httr2::req_body_file(file) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}

maelle commented 9 months ago

https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-POST-body

maelle commented 9 months ago

ah, using httr2::curl_translate() helped

maelle commented 9 months ago

Still not there yet.

current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()

new_sitemap <-  xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |> 
    httr2::req_url_query(out = "json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset"="utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n")) |>
    httr2::req_perform() |>
    httr2::resp_body_json()

}

validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#> 
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#> 
#> 
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#> 
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#> 
#> 
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#> 
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#> 
#> 
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#> 
#> $messages[[4]]$subType
#> [1] "warning"
#> 
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."

Created on 2024-02-19 with reprex v2.1.0

maelle commented 9 months ago

The errors make no sense given the actual content of index.html, which means I am sending it in a wrong way.

maelle commented 9 months ago

Indeed, if I use showsource, it shows I sent nothing.

maelle commented 9 months ago

But the dry-run of httr2 shows content length.

maelle commented 9 months ago

I'm putting this aside for now. :disappointed:

maelle commented 8 months ago

The last time https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-POST-body was updated was in 2016, so maybe it's no longer valid?

maelle commented 8 months ago

I tried a bit more without success.

current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()

# quarto::quarto_render()

new_sitemap <-  xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)

validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |>
    httr2::req_url_query(out = "json", showsource = "yes", parser = "html5") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset"="utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n"), "text/html; charset=utf-8") |>
    httr2::req_perform() |>
    httr2::resp_body_json()

}

validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#> 
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#> 
#> 
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#> 
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#> 
#> 
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#> 
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#> 
#> 
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#> 
#> $messages[[4]]$subType
#> [1] "warning"
#> 
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."
#> 
#> 
#> 
#> $source
#> $source$type
#> [1] "text/html"
#> 
#> $source$code
#> [1] ""

Created on 2024-02-26 with reprex v2.1.0
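
An untested variant that might be worth trying. It differs from the attempts above in two ways: it uses the `https://` endpoint, and it puts the charset inside the `Content-Type` value instead of a separate `charset` header, which is not a standard HTTP header. It is only a guess that the empty `source$code` comes from the POST body being lost on a redirect from `http://` to `https://`; nothing in this thread confirms that. The helper name `build_validation_request()` is made up.

```r
library(httr2)

# Hypothetical helper: build (but do not send) a POST request whose body
# is the HTML file itself.
# Assumptions, both untested: https endpoint, charset set via Content-Type.
build_validation_request <- function(file,
                                     validator = "https://validator.w3.org/nu/") {
  # Read the file as raw bytes so the body is sent exactly as rendered.
  html <- readBin(file, what = "raw", n = file.size(file))
  request(validator) |>
    req_url_query(out = "json") |>
    req_method("POST") |>
    req_body_raw(html, type = "text/html; charset=utf-8")
}

# Inspect what would be sent, then send it (needs network access):
# req <- build_validation_request(file.path("docs", "index.html"))
# req_dry_run(req)
# req |> req_perform() |> resp_body_json()
```

If the body still arrives empty with this request, the redirect guess is wrong and the problem lies elsewhere.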

krlmlr commented 8 months ago

What text are you sending to the API?

maelle commented 8 months ago

a whole HTML file. httr2::req_dry_run() shows the content is not empty... but the API output states I sent nothing.

krlmlr commented 8 months ago

The file has <!DOCTYPE html> but the API doesn't see it?

maelle commented 8 months ago

the API sees "" apparently.

krlmlr commented 8 months ago

Can you upload a file manually to https://validator.w3.org/nu/about.html ?

I'm forgetting again why this is so complicated.

What am I missing?

maelle commented 8 months ago

I wanted to use the API instead of trying to deploy the thing on GHA, but it's not working.

I had been able to use the web interface.

maelle commented 8 months ago

pfff it was actually easy, what was I thinking.

maelle commented 8 months ago

vnu-runtime-image/bin/vnu docs/index.html
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":42.1-42.174: error: Duplicate ID “quarto-text-highlighting-styles”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":41.1-41.148: info warning: The first occurrence of ID “quarto-text-highlighting-styles” was here.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":46.1-46.161: error: Duplicate ID “quarto-bootstrap”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":45.1-45.136: info warning: The first occurrence of ID “quarto-bootstrap” was here.

This is due to how Quarto handles dark mode. Both files are present in the source.

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":106.3-106.100: info warning: The “type” attribute is unnecessary for JavaScript resources.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":108.1-108.31: info warning: The “type” attribute is unnecessary for JavaScript resources.

This is about <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script> and the lines below

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":258.1-258.113: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":300.1-300.129: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":339.1-339.110: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":375.1-375.117: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":414.1-414.109: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":447.1-447.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":483.1-483.127: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":522.1-522.117: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":557.25-557.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":557.25-557.106: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":592.1-592.110: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":628.1-628.109: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":663.25-663.160: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":695.1-695.121: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":728.1-728.100: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":767.1-767.108: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":802.25-802.161: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":837.1-837.100: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":875.25-875.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":875.25-875.106: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":909.25-909.163: error: Element “img” is missing required attribute “src”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":909.25-909.163: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: error: Element “img” is missing required attribute “src”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.

See https://github.com/quarto-dev/quarto-cli/discussions/6987. I also need to apply https://quarto.org/docs/websites/website-listings.html#listing-fields to the current post.

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.

This is for lines such as <p class="card-img-top"><img data-src="mountain.jpg" style="height: 150px;" class="thumbnail-image card-img"/></p>

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":987.1-987.66: info warning: The “type” attribute is unnecessary for JavaScript resources.

This refers to <script id="quarto-html-after-body" type="application/javascript">

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":1131.54-1151.17: info warning: Document uses the Unicode Private Use Area(s), which should not be used in publicly exchanged documents. (Charmod C073)

"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":1527.1-1527.17: error: Element “script” must not have attribute “async” unless attribute “src” is also specified or unless attribute “type” is specified with value “module”.

Maybe <script async="">
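
To get an overview of output like the above, here is a small sketch that parses the validator's lines into a data frame. It assumes one message per line in the form `"file":line.col-line.col: type: message`, as in the output shown in this comment. The function name `parse_vnu_lines()` is made up.

```r
# Hypothetical helper: turn vnu's text output into a data frame.
# Lines that do not match the expected shape are silently dropped.
parse_vnu_lines <- function(lines) {
  pattern <- paste0(
    '^"([^"]+)":([0-9]+)\\.([0-9]+)-([0-9]+)\\.([0-9]+): ',
    "(error|info warning|info): (.*)$"
  )
  hits <- regmatches(lines, regexec(pattern, lines))
  # A full match has 8 elements: the whole line plus 7 capture groups.
  hits <- hits[lengths(hits) == 8]
  data.frame(
    file = vapply(hits, `[`, character(1), 2),
    line = as.integer(vapply(hits, `[`, character(1), 3)),
    type = vapply(hits, `[`, character(1), 7),
    message = vapply(hits, `[`, character(1), 8),
    stringsAsFactors = FALSE
  )
}

# Example: count messages per type.
# out <- readLines("vnu-output.txt")  # hypothetical file with saved output
# table(parse_vnu_lines(out)$type)
```

Grouping by `message` instead of `type` would show at a glance that most errors here are the missing `alt` attributes.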

maelle commented 8 months ago

@DivadNojnarg do the very last two lines of the comment above make sense to you? How could we tweak the script you created to avoid the validator error?

maelle commented 8 months ago

I'll come back to this issue next week, now that I can run the validator. :smile_cat: