Open maelle opened 1 year ago
I fear the validator is something that has to be served with Apache: https://askubuntu.com/questions/471523/install-wc3-markup-validator-locally
Based on that, it seems that using it in a workflow would require changing some configuration files.
One would then need to serve both the website under scrutiny and the validator, send the validator the link to the website under scrutiny, and parse the results, which would be an HTML file.
Or maybe serving the validator also exposes an API.
I hope to find some better docs somewhere.
@krlmlr
apparently any instance would have the API https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-GET
I was hoping to find a ready-made action but didn't find one.
I found https://www.npmjs.com/package/html-validator by chance (I was working on some other invalid HTML :joy:), but it would use the API.
The Quarto team doesn't agree that this validator is an authority, but they follow the w3c one.
https://github.com/quarto-dev/quarto-cli/discussions/7489
If the w3c validator is difficult to operate, we could also validate once with the w3c validator, and then come up with exclusions for our validator that lead to a green build.
To recap, why I think validation is important: I've heard that search engines treat well-formatted websites better than crappy ones. Happy to revisit this stance if it's irrelevant or wrong.
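The "validate once, then maintain exclusions" idea above could be sketched as follows. This is only an illustration under assumptions: the `messages` list is made up (abbreviated messages in the shape of the validator's JSON output, with `type` and `message` fields), and the `exclusions` patterns and the `is_excluded()` helper are hypothetical names, not part of any existing tool.

```r
# Hypothetical validator messages, shaped like the `messages` element of the
# validator's JSON output (each entry has a `type` and a `message`).
messages <- list(
  list(type = "error", message = "Duplicate ID \"quarto-bootstrap\"."),
  list(type = "error", message = "An \"img\" element must have an \"alt\" attribute."),
  list(type = "info", message = "Trailing slash on void elements has no effect.")
)

# Hypothetical exclusion list: regular expressions for messages we accept.
exclusions <- c("^Duplicate ID", "^Trailing slash on void elements")

# TRUE if the message text matches at least one exclusion pattern.
is_excluded <- function(msg, patterns) {
  any(vapply(patterns, function(p) grepl(p, msg$message), logical(1)))
}

# Drop excluded messages, then keep only the errors among the rest.
remaining <- Filter(function(m) !is_excluded(m, exclusions), messages)
remaining_errors <- Filter(function(m) identical(m$type, "error"), remaining)

# The build would be green when no non-excluded error is left.
build_is_green <- length(remaining_errors) == 0
```

In this made-up example one error (the missing "alt") survives the exclusions, so the build would not be green.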
A first step would be to identify which pages are modified so as not to send the whole site to the API. :thinking:
Probably not just a Git diff, because a page's metadata might have changed (so Git sees a difference) without the page being worth sending to the API.
Maybe a sitemap thing: download the current sitemap, get the new one, and send the new pages to the API.
To me, detecting changes is independent, and could also be postponed?
We need to know which pages to send to the API.
Current script; something is wrong with how I send the document, as it's not properly detected.
current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()
# quarto::quarto_render()
new_sitemap <- xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)
validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/?out=json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset" = "utf-8"
    ) |>
    httr2::req_body_file(file) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}
Ah, using httr2::curl_translate() helped.
Still not there yet.
current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()
# quarto::quarto_render()
new_sitemap <- xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)
validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |>
    httr2::req_url_query(out = "json") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset" = "utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n")) |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}
validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#>
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#>
#>
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#>
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#>
#>
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#>
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#>
#>
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#>
#> $messages[[4]]$subType
#> [1] "warning"
#>
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."
Created on 2024-02-19 with reprex v2.1.0
The errors make no sense given the actual content of index.html, which means I am sending it in a wrong way.
Indeed, if I use showsource, it shows I sent nothing.
But the dry-run of httr2 shows content length.
I'm putting this aside for now. :disappointed:
The last time https://github.com/validator/validator/wiki/Service-%C2%BB-Input-%C2%BB-POST-body was updated was in 2016, so maybe it's no longer valid?
I tried a bit more without success.
current_sitemap <- xml2::read_xml("https://blog.cynkra.com/sitemap.xml")
current_links <- xml2::xml_find_all(current_sitemap, ".//d1:loc") |>
  xml2::xml_text()
# quarto::quarto_render()
new_sitemap <- xml2::read_xml(file.path("docs", "sitemap.xml"))
new_links <- xml2::xml_find_all(new_sitemap, ".//d1:loc") |>
  xml2::xml_text()
added_links <- setdiff(new_links, current_links)
validate_page <- function(url) {
  file <- file.path("docs", urltools::path(url))
  httr2::request("http://validator.w3.org/nu/") |>
    httr2::req_url_query(out = "json", showsource = "yes", parser = "html5") |>
    httr2::req_method("POST") |>
    httr2::req_headers(
      `Content-Type` = "text/html",
      "charset" = "utf-8"
    ) |>
    httr2::req_body_raw(paste(brio::read_lines(file), collapse = "\n"), "text/html; charset=utf-8") |>
    httr2::req_perform() |>
    httr2::resp_body_json()
}
validate_page(added_links[1])
#> $messages
#> $messages[[1]]
#> $messages[[1]]$type
#> [1] "error"
#>
#> $messages[[1]]$message
#> [1] "The character encoding was not declared. Proceeding using “windows-1252”."
#>
#>
#> $messages[[2]]
#> $messages[[2]]$type
#> [1] "error"
#>
#> $messages[[2]]$message
#> [1] "End of file seen without seeing a doctype first. Expected “<!DOCTYPE html>”."
#>
#>
#> $messages[[3]]
#> $messages[[3]]$type
#> [1] "error"
#>
#> $messages[[3]]$message
#> [1] "Element “head” is missing a required instance of child element “title”."
#>
#>
#> $messages[[4]]
#> $messages[[4]]$type
#> [1] "info"
#>
#> $messages[[4]]$subType
#> [1] "warning"
#>
#> $messages[[4]]$message
#> [1] "Consider adding a “lang” attribute to the “html” start tag to declare the language of this document."
#>
#>
#>
#> $source
#> $source$type
#> [1] "text/html"
#>
#> $source$code
#> [1] ""
Created on 2024-02-26 with reprex v2.1.0
What text are you sending to the API?
A whole HTML file. httr2::req_dry_run() shows the content is not empty... but the API output states I sent nothing.
The file has <!DOCTYPE html> but the API doesn't see it?
The API sees "" apparently.
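One thing that might be worth ruling out, as a guess that is not confirmed anywhere in this thread: the scripts above post to the http:// address. If the server redirects that to https://, libcurl by default turns a redirected POST into a GET, which would drop the body and could explain the empty source. Separately, "charset" is not an HTTP header of its own; the charset belongs inside Content-Type. A minimal sketch under those assumptions, with a made-up helper name `build_validation_request()`, which only builds the request and sends nothing:

```r
library(httr2)

# Hypothetical helper: builds (but does not send) the validation request.
build_validation_request <- function(file) {
  # Read the file as raw bytes so the body is sent exactly as stored on disk.
  html <- readBin(file, what = "raw", n = file.size(file))
  request("https://validator.w3.org/nu/") |>
    req_url_query(out = "json") |>
    # Adding a body makes httr2 use POST; the charset goes in Content-Type.
    req_body_raw(html, type = "text/html; charset=utf-8")
}

# Demo on a throwaway file, so nothing goes over the network.
tmp <- tempfile(fileext = ".html")
writeLines(
  c(
    "<!DOCTYPE html>",
    "<html lang=\"en\"><head><title>t</title></head><body></body></html>"
  ),
  tmp
)
req <- build_validation_request(tmp)
# req_dry_run(req) would print the request; req_perform(req) would send it.
```

Whether this changes the API's answer would need to be checked against the real service.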
Can you upload a file manually to https://validator.w3.org/nu/about.html ?
I'm forgetting again why this is so complicated.
What am I missing?
I wanted to use the API instead of trying to deploy the thing on GHA, but it's not working.
I had been able to use the web interface.
pfff it was actually easy, what was I thinking.
vnu-runtime-image/bin/vnu OPTIONS FILES
vnu-runtime-image/bin/vnu docs/index.html
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":42.1-42.174: error: Duplicate ID “quarto-text-highlighting-styles”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":41.1-41.148: info warning: The first occurrence of ID “quarto-text-highlighting-styles” was here.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":46.1-46.161: error: Duplicate ID “quarto-bootstrap”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":45.1-45.136: info warning: The first occurrence of ID “quarto-bootstrap” was here.
This is due to how Quarto handles dark mode. Both files are present in the source.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":106.3-106.100: info warning: The “type” attribute is unnecessary for JavaScript resources.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":108.1-108.31: info warning: The “type” attribute is unnecessary for JavaScript resources.
This is about <script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js" type="text/javascript"></script> and the lines below.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":258.1-258.113: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":300.1-300.129: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":339.1-339.110: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":375.1-375.117: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":414.1-414.109: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":447.1-447.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":483.1-483.127: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":522.1-522.117: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":557.25-557.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":557.25-557.106: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":592.1-592.110: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":628.1-628.109: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":663.25-663.160: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":695.1-695.121: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":728.1-728.100: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":767.1-767.108: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":802.25-802.161: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":837.1-837.100: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":875.25-875.106: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":875.25-875.106: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":909.25-909.163: error: Element “img” is missing required attribute “src”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":909.25-909.163: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: error: Element “img” is missing required attribute “src”.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: error: An “img” element must have an “alt” attribute, except under certain conditions. For details, consult guidance on providing text alternatives for images.
See https://github.com/quarto-dev/quarto-cli/discussions/6987; I also need to apply https://quarto.org/docs/websites/website-listings.html#listing-fields to the current post.
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":943.25-943.111: info: Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.
This is for lines such as <p class="card-img-top"><img data-src="mountain.jpg" style="height: 150px;" class="thumbnail-image card-img"/></p>
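To see which images trigger the “alt” and “src” errors without rerunning the validator, the rendered HTML can be inspected directly. A rough sketch with xml2, assuming a tiny made-up page that mimics the listing card above (in practice one would read docs/index.html instead); the `report` and `flagged` names are only for illustration:

```r
library(xml2)

# Small stand-in for a rendered page; the first image mimics the listing card.
html <- read_html(paste0(
  "<html><body>",
  "<p><img data-src=\"mountain.jpg\" class=\"thumbnail-image card-img\"/></p>",
  "<p><img src=\"logo.png\" alt=\"Logo\"></p>",
  "</body></html>"
))

imgs <- xml_find_all(html, ".//img")

# xml_attr() returns NA when an attribute is absent.
report <- data.frame(
  src = xml_attr(imgs, "src"),
  data_src = xml_attr(imgs, "data-src"),
  has_alt = !is.na(xml_attr(imgs, "alt"))
)

# Images lacking "alt" or "src" are the ones the validator complains about.
flagged <- report[!report$has_alt | is.na(report$src), ]
```

Here only the lazily loaded image (with data-src but neither src nor alt) ends up in `flagged`.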
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":987.1-987.66: info warning: The “type” attribute is unnecessary for JavaScript resources.
This refers to <script id="quarto-html-after-body" type="application/javascript">
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":1131.54-1151.17: info warning: Document uses the Unicode Private Use Area(s), which should not be used in publicly exchanged documents. (Charmod C073)
"file:/home/maelle/Documents/cynkra/cynkrablog/docs/index.html":1527.1-1527.17: error: Element “script” must not have attribute “async” unless attribute “src” is also specified or unless attribute “type” is specified with value “module”.
Maybe <script async="">
@DivadNojnarg do the very last two lines of the comment above make sense to you? How could we tweak the script you created to avoid the validator error?
I'll come back to this issue next week, now that I can run the validator. :smile_cat:
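For later triage, the text output shown above could be turned into a table. The sketch below parses lines of the form "file":line.col-line.col: level: message with base R. It assumes every message carries such a position range, which holds for the output in this thread but may not hold for all messages; the sample lines use a shortened, hypothetical path and plain ASCII quotes.

```r
# Two sample lines in the same shape as the output above.
vnu_output <- c(
  "\"file:/site/docs/index.html\":42.1-42.174: error: Duplicate ID \"quarto-bootstrap\".",
  "\"file:/site/docs/index.html\":106.3-106.100: info warning: The \"type\" attribute is unnecessary for JavaScript resources."
)

# Groups: file, start line, start column, end line, end column, level, message.
pattern <- paste0(
  "^\"([^\"]+)\":([0-9]+)\\.([0-9]+)-([0-9]+)\\.([0-9]+): ",
  "(error|info warning|info): (.*)$"
)

parsed <- regmatches(vnu_output, regexec(pattern, vnu_output))
# Lines that do not match the pattern yield character(0); drop them.
parsed <- Filter(length, parsed)

# Element 1 of each match is the full line, so the groups start at index 2.
results <- do.call(rbind, lapply(parsed, function(m) {
  data.frame(
    file = m[2],
    line = as.integer(m[3]),
    level = m[7],
    message = m[8]
  )
}))

n_errors <- sum(results$level == "error")
```

If the installed vnu supports a JSON output format, parsing that instead of the text output would likely be more robust than a regular expression.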
https://github.com/quarto-dev/quarto-cli/discussions/7489