Closed JannisBush closed 3 months ago
This feels quite verbose when so many fields are empty:
Can we only include the fields when they are present to reduce the storage and query size?
Don't know about the storage and query size of empty arrays. The keys are always the same, so there might be some optimization possible.
However, I adapted the query to only keep non-empty fields. Hope that does not make the query more complex.
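A minimal sketch of that adaptation (hypothetical helper, not the actual metric code): drop empty arrays and empty strings from the result object before serializing, so absent security.txt fields take no space.

```javascript
// Hypothetical sketch: remove empty arrays and empty strings before the
// result object is serialized. Booleans like "signed": false are kept,
// since false is a meaningful value rather than an absent field.
function pruneEmptyFields(data) {
  const pruned = {};
  for (const [key, value] of Object.entries(data)) {
    const isEmptyArray = Array.isArray(value) && value.length === 0;
    const isEmptyString = value === '';
    if (!isEmptyArray && !isEmptyString) {
      pruned[key] = value;
    }
  }
  return pruned;
}

// Example: only non-empty fields survive.
const result = pruneEmptyFields({
  contact: ['mailto:security@example.com'],
  acknowledgments: [],
  preferred_languages: '',
  signed: false,
});
console.log(JSON.stringify(result));
// → {"contact":["mailto:security@example.com"],"signed":false}
```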
"/.well-known/security.txt": {
"found": false,
"data": {
"status": 404,
"redirected": false,
"url": "https://example.com/.well-known/security.txt",
"signed": false,
"other": [
[
"background-color",
"#f0f0f2;"
],
[
"margin",
"0;"
],
[
"padding",
"0;"
],
This was not great, as the inline CSS was detected as "other" directives.
I now also save the content-type (it MUST be `text/plain` according to the spec, but it is unclear whether all sites follow the spec; there are probably quite a few sites that do not set any content-type header 🤔).
Additionally, I only save the data if the status is OK (`r.ok` has to be true).
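A sketch of that guard, assuming a fetch-style response object (names are illustrative, not the actual metric code):

```javascript
// Hypothetical guard: only hand the body to the security.txt parser when the
// response status is OK. r.ok is true for any 2xx status, so example.com's
// 404 HTML error page is skipped.
function shouldParse(response) {
  return response.ok;
}

// Hypothetical helper: record status and content-type for later analysis.
function describeResponse(response) {
  return {
    status: response.status,
    // May be null/undefined when the server sets no Content-Type header.
    content_type: response.headers.get('content-type'),
  };
}

// example.com answers /.well-known/security.txt with a 404 HTML page:
const notFound = {
  ok: false,
  status: 404,
  headers: new Map([['content-type', 'text/html; charset=UTF-8']]),
};
console.log(shouldParse(notFound)); // → false
```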
This fixes the case of example.com, which returns an HTML document with status 404. However, sites that return their landing page or similar at `/.well-known/security.txt` with a 200 status code would still be parsed.
I am unsure how to best handle such cases without introducing false negatives.
Ideas:
- Only treat the response as a security.txt file if the content-type is `text/plain` (misses sites that do not set any content-type header or set another one)
- Discard the data if "other" values such as `<!DOCTYPE html>` or `<html>` are present (misses all-lowercase sites)
- Maybe requiring `r.ok` to be true is already enough 🤔

@tunetheweb Can this be merged before the crawl starts tomorrow?
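The HTML-marker idea above could be sketched like this (hypothetical helper, illustrative only; the case handling is exactly the pitfall noted above, so both sides are lowercased here):

```javascript
// Hypothetical heuristic: treat the file as an HTML page (not a real
// security.txt) when typical HTML markers show up among the unrecognized
// "other" entries, which are assumed to be [name, value] pairs.
const HTML_MARKERS = ['<!doctype html', '<html'];

function looksLikeHtml(otherEntries) {
  return otherEntries.some(([name, value]) =>
    HTML_MARKERS.some((marker) =>
      // Lowercasing avoids missing <!DOCTYPE html> vs <!doctype html>.
      `${name}:${value}`.toLowerCase().includes(marker)
    )
  );
}

console.log(looksLikeHtml([['<!DOCTYPE html><html', '']])); // → true
console.log(looksLikeHtml([['contact', 'mailto:security@example.com']])); // → false
```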
As written above, there might still be a very small number of sites with incorrect "other" values. However, I think this does not pose a major problem:
Updated custom metric for https://github.com/HTTPArchive/almanac.httparchive.org/issues/3604
Description of the changes: Update the parsing of
`.well-known/security.txt`
to take all newly defined fields into account, save undefined/future/custom fields, and add a basic check of whether the file is valid (required fields exist and no field that is only allowed to occur once occurs more than once).

Test websites: