Webperf-se / webperf_core

webperf-core is an open-source testing suite tailored to help you improve your digital presence, in areas ranging from web performance, security and accessibility to email best practice, through many small improvements.
https://webperf.se/articles/webperf-core/
MIT License
UTF-8 Byte Order Mark Breaks Test #443

Closed rabbtekejos closed 6 months ago

rabbtekejos commented 6 months ago

I have been investigating why our site got a low score in the standard files sub-category and found that webperf can't handle the presence of a UTF-8 BOM (Byte Order Mark).

If the robots.txt file begins with a BOM and its first line contains the sitemap: instruction, webperf will not fetch and process the sitemap.
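To illustrate the robots.txt case, here is a minimal Python sketch. It is not webperf_core's actual parser: the function name extract_sitemap_urls and the example.com URL are hypothetical. It shows how a leading BOM hides a first-line sitemap: entry from a naive prefix check, and how decoding with the standard utf-8-sig codec avoids that.

```python
import codecs


def extract_sitemap_urls(robots_bytes: bytes) -> list[str]:
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt payload."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
    text = robots_bytes.decode("utf-8-sig", errors="replace")
    urls = []
    for line in text.splitlines():
        # Split at the first colon only, so the URL's own colons are kept.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt body that starts with a UTF-8 BOM.
with_bom = codecs.BOM_UTF8 + b"Sitemap: https://example.com/sitemap.xml\nUser-agent: *\n"

# Plain 'utf-8' keeps the BOM as U+FEFF, so the first line no longer
# starts with "sitemap:" and a simple prefix check misses it.
naive_first_line = with_bom.decode("utf-8").splitlines()[0]
naive_match = naive_first_line.lower().startswith("sitemap:")

print(naive_match)
print(extract_sitemap_urls(with_bom))
```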

If the sitemap.xml file begins with a BOM then get_root_element fails to find the root element.
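One way to make root-element detection tolerant of a BOM is to strip it at the byte level before parsing. The sketch below uses only the Python standard library. The helper sitemap_root_tag is a hypothetical stand-in, not the project's get_root_element, whose implementation is not shown in this thread.

```python
import codecs
import xml.etree.ElementTree as ET


def sitemap_root_tag(raw: bytes) -> str:
    """Return the local name of the root element of an XML payload."""
    # Drop a leading UTF-8 BOM, if any, before handing the bytes to the parser.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    root = ET.fromstring(raw)
    # ElementTree reports namespaced tags as '{namespace}name'; keep the name part.
    return root.tag.rsplit("}", 1)[-1]


sitemap = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://localhost:5000</loc></url>"
    b"</urlset>"
)

print(sitemap_root_tag(sitemap))
print(sitemap_root_tag(codecs.BOM_UTF8 + sitemap))
```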

This likely affects other areas of this and other tests as well.

7h3Rabbit commented 6 months ago

@rabbtekejos Please provide at least one URL (preferably 5-10 different ones) that we can reproduce this bug against.

7h3Rabbit commented 6 months ago

Notes for when we have test URLs: this could be caused by missing or malformed encoding info in the response headers, resulting in the wrong encoding being used when the file is read. We use get_http_content() to fetch sitemap(s), and it uses response.text for XML sitemaps. According to the documentation, it returns Unicode and guesses the encoding if it can't be determined from the response headers.
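To show why the chosen encoding matters here, this standard-library sketch decodes the same BOM-prefixed bytes with three codecs. Which codec webperf_core actually ends up using is not established in this thread; the payload is a made-up example.

```python
import codecs

# Hypothetical sitemap bytes as sent by a server that emits a UTF-8 BOM.
payload = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?><urlset/>'

as_utf8 = payload.decode("utf-8")          # BOM survives as U+FEFF
as_utf8_sig = payload.decode("utf-8-sig")  # BOM is removed
as_latin1 = payload.decode("iso-8859-1")   # BOM becomes three stray characters

print(as_utf8.startswith("<"))              # False: text starts with U+FEFF
print(as_utf8_sig.startswith("<"))          # True
print(as_latin1[:3] == "\u00ef\u00bb\u00bf")  # True: the BOM bytes read as Latin-1
```

In all three cases the text after the BOM is readable, which fits the observation that the content can be decoded while code expecting the document to start with "<" still fails.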

rabbtekejos commented 6 months ago

I don't know much about Python or encodings, but the encoding HTTP header I found indicates whether the content is compressed and with which algorithm (brotli, gzip, etc.), and that does not look like the issue. When I added a print of the sitemap_content variable in get_root_element, the text following the BOM was a correct sitemap, so it can be decoded/decompressed.

As for how to reproduce this, here is a minimal ASP.NET 8 application, if that's acceptable:

using System.Text;
using System.Xml;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/", () => Results.Ok("This is the start page, the sitemap can be found under the url /sitemap.xml"));

app.MapGet("/sitemap.xml", async (HttpResponse Response) =>
{
    XmlWriterSettings settings = new()
    {
        Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true), // Toggle the BOM on/off here
        Async = true
    };

    Response.Headers.ContentType = "text/xml";
    XmlWriter writer = XmlWriter.Create(Response.Body, settings);

    await writer.WriteStartDocumentAsync();
    await writer.WriteStartElementAsync(null, "urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
    await writer.WriteAttributeStringAsync("xmlns", "xhtml", null, "http://www.w3.org/1999/xhtml");

    await writer.WriteStartElementAsync(null, "url", null);
    await writer.WriteElementStringAsync(null, "loc", null, "https://localhost:5000");
    await writer.WriteElementStringAsync(null, "lastmod", null, DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss")); // HH = 24-hour clock

    await writer.WriteStartElementAsync("xhtml", "link", null);
    await writer.WriteAttributeStringAsync(null, "rel", null, "alternate");
    await writer.WriteAttributeStringAsync(null, "hreflang", null, "sv");
    await writer.WriteAttributeStringAsync(null, "href", null, "https://localhost:5000");
    await writer.WriteEndElementAsync(); //end xhtml:link

    await writer.WriteEndElementAsync(); //end url

    await writer.WriteEndElementAsync(); // end urlset
    await writer.WriteEndDocumentAsync();

    await writer.FlushAsync();

    return Results.Empty;
}
);

app.Run();

Toggling between true and false on the marked line controls whether a BOM is emitted.

7h3Rabbit commented 6 months ago

@rabbtekejos I would have preferred a URL, but I think I see the problem/missing part in your code example.

You are not specifying the encoding/charset on this line: Response.Headers.ContentType = "text/xml"; which means the receiving party MUST use the default charset (read: us-ascii).

As you are encoding your XML as UTF-8, you need to state in the Content-Type header that the XML uses the utf-8 charset/encoding. If you change that line to the following, it should work:

Response.Headers.ContentType = "text/xml; charset=utf-8";

Let me know if that solves your problem :)
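For reference, a client can read the charset parameter from a Content-Type value roughly as sketched below, using Python's standard email.message module. The helper name charset_from_content_type is hypothetical, and the us-ascii fallback simply mirrors the default described in this comment; it is an assumption, not confirmed behavior of webperf_core or its HTTP library.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "us-ascii") -> str:
    """Return the charset parameter of a Content-Type value, or a default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/xml"))
print(charset_from_content_type("text/xml; charset=utf-8"))
```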

rabbtekejos commented 6 months ago

It doesn't seem like I can get webperf_core working against a local site, so I can't test what happens when I change or remove the Content-Type header.

Removing the BOM from the sitemap in our testing environment and running webperf_core again improved our score.

7h3Rabbit commented 6 months ago

@rabbtekejos As long as it is not protected behind a login, it should be possible to access the local website.

7h3Rabbit commented 6 months ago

Closed, as there was no new info in the issue for a week.