Closed rabbtekejos closed 6 months ago
@rabbtekejos Please provide one (preferably 5-10 different) url(s) we can reproduce this bug against.
Notes for when we have test URLs: It could be because of missing/malformed encoding info in the response headers, resulting in the wrong encoding being used when reading the file. We use get_http_content to fetch the sitemap(s), and it uses response.text for XML sitemaps. According to the documentation, it uses unicode IF the encoding can't be determined from the response headers.
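To illustrate the suspicion above, here is a minimal Python sketch of how the same BOM-prefixed bytes decode under different codecs. The sample bytes are made up for illustration, and ISO-8859-1 only stands in for whatever fallback a client might pick when the header carries no charset; this is not webperf_core's actual code.

```python
import codecs

# Hypothetical sitemap body as it might arrive on the wire: UTF-8 BOM + XML.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?><urlset/>'

# Single-byte fallback: the three BOM bytes turn into three junk characters.
as_latin1 = raw.decode("iso-8859-1")

# Plain UTF-8: the BOM survives as the invisible character U+FEFF.
as_utf8 = raw.decode("utf-8")

# "utf-8-sig" drops a leading BOM, so the text starts at the XML declaration.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_latin1[:5]))
print(repr(as_utf8[:3]))
print(repr(as_utf8_sig[:5]))
```

In all three cases the XML after the BOM is intact; only the first characters differ, which would matter to any code that expects the text to begin with `<`.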
I don't know much about Python or encodings, but the encoding HTTP header I found tells whether the content is compressed and with which algorithm (brotli, gzip, etc.), and that does not look like the issue. When I added a print of the sitemap_content variable in get_root_element, the text following the BOM was a correct sitemap, so it can be decoded/decompressed.
As for how to reproduce this, here is a minimal ASP.NET 8 application, if that's acceptable:
using System.Text;
using System.Xml;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/", () => Results.Ok("This is the startpage, the sitemap can be found under the url /sitemap.xml"));

app.MapGet("/sitemap.xml", async (HttpResponse Response) =>
    {
        XmlWriterSettings settings = new()
        {
            Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true), // Toggle the BOM on/off here
            Async = true
        };
        Response.Headers.ContentType = "text/xml";
        XmlWriter writer = XmlWriter.Create(Response.Body, settings);
        await writer.WriteStartDocumentAsync();
        await writer.WriteStartElementAsync(null, "urlset", "http://www.sitemaps.org/schemas/sitemap/0.9");
        await writer.WriteAttributeStringAsync("xmlns", "xhtml", null, "http://www.w3.org/1999/xhtml");
        await writer.WriteStartElementAsync(null, "url", null);
        await writer.WriteElementStringAsync(null, "loc", null, "https://localhost:5000");
        // HH is the 24-hour clock; hh would emit 12-hour values without AM/PM
        await writer.WriteElementStringAsync(null, "lastmod", null, DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss"));
        await writer.WriteStartElementAsync("xhtml", "link", null);
        await writer.WriteAttributeStringAsync(null, "rel", null, "alternate");
        await writer.WriteAttributeStringAsync(null, "hreflang", null, "sv");
        await writer.WriteAttributeStringAsync(null, "href", null, "https://localhost:5000");
        await writer.WriteEndElementAsync(); // end xhtml:link
        await writer.WriteEndElementAsync(); // end url
        await writer.WriteEndElementAsync(); // end urlset
        await writer.WriteEndDocumentAsync();
        await writer.FlushAsync();
        return Results.Empty;
    }
);

app.Run();
Toggling between true and false at the marked line controls whether a BOM is present.
@rabbtekejos I would have preferred a URL, but I think I see the problem/missing part in your code example.
You are not specifying encoding/charset on this line:
Response.Headers.ContentType = "text/xml";
As a result, the receiving party MUST use the default charset (read: us-ascii).
As you are encoding your XML with utf-8, you need to specify that the XML is using the utf-8 charset/encoding in the Content-Type.
If you change that line to the following it should work:
Response.Headers.ContentType = "text/xml; charset=utf-8";
Let me know if that solves your problem :)
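For reference, a minimal Python sketch of what a receiving client sees in the two header variants above. The helper name `charset_from_content_type` is hypothetical and only uses the standard library's header parsing; it is not how webperf_core or any particular HTTP client reads the header.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset parameter, or None.
    return msg.get_content_charset()


# Without a charset parameter the client has nothing to go on and must fall back.
print(charset_from_content_type("text/xml"))
# With the parameter, the encoding is stated explicitly.
print(charset_from_content_type("text/xml; charset=utf-8"))
```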
It doesn't seem like I can get webperf_core working when running against a local site, so I can't test what happens when I change or remove the Content-Type header.
Removing the BOM from sitemap in our testing environment and running webperf_core improved our score.
@rabbtekejos As long as it is not protected behind a login, it should be possible to access the local website.
Closed as there was no new info in the issue for a week.
I have been investigating why our site got a low score in the standard files sub-category and found that webperf can't handle the presence of a UTF-8 BOM (Byte Order Mark).
If the robots.txt file begins with a BOM and the first row has the sitemap: instruction then webperf will not fetch and process the sitemap.
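A rough Python sketch of the robots.txt case, assuming the file has already been decoded to text and the BOM survived as U+FEFF. The function `find_sitemaps` is hypothetical and written only to show the effect; it is not taken from webperf.

```python
def find_sitemaps(robots_txt: str) -> list[str]:
    """Collect the URLs of all 'Sitemap:' lines, ignoring a leading BOM."""
    urls = []
    # Without this lstrip, the first line would read '\ufeffSitemap: ...'
    # and its key would not compare equal to 'sitemap'.
    for line in robots_txt.lstrip("\ufeff").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content with a BOM in front of the first line.
sample = "\ufeffSitemap: https://localhost:5000/sitemap.xml\nUser-agent: *\nDisallow:\n"
print(find_sitemaps(sample))
```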
If the sitemap.xml file begins with a BOM then get_root_element fails to find the root element.
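One possible way to make the sitemap case tolerant of a BOM, sketched in Python with the standard library XML parser. The function `parse_sitemap_root` is a hypothetical stand-in and makes no claim about how get_root_element is actually implemented; the sample bytes are made up.

```python
import codecs
import xml.etree.ElementTree as ET


def parse_sitemap_root(content: bytes) -> ET.Element:
    """Parse sitemap bytes and return the root element, dropping a leading UTF-8 BOM."""
    if content.startswith(codecs.BOM_UTF8):
        content = content[len(codecs.BOM_UTF8):]
    return ET.fromstring(content)


# Hypothetical sitemap body: UTF-8 BOM followed by a one-URL sitemap.
sample = codecs.BOM_UTF8 + (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://localhost:5000</loc></url>"
    b"</urlset>"
)
root = parse_sitemap_root(sample)
print(root.tag)
```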
This likely affects other areas of this and other tests as well.