CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Special characters (e.g. umlauts) in page file names are not escaped, causing broken links #619

Open dirk68-fu opened 1 month ago

dirk68-fu commented 1 month ago

Hi,

consider the following two files:

  1. index.html

    <!DOCTYPE html>
    <html lang="de">
    <head>
    <meta charset="UTF-8">
    <link href="./pagefind/pagefind-ui.css" rel="stylesheet">
    <script src="./pagefind/pagefind-ui.js"></script>
    <link href="./styles.css" rel="stylesheet">
    <script>
        window.addEventListener('DOMContentLoaded', (event) => {
            new PagefindUI({
                element: "#search",
                pageSize: 50
            });
        });
    </script>
    </head>
    <body>
    <h1>Pagefind test</h1>
    <div id="search"></div>
    </body>
    </html>

  2. Müsli.html

    <!DOCTYPE html>
    <html lang="de">
    <head><title>Gesundes Müsli</title>
        <meta charset="UTF-8"/>
        <link rel="stylesheet" href="styles.css" type="text/css"/>
    </head>
    <body>
    <h1>Gesundes Müsli</h1>
    Sehr lecker!
    </body>
    </html>

If I then run the command pagefind --force-language de --serve --site . (with the current directory as the site), the resulting page lists "Müsli" in the search results for matching queries, but the result URL contains the raw umlaut "ü" and does not work across platforms (it does work locally). The search result links to http://.../Müsli.html instead of http://.../M%C3%BCsli.html
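For illustration, the expected URL form can be reproduced with the standard encodeURIComponent function. This is only a sketch of percent-encoding in general, not Pagefind's code: the helper name encodePath is hypothetical, and the example assumes the file name uses the precomposed "ü" (U+00FC).

```javascript
// Hypothetical helper (not part of Pagefind): percent-encode every path
// segment while keeping the "/" separators intact.
function encodePath(path) {
  return path
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
}

// "\u00FC" is the precomposed "ü"; written as an escape so the example
// does not depend on how this text itself is encoded.
const encoded = encodePath("/M\u00FCsli.html");
console.log(encoded); // "/M%C3%BCsli.html"
```

Encoding segment by segment keeps the directory structure intact, whereas passing the whole path to encodeURIComponent would also escape the slashes.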

dirk68-fu commented 1 month ago

After writing the above I realised that perhaps this isn't a Pagefind problem after all. In my use case the index was created locally, and something on the way to the server changes the encoding of the file name from the canonical form I use locally to something different. Because I am no expert on this topic, I want to leave this open for now, in case someone has helpful hints on how to fix this.
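One possible explanation for such a mismatch is Unicode normalization: "ü" can be stored either precomposed (NFC, U+00FC) or as "u" followed by a combining diaeresis (NFD, U+0308), and some file systems and transfer tools convert between the two. Whether that is what happens here is an assumption and is not confirmed in this thread. The sketch below only shows that the two forms look identical but percent-encode differently.

```javascript
// Same visible file name in two Unicode normalization forms.
const nfc = "M\u00FCsli.html";     // precomposed "ü" (U+00FC)
const nfd = nfc.normalize("NFD");  // "u" followed by combining U+0308

console.log(encodeURIComponent(nfc)); // "M%C3%BCsli.html"
console.log(encodeURIComponent(nfd)); // "Mu%CC%88sli.html"

// The strings differ until they are normalized to the same form.
console.log(nfc === nfd);                  // false
console.log(nfc === nfd.normalize("NFC")); // true
```

If the index holds one form and the server holds the other, a link can fail even when it is correctly percent-encoded.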

bglw commented 1 month ago

Ah hmm, no, I think Pagefind should fix this. M%C3%BCsli.html is indeed the correct URL encoding of Müsli.html. Pagefind isn't doing this URL encoding though, which is a bug. Will fix.