Unable to forceLanguage via JS API

c-w commented 5 months ago

Problem

Using the JS API to create an index, forceLanguage doesn't seem to have any effect.

Repro

Steps:

Save the following file as test.mjs
Run node test.mjs
Open the browser at http://localhost:3000/index.html
Search for "shit"

Actual behavior:

Document containing "shy" is returned (stemming is applied)

Expected behavior:

No document is returned (stemming isn't applied)

import http from "node:http";
import { rm, mkdir, writeFile } from "node:fs/promises";
import { createReadStream, stat } from "fs";
import path from "node:path";
import * as pagefind from "pagefind";
import { fileURLToPath } from "url";
import mime from "mime";

void async function() {
  const domain = "http://foo.com";

  const { index } = await pagefind.createIndex({
    forceLanguage: "unknown"
  });

  await index.addCustomRecord({
    url: domain + "/shy.html",
    content: "not shy of using words",
    language: "en"
  });

  const outputPath = path.join(path.dirname(fileURLToPath(import.meta.url)), "demo");
  await rm(outputPath, { force: true, recursive: true });
  await mkdir(outputPath);
  await index.writeFiles({ outputPath });

  await writeFile(path.join(outputPath, "index.html"), `
<link href="/pagefind-ui.css" rel="stylesheet">
<script src="/pagefind-ui.js"></script>
<div id="search"></div>
<script>
    window.addEventListener("DOMContentLoaded", () => {
        new PagefindUI({ element: "#search", showSubResults: true });
    });
</script>
  `, "utf-8");

  const server = http.createServer((req, res) => {
    const url = new URL(domain + req.url);
    const filePath = path.join(outputPath, url.pathname);
    stat(filePath, (err, stats) => {
      if (err || !stats.isFile()) {
        res.writeHead(404);
        res.end("not found");
      } else {
        res.writeHead(200, {
          "Content-Length": stats.size,
          "Content-Type": mime.getType(filePath)
        });
        createReadStream(filePath).pipe(res);
      }
    });
  });

  server.listen(3000);
}();

Work-around

Applying the following patch fixes the problem. However, according to the documentation, I'd expect to be able to set forceLanguage once in the top-level configuration rather than on every document. Perhaps the documentation should be updated, or the top-level configuration item should take precedence over the document-level value.

@@ -9,14 +9,12 @@ import mime from "mime";
 void async function() {
   const domain = "http://foo.com";

-  const { index } = await pagefind.createIndex({
-    forceLanguage: "unknown"
-  });
+  const { index } = await pagefind.createIndex({});

   await index.addCustomRecord({
     url: domain + "/shy.html",
     content: "not shy of using words",
-    language: "en"
+    language: "unknown",
   });

   const outputPath = path.join(path.dirname(fileURLToPath(import.meta.url)), "demo");

Context

Pagefind version: 1.0.4

bglw commented 5 months ago

Ah, good find!

The addCustomRecord() flow is stepping around the function that sets this — using addHTMLFile() would override the language as you're expecting.

Will fix so that it overrides both cases 👍

c-w commented 5 months ago

There's another strange behavior I noticed related to stemming. If we add a few more tests to the example above:

  await index.addCustomRecord({
    url: domain + "/c.html",
    content: "industrialist and General Motors co-founder William C. Durant",
    language: "unknown"
  });

  await index.addCustomRecord({
    url: domain + "/p.html",
    content: "George P. Knapp",
    language: "unknown"
  });

Now searching for "poop" or "crap" will match the single-letter tokens P and C, which is quite unexpected to me.

bglw commented 5 months ago

PR created for the language fix + test case: https://github.com/CloudCannon/pagefind/pull/552

Re: the strange behavior, that's currently intentional, though it indeed isn't the most useful here. Pagefind really likes giving some result over nothing. One way it does that is to trim the search term back until it finds a term that would match; the idea is that if you type generalx, it gets trimmed back to general. There's no escape hatch on this though, so it will trim it back to one character if need be.

It's an open area for improvement — hopefully one day getting some better typo tolerance features in place will allow us to ease back on this one to something a little more intuitive 🙂
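
For readers following along, here is a rough toy model of the trim-back behavior described above. It is not Pagefind's actual implementation: the function name trimUntilMatch, the plain array of indexed words, and the use of prefix matching are all assumptions made for illustration.

```javascript
// Toy model only. trimUntilMatch and the word-array "index" are hypothetical;
// they are not part of Pagefind's API or internals.
function trimUntilMatch(term, indexedWords) {
  // Try the full term first, then drop one trailing character at a time.
  for (let len = term.length; len > 0; len--) {
    const candidate = term.slice(0, len);
    // Assumption: a candidate "matches" if any indexed word starts with it.
    const hits = indexedWords.filter((word) => word.startsWith(candidate));
    if (hits.length > 0) {
      return { matchedTerm: candidate, hits };
    }
  }
  // Nothing matched, even at a single character.
  return { matchedTerm: "", hits: [] };
}

// A typo gets trimmed back to a sensible term...
const typo = trimUntilMatch("generalx", ["general", "motors"]);
console.log(typo.matchedTerm);

// ...but with no lower bound, an unrelated term can shrink to one letter.
const unrelated = trimUntilMatch("poop", ["george", "p", "knapp"]);
console.log(unrelated.matchedTerm);
```

Under these assumptions, the second call shows why single-letter tokens such as P and C end up matching: nothing stops the loop before it reaches a one-character candidate.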

c-w commented 5 months ago

Thanks for the explanation. For now I'll hack around it by client-side parsing the excerpt and filtering out any matches where the mark is shorter than some threshold.
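
A minimal sketch of that client-side filter, in case it helps anyone else. It assumes each loaded result carries an excerpt string in which matches are wrapped in <mark> elements; the helper names and the threshold of 3 are my own choices, not anything Pagefind provides.

```javascript
// Hypothetical helpers, not part of Pagefind. Assumes excerpts highlight
// matches as <mark>...</mark>.
function longestMarkLength(excerpt) {
  let longest = 0;
  // Collect the text inside every <mark> element in the excerpt.
  for (const match of excerpt.matchAll(/<mark>([\s\S]*?)<\/mark>/g)) {
    longest = Math.max(longest, match[1].length);
  }
  return longest;
}

// Keep only results whose longest highlighted match meets the threshold.
function filterShortMatches(results, minLength = 3) {
  return results.filter(
    (result) => longestMarkLength(result.excerpt) >= minLength
  );
}

// Usage with hand-written data shaped like loaded results.
const loaded = [
  { url: "/p.html", excerpt: "George <mark>P.</mark> Knapp" },
  { url: "/shy.html", excerpt: "not <mark>shy</mark> of using words" },
];
console.log(filterShortMatches(loaded).map((result) => result.url));
```

Filtering on the longest mark rather than the first keeps a result whenever at least one highlight is long enough; filtering on the shortest would be stricter.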

bglw commented 5 months ago

v1.0.5-rc2 has been published with the fix for forceLanguage 🙂

Will leave this issue alive til it hits stable.

bglw commented 3 months ago

This has landed in the v1.1.0 release 🎉
