CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Expand on the provided CLI glob options #128

Open bglw opened 1 year ago

bglw commented 1 year ago

Discussed in https://github.com/CloudCannon/pagefind/discussions/127

  1. Provide functionality for the `glob` option to take a list of globs
  2. Provide an inverse `exclude_glob` option to help setups which need to exclude only a few files

jaygooby commented 1 year ago

You may not need to implement exclude_glob, because you can already exclude matches from a glob pattern, although it's not as simple as specifying which folders to ignore.

I was having the same issue as @ndeville in https://github.com/CloudCannon/pagefind/discussions/127 where my Jekyll _site/tags folder was causing duplicate results to appear.

I excluded it with this glob pattern: *[!s]/**/*.{html}, which matches any .html file inside a top-level folder whose name does not end in s, so tags is excluded. Be careful, though: any other folder whose name ends in s will be skipped too. I use it like this in pagefind.yml:

glob: "*[!s]/**/*.{html}"

Before this I was seeing

Total:
   Indexed 1 language
   Indexed 126 pages

and now I get

Total:
   Indexed 1 language
   Indexed 116 pages

and my duplicate results have disappeared.

You can check which folders would be indexed by using the same folder glob with ls:

ls -d _site/*[!s]

and confirm that there's no tags folder.
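
If you would rather sanity-check the pattern against a list of paths without touching the filesystem, here is a minimal Python sketch. It only approximates the glob above with Python's fnmatch, not Pagefind's own glob engine, and it assumes the pattern requires at least one top-level folder (so root-level files are skipped). The function name preview_glob and the sample paths are made up for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def preview_glob(paths, folder_pattern="*[!s]"):
    """Return the .html paths whose top-level folder matches folder_pattern.

    Approximation of "*[!s]/**/*.{html}" using fnmatch; this is NOT
    Pagefind's matcher, just a way to reason about the pattern.
    """
    kept = []
    for raw in paths:
        parts = PurePosixPath(raw).parts
        # Assumption: the glob needs a folder before the file name,
        # so files sitting directly in the site root are skipped.
        if len(parts) < 2:
            continue
        if fnmatchcase(parts[0], folder_pattern) and parts[-1].endswith(".html"):
            kept.append(raw)
    return kept


# Hypothetical paths, relative to the site root (e.g. _site).
sample = [
    "index.html",
    "blog/first-post/index.html",
    "tags/jekyll/index.html",
    "posts/2023/index.html",
    "about/index.html",
]

for path in preview_glob(sample):
    print(path)
```

With these sample paths, tags/ and posts/ are both dropped because their names end in s, which is the side effect mentioned above.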

Karlstens commented 8 months ago

Discussed in #127

1. Provide functionality for the `glob` option to take a list of globs

2. Provide an inverse `exclude_glob` option to help setups which need to exclude only a few files

I would certainly appreciate an exclude_glob so that I can block specified folders/files from being indexed/returned by pagefind. I tried a couple of ideas based on what jaygooby suggested, but none of them worked in my sandbox.