Expand on the provided CLI glob options

bglw commented 1 year ago

Discussed in https://github.com/CloudCannon/pagefind/discussions/127

1. Provide functionality for the `glob` option to take a list of globs
2. Provide an inverse `exclude_glob` option to help setups which need to exclude only a few files

jaygooby commented 1 year ago

You may not need to implement exclude_glob, because you can already exclude matches from a glob pattern, although it's not as simple as specifying which folders to ignore.

I was having the same issue as @ndeville in https://github.com/CloudCannon/pagefind/discussions/127 where my Jekyll _site/tags folder was causing duplicate results to appear.

I excluded it with this glob pattern: *[!s]/**/*.{html}. It matches any .html file under a top-level folder whose name does not end in s, so tags is excluded. Be careful, though: any other top-level folder whose name ends in s will be ignored too. I use it like this in pagefind.yml:

glob: "*[!s]/**/*.{html}"

Before this I was seeing

Total:
   Indexed 1 language
   Indexed 126 pages

and now I get

Total:
   Indexed 1 language
   Indexed 116 pages

and my duplicate results have disappeared.

You can check which folders would be indexed by passing the same folder pattern to ls:

ls -d _site/*[!s]

and confirm that there's no tags folder.
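
To see the caveat in action without touching a real site, here is a small sketch of the same check on a scratch directory. The folder names (about, blog, posts, tags) are made up for illustration, and this only exercises the shell's own glob matching, not Pagefind's glob engine, which may differ in details.

```shell
# Scratch tree standing in for a Jekyll _site folder (folder names are made up).
site="$(mktemp -d)"
mkdir -p "$site/about" "$site/blog" "$site/posts" "$site/tags"

# Same check as the ls command above: list top-level folders
# whose names do not end in "s".
(cd "$site" && ls -d *[!s])

# Capture the same expansion on one line for easy comparison.
kept="$(cd "$site" && echo *[!s])"
echo "$kept"   # prints: about blog

# Clean up the scratch tree.
rm -rf "$site"
```

Note that posts is dropped along with tags, which is exactly the side effect to watch for if the site has other folders ending in s.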

Karlstens commented 8 months ago

Discussed in #127

1. Provide functionality for the `glob` option to take a list of globs

2. Provide an inverse `exclude_glob` option to help setups which need to exclude only a few files

I would certainly appreciate an exclude_glob so that I can block specified folders/files from being indexed/returned by pagefind. I tried a couple of ideas based on what jaygooby suggested, but none of them worked in my sandbox.

CloudCannon / pagefind

Expand on the provided CLI glob options #128
