CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License
3.48k stars, 113 forks

Find text in embedded svg #245

Open Heinrichsgeist opened 1 year ago

Heinrichsgeist commented 1 year ago

Embedded SVG images in web pages may contain meaningful text, e.g. in diagrams. It would be great if Pagefind could find this text as well.

In the attached example SVG file (Text-containing-svg), the visible text should be found: "unboxed text", "shape internal text", "shape hyperlink", "hyperlink inside".

bglw commented 1 year ago

Good suggestion — I'll look at the best way to achieve this. Right now SVG elements are blanket disallowed:

https://github.com/CloudCannon/pagefind/blob/d6bcff2c725dd7ac9f2d44cc71a46ba98814a15d/pagefind/src/fossick/parser.rs#L27-L30

I can see two good options. The first would be a CLI option. Currently exclude-selectors is available, which adds to that list. Either an include-selectors option could be added to override it, or a --strict-exclude-selectors option could be added to make the existing exclude-selectors an override rather than a merge.
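To make the difference between "merge" and "override" concrete, here is a minimal sketch of how the effective exclusion list could be computed under each proposal. It is not Pagefind's implementation: `DEFAULT_EXCLUDED` is an illustrative placeholder (the thread only confirms that `svg` is in the built-in list), and `resolve_excluded`, `user_included`, and `strict` are hypothetical names standing in for the proposed include-selectors and --strict-exclude-selectors options.

```python
# Illustrative defaults only; the thread confirms just that "svg" is excluded.
DEFAULT_EXCLUDED = ["script", "style", "svg"]


def resolve_excluded(user_excluded=(), user_included=(), strict=False):
    """Return the effective list of excluded selectors.

    user_excluded: selectors passed via exclude-selectors.
    user_included: selectors passed via the proposed include-selectors.
    strict:        models the proposed --strict-exclude-selectors, where the
                   user's list replaces the defaults instead of extending them.
    """
    # Merge (current behaviour) starts from the defaults; override starts empty.
    base = [] if strict else list(DEFAULT_EXCLUDED)
    for selector in user_excluded:
        if selector not in base:
            base.append(selector)
    # An include list removes entries from whatever was excluded so far.
    included = set(user_included)
    return [selector for selector in base if selector not in included]
```

With these assumptions, `resolve_excluded(user_included=["svg"])` drops `svg` from the defaults, while `resolve_excluded(["script", "style"], strict=True)` re-states the defaults the user still wants and leaves `svg` out.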

The second option would be a data attribute that can include individual elements, so you would tag something like:

<svg data-pagefind-index> ... </svg>

to include a single SVG.
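As a rough illustration of the data-attribute idea, the sketch below collects text from a page while skipping excluded elements, unless the element carries the opt-in attribute. It uses Python's standard `html.parser` rather than Pagefind's Rust parser, and `EXCLUDED_TAGS`, `TextCollector`, and `collect_text` are hypothetical names. The attribute name `data-pagefind-index` is the one proposed above, not an existing feature.

```python
from html.parser import HTMLParser

EXCLUDED_TAGS = {"svg", "script", "style"}  # illustrative subset, not the real list
OPT_IN_ATTR = "data-pagefind-index"         # attribute proposed in this thread


class TextCollector(HTMLParser):
    """Collects text, skipping excluded elements that lack the opt-in attribute."""

    def __init__(self):
        super().__init__()
        self.skip_tag = None   # tag name of the excluded element being skipped
        self.skip_depth = 0    # nesting depth of that tag while skipping
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Track nested elements of the same name so the end tags balance.
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in EXCLUDED_TAGS and OPT_IN_ATTR not in dict(attrs):
            self.skip_tag = tag
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth and tag == self.skip_tag:
            self.skip_depth -= 1
            if self.skip_depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def collect_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For example, given one untagged and one tagged SVG, only the text of the tagged one would be collected alongside the regular page text.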

What would be the most ergonomic option for you?

Heinrichsgeist commented 1 year ago

The most ergonomic option for my use case (most of my SVGs should be indexed) would be a CLI option. I cannot simply inject the data attribute through my SSG, because the SVG code is generated by an external drawing tool.

Maybe it would be good to be able to control the default behaviour via the CLI, and additionally to be able to add exceptions for individual SVG code blocks (for my use case: "pagefind, please ignore this svg").
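One way to read this combination is as a precedence rule: a per-element attribute wins over the site-wide default. The sketch below is only an assumption about how that could work. `should_index_svg` and `index_svg_by_default` are hypothetical names, the latter standing in for whatever CLI setting is chosen. `data-pagefind-index` is the attribute proposed earlier in this thread, and `data-pagefind-ignore` is used here as the opt-out marker, though whether it would apply to SVGs in this way is not confirmed.

```python
def should_index_svg(attrs, index_svg_by_default):
    """Decide whether one <svg> element is indexed.

    attrs:                 mapping of the element's attribute names to values.
    index_svg_by_default:  hypothetical site-wide default set via the CLI.
    """
    # A per-element opt-out always wins ("pagefind, please ignore this svg").
    if "data-pagefind-ignore" in attrs:
        return False
    # A per-element opt-in overrides a default of "do not index".
    if "data-pagefind-index" in attrs:
        return True
    # Otherwise fall back to the site-wide default.
    return index_svg_by_default
```

Letting the opt-out win when both attributes are present is a design choice made for this sketch; the thread does not settle it.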