CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

HTML entities in custom metadata titles #459

Closed bglw closed 9 months ago

bglw commented 11 months ago

I was about to write a larger note about how I cannot reproduce this, but it only seems to apply if you're using a custom data-pagefind-meta="title" attribute. The automatic h1 title capture works fine — so that will make the fix simpler (and also means this doesn't affect most sites).

Originally posted by @bglw in https://github.com/CloudCannon/pagefind/issues/24#issuecomment-1726652263

tbroyer commented 10 months ago

Note that this also applies to data-pagefind-filter.

sudomistress commented 10 months ago

I'll add that when using data-pagefind-meta="title" with the code below, the titles are rewritten for the first results shown. However, for any titles behind the "Load more results" button, the rewrite is not applied.

PagefindUI({
  processResult: function (result) { 
    result.meta.title = result.meta.title.replace("&amp;", "&");
    // any other replace desired, even when using regex
  }
})
olets commented 9 months ago

I'm not getting this for my titles. Maybe something about my build process?

I am, however, getting it for meta.image URLs.

Temp fix which works for me, both for hits shown initially and for hits behind the "load more" button:

PagefindUI({
  processResult: function (result) { 
    result.meta.image = result.meta.image.replaceAll("&amp;", "&");
    return result;
  }
})
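
The two workarounds above each patch a single field. As a minimal sketch for versions before the fix, assuming only that processResult receives a result object with a meta map and should return it (as in the snippet above), the same idea can be applied to every string metadata value. The helper name decodeMetaAmpersands is hypothetical, not part of Pagefind.

```javascript
// Hypothetical helper (not a Pagefind API): undo one layer of "&amp;"
// encoding in every string value of result.meta.
function decodeMetaAmpersands(result) {
  const meta = {};
  for (const [key, value] of Object.entries(result.meta ?? {})) {
    // Leave non-string values untouched.
    meta[key] = typeof value === "string" ? value.replaceAll("&amp;", "&") : value;
  }
  // Return the result, as the working snippet above does.
  return { ...result, meta };
}

// Stand-in result object for illustration only.
const sample = {
  meta: {
    title: "Q &amp; A",
    image: "https://my.cool.image/file.jpg?my=url&amp;search=params",
  },
};
console.log(decodeMetaAmpersands(sample).meta);
```

In the browser it would be passed the same way as the snippets above, for example processResult: decodeMetaAmpersands.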
bglw commented 9 months ago

Hi all — thanks for the patience. This is fixed for all metadata and filters in #499 , which will go out with the next release 🙂

bglw commented 9 months ago

Released in v1.0.4 🎉

olets commented 8 months ago

I still get &amp; in result.meta.image in v1.0.4.

I'm seeing

data-pagefind-image="https://my.cool.image/file.jpg?my=url&search=params"

result in the Default UI result thumbnail

<img src="https://my.cool.image/file.jpg?my=url&amp;search=params">

Let me know if I should open a new issue.

tbroyer commented 8 months ago

That's valid and expected HTML. Your original HTML wasn't actually.

olets commented 8 months ago

Your original HTML wasn't actually.

This makes me think we might be misunderstanding each other. Just in case, I'll clarify 😃 If you don't want to change this behavior that's cool. My earlier comment is an easy enough workaround.

It's valid — attribute values can have &. Maybe I tripped you up by leaving out how I connect it to Pagefind? I'm doing

<img
  data-pagefind-meta="image[data-pagefind-image]"
  data-pagefind-image="https://my.cool.image/file.jpg?my=url&search=params"
  …
>

Or maybe my fake URL tripped you up? I use an image service that has a URL param API for processing the image. Pagefind's encoding & as &amp; means I'm not showing the image I intend to show.

tbroyer commented 8 months ago

If you don't want to change this behavior that's cool.

I'm not a maintainer so I have no control on that 🙂

It's valid — attribute values can have &.

Sorry, I thought it would be a parse error, it's not (source). My bad.

Pagefind's encoding & as &amp; means I'm not showing the image I intend to show.

So you mean that Pagefind uses https://my.cool.image/file.jpg?my=url&amp;search=params as the URL, which in HTML would actually be written as <img src="https://my.cool.image/file.jpg?my=url&amp;amp;search=params"> ? (notice the doubled amp;amp;) This is what tripped me up, because otherwise <img src="https://my.cool.image/file.jpg?my=url&amp;search=params"> and <img src="https://my.cool.image/file.jpg?my=url&search=params"> are strictly equivalent.

bglw commented 8 months ago

👋 Hey @olets

I believe everything is working as intended — though I'm surprised you're seeing an issue with this. Do you have a link to a live site where I can reproduce this breaking?

Context: I added your attributes to a local copy of the Pagefind docs:

data-pagefind-meta="image[data-pagefind-image]"
data-pagefind-image="https://my.cool.image/file.jpg?my=url&search=params"

And indexed. Through the JS API, I see:

const pagefind = await import("/pagefind/pagefind.js");
const search = await pagefind.search("pagefind");
const result = await search.results[0].data();
console.log(result.meta);
/*
{
  "title": "Community Resources",
  "image": "https://my.cool.image/file.jpg?my=url&search=params"
}
*/

Testing the same data flowing into the Default UI, Svelte does explicitly encode the attribute:

<img class="..." src="https://my.cool.image/file.jpg?my=url&amp;search=params" alt="...">

But this is just standard HTML encoding — it should have no effect on the value's usage.

Indeed, in my network tab the request for this image fires off as:

https://my.cool.image/file.jpg?my=url&search=params

At the point this request is hitting your image service, the & shouldn't be encoded in any way.

If you have a link you can share though I'd be more than happy to take a closer look and see if anything else is going on 🙂

olets commented 8 months ago

@tbroyer

I'm not a maintainer

My bad, I didn't look closely enough at who I was talking to. Thanks for your reply.

<img src="https://my.cool.image/file.jpg?my=url&amp;search=params"> and <img src="https://my.cool.image/file.jpg?my=url&search=params"> are strictly equivalent

In the first, the URLSearchParams are "my" and "amp;search", not "my" and "search".

--

@bglw I got a test branch ready for you… and can no longer replicate the problem 🤷 🎉 I'll email if it comes up again.

bglw commented 8 months ago

<img src="https://my.cool.image/file.jpg?my=url&amp;search=params"> and <img src="https://my.cool.image/file.jpg?my=url&search=params"> are strictly equivalent

In the first, the URLSearchParams are "my" and "amp;search", not "my" and "search".

Pasting that URL directly into an address bar will indeed give the params my and amp;search, but if that element is in HTML then once parsed the params will be my and search either way: the &amp; is decoded as HTML by the browser before the image element actually gets the value.

A good way to see this is if you inspect element on any GitHub avatar image, right click the element and copy outerHTML, you'll see:

<img class="avatar avatar-user" src="https://avatars.githubusercontent.com/u/40188355?s=80&amp;v=4" width="40" height="40" alt="@bglw">
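
The difference between the raw string and its HTML-decoded value can also be checked outside the browser. This is a rough sketch, assuming a runtime with the standard URL class (browsers, Node.js); the replaceAll call is only a stand-in for the attribute decoding that the HTML parser performs, not how a browser implements it.

```javascript
// The attribute text exactly as it appears in the HTML source.
const attributeSource = "https://my.cool.image/file.jpg?my=url&amp;search=params";

// Treating the source text as a URL directly (the "address bar" case).
const rawKeys = [...new URL(attributeSource).searchParams.keys()];

// Stand-in for the HTML parser: decode one layer of "&amp;" first.
const attributeValue = attributeSource.replaceAll("&amp;", "&");
const decodedKeys = [...new URL(attributeValue).searchParams.keys()];

console.log(rawKeys);     // [ 'my', 'amp;search' ]
console.log(decodedKeys); // [ 'my', 'search' ]
```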

If it seemed to recur after the update, Pagefind was potentially still cached at a prior version. Before v1.0.4 it would have been written into the HTML as https://my.cool.image/file.jpg?my=url&amp;amp;search=params, which would be decoded one layer to ?my=url&amp;search=params, which is indeed my and amp;search.
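
To make the layering concrete, here is a small sketch of single versus double encoding. The helpers encodeAmp and decodeAmpOnce are hypothetical and handle only the ampersand; they illustrate the behaviour described above and are not Pagefind's or Svelte's actual code.

```javascript
// Hypothetical helpers covering only "&"; real HTML escaping handles more characters.
const encodeAmp = (s) => s.replaceAll("&", "&amp;");
const decodeAmpOnce = (s) => s.replaceAll("&amp;", "&");

const intended = "https://my.cool.image/file.jpg?my=url&search=params";

// One layer of encoding: what a correct HTML attribute contains.
const encodedOnce = encodeAmp(intended);
// Two layers: the pre-1.0.4 situation described above.
const encodedTwice = encodeAmp(encodedOnce);

// The browser decodes exactly one layer when it parses the attribute.
console.log(decodeAmpOnce(encodedOnce) === intended);     // true
console.log(decodeAmpOnce(encodedTwice) === encodedOnce); // true: "&amp;" is still in the URL
```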

Hope that makes sense! 🙂 But yes do let me know if you see it pop up here or elsewhere.