cu / silicon

Silicon Notes, a web-based personal knowledge base with few frills

HTML(-like) elements in preformatted text removed in search results #13

Closed cu closed 6 months ago

cu commented 6 months ago

This is just a preliminary report for now; more investigation is needed.

  1. Create a page with a preformatted section, either block or inline style.
  2. Inside that preformatted section, create an HTML-like tag containing a fairly unique string, e.g. <onetwotest>.
  3. Execute a search with that string.
  4. Observe that the search finds the page, but the string does not appear in the snippet shown in the search results.

This likely happens because HTML(-like) tags are intentionally filtered out of search results to prevent page content from affecting the style of the search results. But this turns out to be a heavy-handed approach that removes things that are not HTML tags, e.g. placeholders such as https://<some_domain>.com.

Running the results through an HTML escape function of some kind would be better.

cu commented 6 months ago

This is happening in the mark_query_results() Jinja filter, which does three things:

  1. The snippet is passed to markupsafe.Markup.striptags()
  2. The result of that is passed to the built-in html.escape().
  3. The result of that is filtered through an re search-and-replace to HTMLify query highlighting in the snippet.
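
For illustration, here is a minimal sketch of that three-step chain. It is a reconstruction from the description above, not the actual Silicon code: the function signature, the `<mark>` highlight tag, and the case-insensitive matching are all assumptions, and `strip_tags()` is a hypothetical helper that falls back to a rough regex stand-in when markupsafe is not installed.

```python
import html
import re

try:
    from markupsafe import Markup

    def strip_tags(text: str) -> str:
        # Step 1 as described: markupsafe removes anything that looks like a tag.
        return str(Markup(text).striptags())

except ImportError:

    def strip_tags(text: str) -> str:
        # Rough stand-in (assumption): drop <...> spans and collapse whitespace.
        return " ".join(re.sub(r"<[^>]*>", "", text).split())


def mark_query_results(snippet: str, query: str) -> str:
    """Hypothetical reconstruction of the filter's three steps."""
    stripped = strip_tags(snippet)      # 1. strip tags
    escaped = html.escape(stripped)     # 2. escape what is left
    if not query:
        return escaped
    # 3. wrap each match of the (escaped) query in highlight markup
    pattern = re.compile(re.escape(html.escape(query)), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", escaped)


result = mark_query_results("Use https://<some_domain>.com as the base URL", "base")
print(result)
```

With this input the placeholder is already gone after step 1, so nothing later in the chain can bring it back.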

To fix this while still stripping real HTML, striptags() would have to be made aware of Markdown's two preformatted text syntaxes, which is not something I'm super interested in today. The other option is to remove striptags() from the chain and live with HTML in the search results. (Which might be tolerable since HTML instead of Markdown on my pages is vanishingly rare.)
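
A rough sketch of that second option, with striptags() dropped and only the escape and highlight steps kept. The function name and the `<mark>` highlight tag are assumptions for illustration, not the project's actual code.

```python
import html
import re


def mark_query_results_escaped(snippet: str, query: str) -> str:
    """Hypothetical variant of the filter: escape and highlight, no tag stripping."""
    escaped = html.escape(snippet)  # tags and placeholders survive as literal text
    if not query:
        return escaped
    # Match the escaped form of the query so it lines up with the escaped snippet.
    pattern = re.compile(re.escape(html.escape(query)), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", escaped)


result = mark_query_results_escaped("Use https://<some_domain>.com", "some_domain")
print(result)
```

Because the highlight step runs on already-escaped text, a short query such as `lt` or `amp` could also match inside an entity like `&lt;`, so a real implementation would probably want to guard against that.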

All told, for now at least I believe I'm going to mark this as won't fix and just try to use parens or curly braces for placeholder values in preformatted text from now on.