Charcoal-SE / metasmoke

Web dashboard for SmokeDetector.
https://metasmoke.erwaysoftware.com
Creative Commons Zero v1.0 Universal

OR body search fails for last token adjacent to </a> and/or </p> #878

Open tripleee opened 3 years ago

tripleee commented 3 years ago

I wanted to search for this word Ollie just watched:

"OR" search for "sexyloveomegle" in title, body, or username

Zero hits! But if I only search the body, I can find it:

Search for "sexyloveomegle" in body

(The hit is in post 324969.)

Looking for another example, I discovered the same to be true for this one:

"OR" search for "Mukteshwar" in title, body, or username

Again, if I only search for body hits, it's there:

Search for "Mukteshwar" in body

(The hit is post 325248.)

What these two seem to have in common is that the search word is the very last word in the body, and the markup includes adjacent closing HTML tags.

kik: sexyloveomegle</p>
...
<a href="https://spam.example.com/elided" rel="nofollow noreferrer">Camp in Mukteshwar</a></p>

By comparison, where the terminating close tags are preceded by whitespace, the search works. So for example, "OR" search for "Burton" in title, body, or username finds post 324551 which has

...
<p> Harold Burton </p>

as the last line of the post.
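The pattern in these three posts can be checked mechanically. The following is a minimal sketch, not metasmoke code: the helper name `term_abuts_closing_tag` is hypothetical, and it only characterizes the failing cases (search term immediately followed by a closing tag, with no whitespace in between). It says nothing about why the "OR" search misses them.

```python
import re

def term_abuts_closing_tag(body: str, term: str) -> bool:
    """Return True if `term` is immediately followed by a closing HTML tag.

    Hypothetical helper for illustration only; it describes the observed
    pattern and is not taken from metasmoke.
    """
    # Literal term, then a closing tag such as </p> or </a>, no whitespace between.
    pattern = re.escape(term) + r"</[A-Za-z][A-Za-z0-9]*>"
    return re.search(pattern, body) is not None

# Body fragments quoted in this report.
failing_1 = "kik: sexyloveomegle</p>"
failing_2 = ('<a href="https://spam.example.com/elided" '
             'rel="nofollow noreferrer">Camp in Mukteshwar</a></p>')
working = "<p> Harold Burton </p>"

print(term_abuts_closing_tag(failing_1, "sexyloveomegle"))  # term touches </p>
print(term_abuts_closing_tag(failing_2, "Mukteshwar"))      # term touches </a>
print(term_abuts_closing_tag(working, "Burton"))            # whitespace before </p>
```

The two posts the "OR" search misses are the ones for which this check is true; the post it finds is the one for which it is false.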

(Tangentially, https://metasmoke.erwaysoftware.com/search?utf8=%E2%9C%93&title=sexyloveomegle&body=sexyloveomegle%3C%2Fp%3E&username=sexyloveomegle&or_search=1 gets me a traceback from metasmoke.)
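For readers who don't want to decode the query string by eye, here is a small sketch using only Python's standard library. It shows what the URL above actually submits: the `body` parameter is the search word with a literal `</p>` appended, while `title` and `username` carry the bare word. The variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = ("https://metasmoke.erwaysoftware.com/search"
       "?utf8=%E2%9C%93&title=sexyloveomegle"
       "&body=sexyloveomegle%3C%2Fp%3E"
       "&username=sexyloveomegle&or_search=1")

# parse_qs percent-decodes values and returns a list per key; take the first of each.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(f"{key} = {value!r}")
```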

makyen commented 3 years ago

I'd note that a regex search, which more accurately reflects what a watch or keyword blacklist entry would search for, works fine: sexyloveomegle and Mukteshwar. While that doesn't invalidate this as an issue, it does provide a workaround for most cases. A regex without the bookending applied by the watchlist/keyword blacklist also works: sexyloveomegle and Mukteshwar.

tripleee commented 3 years ago

No doubt; but this requires the searcher to be a registered metasmoke user.

Not being able to share links to searches with users who don't have an account is a major blocker for many situations where I would otherwise much prefer to use regex search.

makyen commented 3 years ago

Yes. It would be nice to be able to save a search in a way that cached the result and made it available through a short link which could be viewed by non-core users (or some other reasonable way of sharing a regex-based search with users who do not have the Core role). Caching would mean that reusing the search within a reasonable period didn't consume significant resources, and that period would be renewed automatically each time the search was used. There have been multiple times when I would have used such an ability in flags, or even just when posting in chat.

Undo1 commented 3 years ago

What if MS kept a list of 'okay' regex queries run by core users, maybe meaning those that ran within a time limit? Then when a non-core user submits a regex query, if it's on that list it's fine.

makyen commented 3 years ago

That would be reasonable; particularly with a time limit and limited to either searches which took < N seconds or for which the results are still in cache. If there is a time limit, it would probably be helpful if it was at least a couple/few days, in order to allow time for moderators to handle a flag which contains such a link.

IIRC, the primary reason we don't permit non-Core access to regex search is that regex searches can result in very substantial consumption of compute/database resources.

Maybe MS could keep a cache of regex search results, and anything which hit the cache could be served to anyone. There could be a link/button, similar to what's done with Blazer/the SQL Data Explorer, which allows a Core user to refresh the search results at any time. That would also save user time and compute when we're sharing searches in chat.
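To make the idea concrete, here is a rough sketch of that policy in Python. Everything in it is an assumption for illustration: the names `RegexSearchCache`, `search`, `refresh` and `run_query` are hypothetical, the results are just lists of post IDs, and none of this is taken from metasmoke or from any existing implementation. Core users run queries and populate the cache, anyone can be served a cache hit, each hit renews the entry, and only Core users can force a refresh.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

class RegexSearchCache:
    """Hypothetical time-limited cache of regex search results."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, List[int]]] = {}

    def store(self, pattern: str, results: List[int]) -> None:
        self._entries[pattern] = (self.clock(), results)

    def get(self, pattern: str) -> Optional[List[int]]:
        entry = self._entries.get(pattern)
        if entry is None:
            return None
        stamped_at, results = entry
        if self.clock() - stamped_at > self.ttl:
            del self._entries[pattern]  # expired: drop the entry
            return None
        self.store(pattern, results)    # renew the time limit on use
        return results

def search(cache: RegexSearchCache, pattern: str, is_core: bool,
           run_query: Callable[[str], List[int]]) -> Optional[List[int]]:
    """Serve cache hits to anyone; only Core users may run the real query."""
    cached = cache.get(pattern)
    if cached is not None:
        return cached
    if not is_core:
        return None                     # non-core and not cached: refuse
    results = run_query(pattern)
    cache.store(pattern, results)
    return results

def refresh(cache: RegexSearchCache, pattern: str, is_core: bool,
            run_query: Callable[[str], List[int]]) -> List[int]:
    """Re-run the query and replace the cached results (Core users only)."""
    if not is_core:
        raise PermissionError("only Core users may refresh a cached search")
    results = run_query(pattern)
    cache.store(pattern, results)
    return results
```

With a time limit of a few days, a link placed in a flag would keep working for moderators as long as someone opened it within each window.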

thesecretmaster commented 3 years ago

I wrote this cache feature for search in #797. It probably could use a bit more polish, but it works pretty well IIRC.