SimpleMachines / SMF

Simple Machines Forum — SMF in short — is free and open-source community forum software, delivering professional grade features in a package that allows you to set up your own online community within minutes!
https://www.simplemachines.org/

[2.1]: Long multi-byte words dropped in log_search_words #8312

Open sbulen opened 1 month ago

sbulen commented 1 month ago

Basic Information

The problem here is hard to see: long words containing multi-byte characters never make it into log_search_words; they are silently dropped.

Lots of subtleties here, but the core issue is that a non-multibyte-safe substring is taken.
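To illustrate the general failure mode (this is not SMF's code, just a minimal Python sketch with hypothetical helper names), cutting a UTF-8 string at a byte offset can split a character in half and leave an invalid sequence, whereas cutting by characters cannot. The sample word 委員會 is taken from the test string in this report; the cut lengths are arbitrary assumptions chosen for the demo.

```python
def truncate_bytes(word: str, max_bytes: int) -> bytes:
    """Byte-oriented cut, analogous to a non-multibyte-safe substring."""
    return word.encode("utf-8")[:max_bytes]


def truncate_chars(word: str, max_chars: int) -> str:
    """Character-oriented cut, analogous to a multibyte-safe substring."""
    return word[:max_chars]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "委員會"  # 3 characters, 3 bytes each in UTF-8

# Cutting at 4 bytes keeps all of the first character plus one stray byte
# of the second, so the result is no longer valid UTF-8.
broken = truncate_bytes(word, 4)
print(is_valid_utf8(broken))

# Cutting by characters always lands on a character boundary.
safe = truncate_chars(word, 2)
print(is_valid_utf8(safe.encode("utf-8")))
```

Whether an invalid value like `broken` is then rejected by the database or discarded earlier depends on the surrounding code and the database settings, which this sketch does not model.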

The sequence of events:

Note: if text2words is called during a background task, an error is logged: Cron error: 8192: strlen(): Passing null to parameter #1 ($string) of type string is deprecated (load.php, line 182)

This error is suppressed in the app, because deprecation errors are still suppressed in index.php, but it is not suppressed in cron.php.

Similar (but different) report: https://github.com/SimpleMachines/SMF/issues/6405

Bigger issue? The above term isn't actually a word, it's a sentence...

This issue exists in both 2.1 and 3.0. Even after cutting over to UTF8MB4 in 3.0, it may still exist, depending on whether and how the SMF truncate function is rewritten.

Steps to reproduce

  1. Create a new post with this in the subject: 三藩市道德委員會收到投訴:針對政治捐款「打包」組織
  2. Post it

Expected result

A word in log_search_words

Actual result

No words in log_search_words

Version/Git revision

3.0 alpha 2 & 2.1.4

Database Engine

All

Database Version

8.4

PHP Version

8.3.8

Logs

No response

Additional Information

No response

sbulen commented 3 weeks ago

I can no longer reproduce this with 3.0. I believe @Sesquipedalian fixed the issue in 3.0 with #8298.

In fact, I think #8298 fixed my broader concern above that we weren't properly breaking on words. E.g., 3.0 now properly recognizes that 委員會 = "committee", and places a single entry into log_search_words for that portion of the test string above.

Very cool.

Issue still exists with 2.1.