Charcoal-SE / metasmoke

Web dashboard for SmokeDetector.
https://metasmoke.erwaysoftware.com
Creative Commons Zero v1.0 Universal

Emoji in regex breaks MS search #838

Open tripleee opened 3 years ago

tripleee commented 3 years ago

What problem has occurred? What issues has it caused?

Charcoal-SE/SmokeDetector#5550 links to https://metasmoke.erwaysoftware.com/search?utf8=%E2%9C%93&body_is_regex=1&body=%28%3Fs%3A%5Cb%5B%5Cs.%3E%5D%2A%F0%9F%98%8D%F0%9F%98%8D%2B%5CW%2A%5Cb%29 which however produces a Ruby traceback for me.

Mysql2::Error: Got error 'nothing to repeat at offset 14' from regexp: SELECT COUNT(*) AS count_all, `posts`.`is_tp` AS posts_is_tp, `posts`.`is_fp` AS posts_is_fp, `posts`.`is_naa` AS posts_is_naa FROM `posts` WHERE (IFNULL(`posts`.`body`, '') REGEXP '(?s:\\b[\\s.>]*😍😍+\\W*\\b)') GROUP BY `posts`.`is_tp`, `posts`.`is_fp`, `posts`.`is_naa`

     respond_to do |format|
       format.html do
>>>      @counts_by_accuracy_group = @results.group(:is_tp, :is_fp, :is_naa).count
         @counts_by_feedback = %i[is_tp is_fp is_naa].each_with_index.map do |symbol, i|
           [symbol, @counts_by_accuracy_group.select { |k, _v| k[i] }.values.sum]
         end.to_h

What would you like to happen/not happen?

The regex is not really wrong; the search should run and show the hits instead of crashing.

Looks like the regex engine in MariaDB doesn't think an emoji is something you can repeat? Dunno if we can devise a workaround or should just defer this upstream.

tripleee commented 3 years ago

Just https://metasmoke.erwaysoftware.com/search?utf8=%E2%9C%93&body_is_regex=1&body=%F0%9F%98%8D stunningly crashes with "nothing to repeat" so it's the emoji itself which produces the error.
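For reference, a minimal Ruby sketch of what that `body` parameter decodes to. It only inspects the string locally and does not touch the database; the method name `describe_search_body` is made up for illustration and is not part of metasmoke.

```ruby
require 'uri'

# Hypothetical helper (not from metasmoke): decode a percent-encoded
# search body and report basic facts about its first character.
def describe_search_body(encoded)
  pattern = URI.decode_www_form_component(encoded) # decodes as UTF-8 by default
  first = pattern.codepoints.first
  {
    pattern: pattern,
    codepoint: format('U+%X', first),      # hex code point of the first character
    utf8_bytes: pattern.bytesize,          # number of UTF-8 bytes in the whole body
    above_bmp: first > 0xFFFF              # true if outside the Basic Multilingual Plane
  }
end

info = describe_search_body('%F0%9F%98%8D')
puts info[:codepoint]   # U+1F60D
puts info[:utf8_bytes]  # 4
puts info[:above_bmp]   # true
```

So the whole search body is a single code point above U+FFFF, encoded as four UTF-8 bytes, with no quantifier anywhere in the pattern.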

tripleee commented 2 years ago

Related: https://chat.stackexchange.com/transcript/message/61347168#61347168

makyen commented 2 years ago

This appears to be a limitation in the regex implementation used by the database. It doesn't accept, or ignores, characters above 0xFFFF (either as literal characters or as Unicode escapes; e.g. \x{0b03}, which can have at most 4 hex digits), so a lot of emoji just won't be recognized.
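If that diagnosis holds, one possible workaround on the metasmoke side would be to check the pattern before it reaches the database and return a readable message instead of a traceback. The following is only a sketch under that assumption: `astral_characters` and `check_search_regex` are hypothetical names, nothing here is existing metasmoke code, and it only catches literal characters, not escape sequences that encode the same code points.

```ruby
# Assumption: the database regex engine only copes with code points up to U+FFFF.
BMP_MAX = 0xFFFF

# Hypothetical helper: list the distinct characters in a pattern
# whose code point lies above the Basic Multilingual Plane.
def astral_characters(pattern)
  pattern.each_char.select { |ch| ch.ord > BMP_MAX }.uniq
end

# Hypothetical helper: return nil if the pattern looks safe to send,
# or a human-readable warning naming the offending code points.
def check_search_regex(pattern)
  offending = astral_characters(pattern)
  return nil if offending.empty?

  listed = offending.map { |ch| format('U+%X', ch.ord) }.join(', ')
  "Regex contains characters above U+FFFF that the database may reject: #{listed}"
end

puts check_search_regex("(?s:\\b[\\s.>]*\u{1F60D}\u{1F60D}+\\W*\\b)")
puts check_search_regex('plain\\s+text').inspect
```

A check like this would not make the search work for such emoji; it would only turn the crash into a message the user can act on.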