ERL-876: re:run +unicode doesn't yield(?) #3921

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 5 years ago

Original reporter: lelf
Affected version: Not Specified
Fixed in version: OTP-22.1
Component: erts
Migrated from: https://bugs.erlang.org/browse/ERL-876


It seems I don't understand {{re}}'s trapping logic well enough to fix this myself.

This runs fine (OTP 16–22):
{code:erlang}
%% Erlang/OTP 22 [RELEASE CANDIDATE 1] [erts-10.2.4] [source-2052caebf9] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe] [dtrace]
Str = binary:copy(<<"-+foobar\n">>,50000).
re:run(Str, <<"foobar">>, [global]). % → 50000 results
{code}

However, this
{code:erlang}
re:run(Str, <<"foobar">>,       [global,unicode]).
re:run(Str, <<"(*UTF)foobar">>, [global]).     % equiv
{code}
takes an absurd amount of time (~6 s on a MacBook Pro) and locks the VM (^G takes more than 1 s to reach the prompt). The results are correct (I also tried with genuinely tricky Unicode input).
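
To make the slowdown easier to reproduce and compare across releases, here is a minimal sketch that times both calls on the same subject. The module name {{re_unicode_timing}} and the function {{compare/1}} are hypothetical helpers, not part of OTP; only {{binary:copy/2}}, {{re:run/3}} and {{timer:tc/3}} are real library calls.

```erlang
-module(re_unicode_timing).
-export([compare/1]).

%% Hypothetical helper: time re:run/3 on the same subject with and without
%% the unicode option.
%% Returns {PlainMicros, UnicodeMicros, SameMatchCount}.
compare(Copies) ->
    Str = binary:copy(<<"-+foobar\n">>, Copies),
    %% timer:tc/3 returns {ElapsedMicroseconds, Result}.
    {TPlain, {match, M1}} =
        timer:tc(re, run, [Str, <<"foobar">>, [global]]),
    {TUni, {match, M2}} =
        timer:tc(re, run, [Str, <<"foobar">>, [global, unicode]]),
    %% Both calls should find the same number of matches.
    {TPlain, TUni, length(M1) =:= length(M2)}.
```

Calling {{re_unicode_timing:compare(1000)}} first and then increasing the argument is probably safer than starting at 50000, since on an affected release the larger run blocks a scheduler for several seconds.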

{code:bash}
# count of function entries
dtrace -p`pgrep beam.smp` -n 'pid$target::*pcre*:entry { @[probefunc]=count() }' 

  erts_pcre_compile2                                                1
  erts_pcre_fullinfo                                                3
  erts_pcre_exec                                                50001
  erts_pcre_free_restart_data                                   50001
  _erts_pcre_valid_utf                                          50002
  erts_erts_pcre_free                                           50002
  erts_erts_pcre_malloc                                         50002
{code}

msacc
{noformat}
        Thread    alloc      aux      bif busy_wait check_io emulator      ets       gc  gc_full      nif    other     port     send    sleep   timers
 scheduler( 2)    0.35%    0.00%   99.47%     0.01%    0.00%    0.13%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.02%    0.00%
{noformat}

P.S. pcregrep works fine:
{code:erlang}
13 Ɛ⟩ re:version().
<<"8.42 2018-03-20">>
14 Ɛ⟩ os:cmd("pcregrep --version").
"pcregrep version 8.42 2018-03-20\n"

file:write_file("/tmp/str.txt",Str).
21 Ɛ⟩ f(T), {T,_} = timer:tc(os,cmd,["pcregrep '(*UTF)foobar' /tmp/str.txt"]), T.
48046
{code}
OTP-Maintainer commented 5 years ago

rickard said:

Thanks for the bug report!

Two issues combined to cause this:
* UTF-8 validation of the subject did not yield and did not consume any reductions
* When the {{global}} option is passed, the UTF-8 validation was performed once for each match

Since the validation did not cost any reductions and the matching in this subject is really cheap, multiple matches, including their validations, were performed in one go without scheduling out the calling process, which in turn caused the scheduler to be blocked for a long time. The repeated validations also caused the whole operation to take a very long time to complete.
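
One way to observe the reduction accounting described above is to compare the reductions charged to a process with the wall-clock time it spent in {{re:run/3}}. The following is only a sketch: the module {{re_reductions_probe}} and the function {{reductions_used/2}} are hypothetical names chosen for illustration, and the pattern is fixed to the one from the report.

```erlang
-module(re_reductions_probe).
-export([reductions_used/2]).

%% Hypothetical helper: run re:run/3 in a fresh process and report how many
%% reductions that process was charged, together with the elapsed
%% wall-clock time in microseconds.
reductions_used(Subject, Opts) ->
    Parent = self(),
    Ref = make_ref(),
    spawn_link(fun() ->
        {reductions, R0} = process_info(self(), reductions),
        {T, _Result} = timer:tc(re, run, [Subject, <<"foobar">>, Opts]),
        {reductions, R1} = process_info(self(), reductions),
        Parent ! {Ref, R1 - R0, T}
    end),
    receive
        {Ref, Reds, Micros} -> {Reds, Micros}
    end.
```

For example, {{re_reductions_probe:reductions_used(binary:copy(<<"-+foobar\n">>, 1000), [global, unicode])}} returns a {{{Reductions, Micros}}} pair. Going by the explanation above, an affected release would be expected to show few reductions relative to the elapsed time; that is an inference from the description, not a measured result.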

I've published a pull request [PR-2250|https://github.com/erlang/otp/pull/2250] that should fix these issues. This (or a modification of it, if any issues are found with it) will at least be released in the next maintenance patch (OTP 22.1).
OTP-Maintainer commented 5 years ago

rickard said:

I have merged [PR-2250|https://github.com/erlang/otp/pull/2250] into the {{maint}} branch now. That is, it will be released in the next maintenance patch (OTP 22.1).