BioJulia / Automa.jl

A Julia code generator for regular expressions

Safe with invalid UTF-8? #63

Closed cmcaine closed 3 years ago

cmcaine commented 3 years ago

Julia Base has an issue where regexes match invalid UTF-8 strings incorrectly, possibly leading to out-of-bounds memory access:

s = "a\xffb"
# Both of these should match but don't
match(r"a.*b", s) # nothing
match(r".*", s) # BoundsError: attempt to access 3-codeunit String at index [8]
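
Until that is fixed, one defensive pattern is to check the input before handing it to a Base regex. This is only a minimal sketch: `safe_match` is a hypothetical helper name, not part of Base or Automa, and it assumes that treating invalid UTF-8 as "no match" is acceptable for the caller.

```julia
# Hypothetical helper: refuse to run a Base regex on invalid UTF-8.
# Returns `nothing` for invalid input, otherwise whatever `match` returns.
function safe_match(re::Regex, s::String)
    isvalid(s) || return nothing   # isvalid(::String) checks UTF-8 validity
    return match(re, s)
end

invalid_result = safe_match(r"a.*b", "a\xffb")   # input rejected before matching
valid_result = safe_match(r"a.*b", "aXb")        # ordinary RegexMatch
```

Note that this cannot tell "invalid input" apart from "valid input that did not match"; throwing an error instead of returning `nothing` would keep the two cases distinct.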

The same regexes and input give a good result with Automa, but I wonder if Automa is guaranteed to be safe?

import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

r = re"a.*b"
r.actions[:enter] = [:start]
r.actions[:exit] = [:end]

machine = Automa.compile(r)

actions = Dict(:start => :(s = p),
               :end => :(e = p-1))

context = Automa.CodeGenContext()
@eval function f(data::String)
    s = e = 0
    $(Automa.generate_init_code(context, machine))
    p_end = p_eof = sizeof(data)  # byte count; lastindex(data) falls short if data ends in a multibyte character
    $(Automa.generate_exec_code(context, machine, actions))
    return s, e, cs == 0 ? :ok : cs < 0 ? :error : :incomplete
end

f("a\xffb") # (1, 3, :ok)

jakobnissen commented 3 years ago

Yes, that is guaranteed. Automa does not operate on String or Char as such, only on AbstractVector{UInt8} and UInt8. Any non-ASCII characters (including invalid UTF-8) are converted to bytes first, so e.g. "ϵ" corresponds to 0xcf * 0xb5 (the byte 0xcf followed by the byte 0xb5). The drawback is that the machine must advance at every byte, not at every character. This can lead to some quite tricky situations, for example https://github.com/BioJulia/FASTX.jl/pull/28. That bug was caused by the machine advancing to the "finished reading record" state after seeing the first byte of a Windows newline \r\n, after which it saw an unexpected \n and crashed.
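
To make the byte-level view concrete, here is a small sketch that uses only Base functions and no Automa calls. The first part shows the bytes a byte-oriented machine would see for the strings discussed above. The second part is a hand-written line counter (`count_lines` is a hypothetical name used for illustration, not Automa-generated code) showing why a two-byte newline needs an explicit intermediate state.

```julia
# Part 1: what a byte-oriented machine actually sees.
s = "a\xffb"
bytes = collect(codeunits(s))         # three bytes: 0x61, 0xff, 0x62
valid = isvalid(s)                    # false: 0xff never occurs in valid UTF-8
eps_bytes = collect(codeunits("ϵ"))   # two bytes: 0xcf, 0xb5

# Part 2: a hand-written byte-level line counter (illustrative only).
# It keeps an explicit "just saw \r" state so that "\r\n" counts as one
# line ending. A machine that finishes the record on '\r' alone would
# then meet an unexpected '\n'.
function count_lines(data::AbstractVector{UInt8})
    n = 0
    after_cr = false
    for b in data
        if b == UInt8('\n')
            # '\n' directly after '\r' belongs to the same line ending
            after_cr || (n += 1)
            after_cr = false
        elseif b == UInt8('\r')
            n += 1
            after_cr = true
        else
            after_cr = false
        end
    end
    return n
end

n_lines = count_lines(codeunits("a\r\nb\n"))   # Windows and Unix endings mixed
```

Because the input is just a vector of bytes, invalid UTF-8 such as 0xff is handled like any other byte value and never requires decoding a character.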

jakobnissen commented 3 years ago

Also, sorry for not seeing the issue before now. I've begun "watching" this repo now to not miss future issues/bugs.

cmcaine commented 3 years ago

Thank you for your response, and no worries about the delay!

I am asking because I was wondering whether it might be better to rewrite some of the regexes in HTTP.jl that encounter untrusted bytes as Automa.jl machines.

jakobnissen commented 3 years ago

@cmcaine I'm closing this issue as resolved. You're welcome to open another one or ping me if you disagree.