Closed cmcaine closed 3 years ago
Yes, that is guaranteed. Automa does not operate on String or Chars as such, only on AbstractVector{UInt8} and UInt8. Any non-ASCII characters (including invalid UTF-8) are transformed to bytes first, so e.g. "ϵ" corresponds to 0xcf * 0xb5.
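To make the byte-level view concrete, here is a small sketch using only Julia Base functions (no Automa API is called, and the variable names are just for illustration). It shows that "ϵ" is two bytes, and that an invalid UTF-8 string is still an ordinary byte sequence:

```julia
s = "ϵ"
bytes = codeunits(s)                  # byte view of the string, no copy

@assert bytes isa AbstractVector{UInt8}
@assert collect(bytes) == [0xcf, 0xb5]
@assert length(s) == 1                # one character...
@assert ncodeunits(s) == 2            # ...but two bytes

# Invalid UTF-8 is still just a sequence of bytes.
bad = String([0x61, 0xff, 0x62])      # 'a', an invalid byte, 'b'
@assert !isvalid(bad)
@assert collect(codeunits(bad)) == [0x61, 0xff, 0x62]
```

A machine that only ever consumes UInt8 values never has to decode characters, so malformed input cannot confuse a decoding step.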
The drawback is that the machine advances at every byte, not at every character. This can lead to some quite tricky situations, for example https://github.com/BioJulia/FASTX.jl/pull/28, which was caused by the machine advancing to the "finished reading record" state after seeing the first byte of a Windows newline \r\n, after which it saw an unexpected \n and crashed.
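A rough illustration of that kind of bug, as a hand-written toy state machine. This is not the FASTX.jl code and not Automa-generated code; `count_records`, the `crlf_aware` flag, and the state names are all hypothetical. With `crlf_aware=false` the machine finishes a record on the first newline byte it sees, so the \n of a \r\n pair arrives when it expects the start of the next record:

```julia
const CR = UInt8('\r')
const LF = UInt8('\n')

# Hypothetical toy record counter that advances one byte at a time.
function count_records(data::AbstractVector{UInt8}; crlf_aware::Bool)
    state, n = :start, 0
    for b in data
        if state === :start
            # A newline byte may not begin a record.
            (b == CR || b == LF) && error("unexpected newline byte at start of record")
            state = :body
        elseif state === :body
            if b == LF || (b == CR && !crlf_aware)
                n += 1            # record finished on this single byte
                state = :start
            elseif b == CR
                state = :cr       # wait for the second byte of \r\n
            end
        else  # state === :cr
            b == LF || error("expected \\n after \\r")
            n += 1
            state = :start
        end
    end
    return n
end
```

On Unix newlines both variants agree: `count_records(codeunits("a\nb\n"); crlf_aware=false)` returns 2. On `"a\r\nb\r\n"` the naive variant throws at the \n, while `crlf_aware=true` returns 2 because the extra `:cr` state consumes the two-byte newline as a unit.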
Also, sorry for not seeing the issue before now. I've begun "watching" this repo now to not miss future issues/bugs.
Thank you for your response, and no worries about the delay!
I am asking because I was wondering if it might be better to write some of the regexes in HTTP.jl which encounter untrustworthy bytes as Automa.jl machines.
@cmcaine I'm closing this issue as resolved. You're welcome to open another one or ping me if you disagree.
Julia base has an issue where invalid UTF-8 strings are matched incorrectly, possibly leading to accessing memory out of bounds:
The same regexes and input give a good result with Automa, but I wonder if Automa is guaranteed to be safe?