Open sanyarnd opened 1 year ago
That just means your platform's default encoding isn't UTF-8. Do you need to use UTF-8? If your platform encoding does all you need, not forcing UTF-8 should work just fine.
Can you clarify a bit? I am using Ubuntu and it's using UTF-8 by default for everything, source file is also UTF-8, same goes for JVM.
Your JVM is obviously not using UTF-8. Check the value of the file.encoding system property.
That's funny, I get the same error here, but if I remove HS_FLAG_UTF8 it works fine. Can you confirm? That sounds like a bug in Hyperscan...
Sounds like issue https://github.com/intel/hyperscan/issues/362 maybe?
Yeah everything works ok without the flag. Seems like the case indeed. Strange it's not reproduced with other bindings.
Looking through code it seems it's not the reason
// Invalid one-byte caseS.
// Valid one-byte caseS.
{"\x7f", false},
{"\x7f", true}, // \x7f is valid
I have Cyrillic symbols, so it's not case.
Also note that turning the flag off disables caseless UTF-8 matching, for example, the following test fails now:
text="Ошибка" pattern=".*ош.*"
Again, the other bindings work as expected, so I tend to think there's something off with bindings itself.
I have the following set of flags:
HS_FLAG_CASELESS | HS_FLAG_SINGLEMATCH | HS_FLAG_UCP | HS_FLAG_UTF8
We can get UTF-8 bytes with String.getBytes() and call functions directly with a pointer to that and it still fails. So unless there's something wrong with how the JVM encodes in UTF-8, it's a problem with Hyperscan itself. Other bindings may offer some sort of workaround for that. Someone should investigate that.
Oh look at that. When building from the develop branch, the error doesn't show up anymore. So the bug has been fixed since 5.4.0. I guess we could upgrade the presets to the develop branch since they don't appear to make releases anymore...
Looks like that's been fixed in 5.4.1. Please give it a try with the snapshots: http://bytedeco.org/builds/
Ah, no, this is still happening with 5.4.1, but only when it gets built on CentOS 7. When I build locally with Fedora 36, this isn't happening. So I guess this is a bug in the compiler... Well, CentOS 7 EOL is next year, and then I think I'll just switch all builds to Ubuntu, and that should take care of this bug. If you need something before that, either build from source with a compiler that works, or figure out how to make this work with compilers available for CentOS 7!
The problem perists with 5.4.2-1.5.9 when using maven central artifact.
When I compiled the preset myself it worked fine, so it must have been something the way it was compiled.
import org.bytedeco.hyperscan.hs_database_t;
import static org.bytedeco.hyperscan.global.hyperscan.*;
public class Sample {
public static void main(String[] args) {
try (var db = new hs_database_t();
var err = new hs_compile_error_t()) {
var code = hs_compile("测", HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err);
var errMsg = "";
if (!err.isNull()) {
errMsg = err.message().getString();
}
System.out.printf("Word: '%s'; retcode '%d'; message '%s' %n", "测", code, errMsg);
}
}
}
outputs Word: '测'; retcode '-4'; message 'Expression is not valid UTF-8.'
More explicitly, hs_compile(new BytePointer("测".getBytes(StandardCharsets.UTF_8)), HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err)
fails too. It is a valid 3-byte UTF-8 sequence btw.
This should be fixed now. Please give it a try with the snapshots: http://bytedeco.org/builds/
Still fails.
I wonder which version of the compiler fixes that...
So far I figured the bug occurs in Parser.cpp which is generated by Ragel. Centos 7 uses Development (!) version 7 of Ragel, while Ragel 6 seems to work fine. I haven't managed to install Ragel 6 on Centos 7.
Ok, I found it. It is Ragel. Version 6 has to be used. I had it compiled manually because no package for Centos 7 exists. Also had to build cmake, bmake, libfsm and kelbt. What a fun day.
EDIT: Here is ragel 6.10 binary. It's self-contained. ragel.gz
Yes, I see, thanks. I forgot to update the builds to Ubuntu. Now the snapshots work. :+1:
Something is broken with Hyperscan build. The following code doesn't work.
Output:
Meanwhile, https://github.com/gliwka/hyperscan-java binaries work as expected