bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries
Other
2.65k stars 736 forks source link

[Hyperscan] Broken UTF-8 support #1308

Open sanyarnd opened 1 year ago

sanyarnd commented 1 year ago

Something is broken with Hyperscan build. The following code doesn't work.

import static org.bytedeco.hyperscan.global.hyperscan.HS_FLAG_UCP;
import org.bytedeco.hyperscan.hs_compile_error_t;
import org.bytedeco.hyperscan.hs_database_t;
import static org.bytedeco.hyperscan.global.hyperscan.HS_FLAG_UTF8;
import static org.bytedeco.hyperscan.global.hyperscan.HS_MODE_BLOCK;
import static org.bytedeco.hyperscan.global.hyperscan.hs_compile;

public class Sample {
    public static void main(String[] args) {
        String[] patterns = {"ok.*", "not ok кириллица.*"};

        for (var pattern : patterns) {
            try (var db = new hs_database_t();
                 var err = new hs_compile_error_t()) {
                var code = hs_compile(pattern, HS_FLAG_UTF8 | HS_FLAG_UCP, HS_MODE_BLOCK, null, db, err);

                var errMsg = "";
                if (!err.isNull()) {
                    errMsg = err.message().getString();
                }
                System.out.printf("Word: '%s'; retcode '%d'; message '%s' %n", pattern, code, errMsg);
            }
        }
    }
}

Output:

Word: 'ok.*'; retcode '0'; message '' 
Word: 'not ok кириллица.*'; retcode '-4'; message 'Expression is not valid UTF-8.' 

Meanwhile, https://github.com/gliwka/hyperscan-java binaries work as expected

saudet commented 1 year ago

That just means your platform's default encoding isn't UTF-8. Do you need to use UTF-8? If your platform encoding does all you need, not forcing UTF-8 should work just fine.

sanyarnd commented 1 year ago

Can you clarify a bit? I am using Ubuntu and it's using UTF-8 by default for everything, source file is also UTF-8, same goes for JVM.

saudet commented 1 year ago

Your JVM is obviously not using UTF-8. Check the value of the file.encoding system property.

saudet commented 1 year ago

That's funny, I get the same error here, but if I remove HS_FLAG_UTF8 it works fine. Can you confirm? That sounds like a bug in Hyperscan...

saudet commented 1 year ago

Sounds like issue https://github.com/intel/hyperscan/issues/362 maybe?

sanyarnd commented 1 year ago

Yeah everything works ok without the flag. Seems like the case indeed. Strange it's not reproduced with other bindings.

sanyarnd commented 1 year ago

Looking through code it seems it's not the reason

    // Invalid one-byte caseS.
    // Valid one-byte caseS.
    {"\x7f", false},
    {"\x7f", true}, // \x7f is valid

I have Cyrillic symbols, so it's not case.

Also note that turning the flag off disables caseless UTF-8 matching, for example, the following test fails now:

text="Ошибка" pattern=".*ош.*"

Again, the other bindings work as expected, so I tend to think there's something off with bindings itself.

I have the following set of flags:

HS_FLAG_CASELESS | HS_FLAG_SINGLEMATCH | HS_FLAG_UCP | HS_FLAG_UTF8
saudet commented 1 year ago

We can get UTF-8 bytes with String.getBytes() and call functions directly with a pointer to that and it still fails. So unless there's something wrong with how the JVM encodes in UTF-8, it's a problem with Hyperscan itself. Other bindings may offer some sort of workaround for that. Someone should investigate that.

saudet commented 1 year ago

Oh look at that. When building from the develop branch, the error doesn't show up anymore. So the bug has been fixed since 5.4.0. I guess we could upgrade the presets to the develop branch since they don't appear to make releases anymore...

saudet commented 1 year ago

Looks like that's been fixed in 5.4.1. Please give it a try with the snapshots: http://bytedeco.org/builds/

saudet commented 1 year ago

Ah, no, this is still happening with 5.4.1, but only when it gets built on CentOS 7. When I build locally with Fedora 36, this isn't happening. So I guess this is a bug in the compiler... Well, CentOS 7 EOL is next year, and then I think I'll just switch all builds to Ubuntu, and that should take care of this bug. If you need something before that, either build from source with a compiler that works, or figure out how to make this work with compilers available for CentOS 7!

mmimica commented 1 year ago

The problem perists with 5.4.2-1.5.9 when using maven central artifact.

When I compiled the preset myself it worked fine, so it must have been something the way it was compiled.

import org.bytedeco.hyperscan.hs_database_t;

import static org.bytedeco.hyperscan.global.hyperscan.*;

public class Sample {
    public static void main(String[] args) {
        try (var db = new hs_database_t();
             var err = new hs_compile_error_t()) {
            var code = hs_compile("测", HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err);
            var errMsg = "";
            if (!err.isNull()) {
                errMsg = err.message().getString();
            }
            System.out.printf("Word: '%s'; retcode '%d'; message '%s' %n", "测", code, errMsg);
        }
    }
}

outputs Word: '测'; retcode '-4'; message 'Expression is not valid UTF-8.'

mmimica commented 1 year ago

More explicitly, hs_compile(new BytePointer("测".getBytes(StandardCharsets.UTF_8)), HS_FLAG_UTF8, HS_MODE_BLOCK, null, db, err) fails too. It is a valid 3-byte UTF-8 sequence btw.

saudet commented 1 year ago

This should be fixed now. Please give it a try with the snapshots: http://bytedeco.org/builds/

mmimica commented 1 year ago

Still fails.

saudet commented 1 year ago

I wonder which version of the compiler fixes that...

mmimica commented 1 year ago

So far I figured the bug occurs in Parser.cpp which is generated by Ragel. Centos 7 uses Development (!) version 7 of Ragel, while Ragel 6 seems to work fine. I haven't managed to install Ragel 6 on Centos 7.

mmimica commented 1 year ago

Ok, I found it. It is Ragel. Version 6 has to be used. I had it compiled manually because no package for Centos 7 exists. Also had to build cmake, bmake, libfsm and kelbt. What a fun day.

EDIT: Here is ragel 6.10 binary. It's self-contained. ragel.gz

saudet commented 1 year ago

Yes, I see, thanks. I forgot to update the builds to Ubuntu. Now the snapshots work. :+1: