bytedeco / javacpp-presets

The missing Java distribution of native C++ libraries

Libpostal unable to expand certain addresses #954

Open mrog opened 4 years ago

mrog commented 4 years ago

Libpostal hangs when it's asked to expand certain addresses from Java. It might be related to the use of Unicode characters in JNI. The same addresses work when the libpostal command line executable is used.

Repro steps

Run this Java code using libpostal-platform version 1.1-alpha-1.5.4.

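import java.nio.charset.StandardCharsets;
import org.bytedeco.javacpp.*;
import org.bytedeco.libpostal.*;
import static org.bytedeco.libpostal.global.postal.*;

// libpostalDataDir is the path to the libpostal data directory.
libpostal_setup_datadir(libpostalDataDir);
libpostal_setup_parser_datadir(libpostalDataDir);
libpostal_setup_language_classifier_datadir(libpostalDataDir);
String address = "ПРОСПЕКТ КУЛЬТУРЫ";
BytePointer addressToExpand = new BytePointer(address, StandardCharsets.UTF_8);
libpostal_normalize_options_t normalizeOptions = libpostal_get_default_options();
SizeTPointer sizeTPointer = new SizeTPointer(0);
// This is the call that never returns.
PointerPointer expansionResult = libpostal_expand_address(addressToExpand, normalizeOptions, sizeTPointer);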

Expected result

The code should complete.

Actual result

The call to libpostal_expand_address never returns.

Command line result

This same address can be expanded using the command line executable for libpostal.

sh-3.2$ ./libpostal 'ПРОСПЕКТ КУЛЬТУРЫ'
проспект культуры
prospekt kultury

saudet commented 4 years ago

We probably need to encode the strings with Charset.defaultCharset(), and not UTF-8, for this to work.

mrog commented 4 years ago

I tried using Charset.defaultCharset() and it still doesn't work.

I'm using OpenJDK 14.0.2 on macOS 10.15.6.

mrog commented 4 years ago

Interestingly, I get this message when I try UTF-16:

WARN  invalid UTF-8
   at transliterate (transliterate.c:791) errno: No such file or directory

Then I tried ISO-8859-1 and it worked! That solution also worked on Ubuntu. I don't have a Windows machine to test it on.
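
A note on why ISO-8859-1 may only appear to work for this address. ISO-8859-1 has no Cyrillic characters, so String.getBytes substitutes '?' for each one, and the native call would then receive plain ASCII question marks rather than the address. The sketch below uses only the Java standard library and does not call libpostal; the class name EncodingCheck and its helper methods are made up for illustration. It also shows that Java's UTF-16 encoding starts with the byte-order mark FE FF, which is not valid UTF-8 and is consistent with the warning above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Bytes that Java produces for the string in the given charset.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // True if every character of s has a representation in cs.
    static boolean representable(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // "ПРОСПЕКТ КУЛЬТУРЫ" with the Cyrillic characters escaped.
        String address = "\u041F\u0420\u041E\u0421\u041F\u0415\u041A\u0422 "
                + "\u041A\u0423\u041B\u042C\u0422\u0423\u0420\u042B";

        // Cyrillic is not representable in ISO-8859-1.
        System.out.println(representable(address, StandardCharsets.ISO_8859_1)); // false

        // Each unmappable character is replaced by '?' (0x3F).
        byte[] latin1 = encode(address, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // ???????? ????????

        // UTF-8 uses two bytes per Cyrillic letter plus one for the space.
        System.out.println(encode(address, StandardCharsets.UTF_8).length); // 33

        // UTF-16 output begins with the big-endian byte-order mark.
        byte[] utf16 = encode(address, StandardCharsets.UTF_16);
        System.out.printf("%02X %02X%n", utf16[0], utf16[1]); // FE FF
    }
}
```

If this is what is happening, the ISO-8859-1 run avoids the hang without actually expanding the Russian text, so comparing its output against the command line result would be a quick way to confirm.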

mrog commented 4 years ago

I spoke too soon. I found a different string that only works if I choose UTF-8.

Lituânia

If I select ISO-8859-1, then this string causes the library to display the invalid UTF-8 message and then freeze.
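
The warning itself is consistent with the bytes Java produces. In ISO-8859-1, "â" is the single byte 0xE2, which in UTF-8 starts a three-byte sequence and must be followed by continuation bytes, not by "n". The following is a minimal sketch using only the Java standard library; the class name Utf8Check and the method isValidUtf8 are hypothetical, and the code does not call libpostal or explain the freeze that follows the warning.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String s = "Litu\u00E2nia";
        // Encoded as UTF-8, the string is well-formed UTF-8.
        System.out.println(isValidUtf8(s.getBytes(StandardCharsets.UTF_8)));      // true
        // Encoded as ISO-8859-1, the lone 0xE2 byte is malformed UTF-8.
        System.out.println(isValidUtf8(s.getBytes(StandardCharsets.ISO_8859_1))); // false
    }
}
```

A check like this could be run on the bytes before they are handed to the native library, to rule out malformed input as the trigger.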

mrog commented 4 years ago

To avoid any ambiguity, here are the two strings with the Unicode characters escaped.

String stringThatRequiresChoosingUtf8 = "Litu\u00E2nia";
String stringThatRequiresChoosingIso88591 = "\u041F\u0420\u041E\u0421\u041F\u0415\u041A\u0422 \u041A\u0423\u041B\u042C\u0422\u0423\u0420\u042B";

saudet commented 4 years ago

I suppose the encoding that we need to use depends on the model used. That information must be available somewhere.

saudet commented 4 years ago

@Maurice-Betzel Would you know?