apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Use ICU #81

Closed. TinoDidriksen closed this issue 3 years ago

TinoDidriksen commented 4 years ago

How about we switch all I/O and wide char use to ICU instead? That would get rid of all the locale irritations and make the code more portable.

We already require ICU, both directly and indirectly. We could even get rid of PCRE in downstream tools.

flammie commented 4 years ago

I think it's about time to start using ICU. I feel like we last discussed this some 10 years ago, and I was against it on the grounds that ICU is quite large by default and not standardly installed, or even installable. Since we already have experience using it, I think that's no longer an issue.

jonorthwash commented 4 years ago

Please please please can we do this? 🥺

ftyers commented 4 years ago

Is there a roadmap to fix this (e.g. provide what ICU does) in C++ itself, for instance in std?

TinoDidriksen commented 4 years ago

Sort of. "SG16: Unicode Direction" is an overview of the work going into Standard C++ by SG16. The types are all there in C++20 (char8_t, char16_t, char32_t representing UTF-8/16/32), but none of the library is.

But they do conclude 2 important things: "wchar_t is a portability deadend" and "In practice this means that we’ll need to ensure that proposals for new Unicode features are implementable using ICU."

So, wchar_t is bad and even Standard C++'s handling of Unicode would likely just forward to ICU.

But that library work is all slated for C++23 or later, which due to the 5 year lag means we can't widely use any of it until 2028 or 2031.
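To make the "types but no library" gap concrete, here is a minimal sketch of what code has to do by hand today just to turn UTF-8 bytes into `char32_t` code points. The function name `decode_utf8` and its error policy (emit U+FFFD and skip one byte) are assumptions made for illustration; this is not lttoolbox code, and ICU handles this case far more thoroughly.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// The standard library has the char32_t type but no portable decoder,
// so every step below is hand-written.
std::u32string decode_utf8(const std::string& bytes) {
    const char32_t kReplacement = 0xFFFD;
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(kReplacement);
            ++i;
            continue;
        }

        if (i + len > bytes.size()) {
            // Sequence is cut off at the end of the input.
            out.push_back(kReplacement);
            break;
        }

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (ok && (cp < kMin[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (!ok) {
            // Assumed policy: emit one replacement and retry from the next byte.
            out.push_back(kReplacement);
            ++i;
            continue;
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Calling `decode_utf8("\xC3\xA9")` yields a one-element string holding U+00E9. Note that normalization, case mapping, and locale-independent I/O are still missing after all this, which is the part ICU would supply.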

ftyers commented 4 years ago

Thanks @TinoDidriksen for the excellent overview. In that case I think that moving to ICU is probably the right thing to do, even if it's really ugly. I agree with @flammie that the situation now is very different to what it was 10 years ago.

nlhowell commented 4 years ago

Re: ICU being ugly: I wrote a wrapper for it when we ported lexd. You might take a look at commit a5251bae0f935301ca9276e90c02e9f3262b9c0d for the port. The wrapper provides a C++ iterator interface instead of ICU's C-like iterator.

It's not complete, but it has worked pretty well for us.
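For readers who have not seen the idea, the following is a rough sketch of what "a C++ iterator interface over code points" can look like. It is not the lexd wrapper from the commit above: the class name `CodePointView` is hypothetical, and to stay self-contained it decodes UTF-16 surrogate pairs by hand over a `std::u16string` rather than calling ICU (where the `U16_NEXT` macro does the equivalent step).

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical wrapper: presents a UTF-16 string as a range of code points.
// It stores a pointer to the string, so the string must outlive the view.
class CodePointView {
public:
    class iterator {
    public:
        using iterator_category = std::forward_iterator_tag;
        using value_type = char32_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const char32_t*;
        using reference = char32_t;

        iterator(const std::u16string* text, std::size_t pos)
            : text_(text), pos_(pos) {}

        // Decode the code point at the current position.
        char32_t operator*() const {
            const char16_t lead = (*text_)[pos_];
            if (is_pair()) {
                const char16_t trail = (*text_)[pos_ + 1];
                return 0x10000
                     + ((static_cast<char32_t>(lead) - 0xD800) << 10)
                     + (static_cast<char32_t>(trail) - 0xDC00);
            }
            // BMP code point, or an unpaired surrogate passed through as-is.
            return lead;
        }

        // Advance by one code point (one or two UTF-16 code units).
        iterator& operator++() {
            pos_ += is_pair() ? 2 : 1;
            return *this;
        }
        iterator operator++(int) {
            iterator old = *this;
            ++*this;
            return old;
        }

        bool operator==(const iterator& o) const {
            return text_ == o.text_ && pos_ == o.pos_;
        }
        bool operator!=(const iterator& o) const { return !(*this == o); }

    private:
        // True if the current position starts a valid surrogate pair.
        bool is_pair() const {
            if (pos_ + 1 >= text_->size()) return false;
            const char16_t lead = (*text_)[pos_];
            const char16_t trail = (*text_)[pos_ + 1];
            return lead >= 0xD800 && lead <= 0xDBFF &&
                   trail >= 0xDC00 && trail <= 0xDFFF;
        }

        const std::u16string* text_;
        std::size_t pos_;
    };

    explicit CodePointView(const std::u16string& text) : text_(&text) {}

    iterator begin() const { return iterator(text_, 0); }
    iterator end() const { return iterator(text_, text_->size()); }

private:
    const std::u16string* text_;
};
```

With this in place, `for (char32_t cp : CodePointView(s))` walks a string one code point at a time, and standard algorithms such as `std::distance` work on the range, which is the ergonomic gain over a manual index-and-macro loop.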