schmidtfederico closed this issue 3 years ago
Hi, I'm afraid that's not the case, and I'm not sure if there's an easy way to do this (I would have to think about it). ICU is a heavy lifter, definitely not an easy-going header-only library.
Is there any ICU feature you're missing? It could be added to stringi (any PRs are welcome).
I'm also open to discussing the idea of introducing some "internal" API that could be referred to from within C/C++ only.
Hi Marek!
Thank you for your (very) quick response.
I'm using some ICU features as part of a text processing pipeline that goes through both sentences and words at the same time. I'm mainly using UnicodeStrings & BreakIterators to find breaks in the text, as the rest of the pipeline works directly with wstrings.
I believe there isn't any ICU feature missing from stringi: I could easily use stri_locate_all_boundaries to find all sentence breaks and all word breaks, and then iterate through both results in the same way I'm doing with BreakIterators right now.
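To make the "iterate through both results" idea concrete, here is a rough sketch of how two precomputed boundary lists could be walked together in plain C++. It does not call stringi or ICU at all. The `Span` type, the half-open 0-based offsets, and the `words_by_sentence` name are assumptions made up for this illustration (stri_locate_all_boundaries itself returns 1-based start/end positions in R, so those would need converting first).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical representation: a half-open [start, end) span of code point
// offsets. This is an assumption of the sketch, not a stringi or ICU type.
using Span = std::pair<std::size_t, std::size_t>;

// Groups word spans under the sentence span that contains them.
// Both inputs are assumed to be sorted by start offset and non-overlapping.
std::vector<std::vector<Span>> words_by_sentence(
    const std::vector<Span>& sentences,
    const std::vector<Span>& words) {
  std::vector<std::vector<Span>> grouped(sentences.size());
  std::size_t w = 0;  // index of the next unassigned word
  for (std::size_t s = 0; s < sentences.size(); ++s) {
    // Skip words that start before this sentence, e.g. a word that
    // straddled the previous sentence boundary.
    while (w < words.size() && words[w].first < sentences[s].first) {
      ++w;
    }
    // Collect every word that ends inside this sentence.
    while (w < words.size() && words[w].second <= sentences[s].second) {
      grouped[s].push_back(words[w]);
      ++w;
    }
  }
  return grouped;
}
```

Because both lists are sorted, a single pass with two indices is enough, which mirrors advancing a sentence BreakIterator and a word BreakIterator side by side. For example, sentences `{0, 12}` and `{12, 20}` with words `{0, 5}`, `{6, 11}`, `{12, 15}`, `{16, 19}` would put two words in each sentence.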
The reason I was asking about accessing ICU directly was simply to avoid the overhead of going from a SEXP to a UTF32 string three times for each text I process (once at the start of my pipeline and twice to find the breaks). Additionally, this pipeline also makes use of RcppParallel to parallelize when processing multiple texts, so I'm not sure I'd be able to call stringi's internal API from multiple threads at the same time, since most calls are allocating memory using R's internal API.
In any case, I don't believe my use case is worth the effort of trying to extend the already mature stringi internal API to make room for this edge case. I'll continue linking to both stringi and ICU and if in the future I decide on publishing a package of my code, I may have to go through your configure.ac script to learn how to make it cross-platform.
Thanks once again for your insight! Feel free to close this question.
Yeah, there's no straightforward workaround I'm afraid.
But - is a double to-SEXP-from-SEXP conversion really a bottleneck? Did you check your algos with the profiler? Maybe it's just not worth the hassle? There might be nicer things in life to enjoy instead of engaging oneself in premature optimisation activities. 😊
Haha, wise words, indeed. Thanks for taking the time to follow up on this, I'll close the issue now. Have a great day and stay safe!
Hi @gagolews!
Thanks for the awesome stringi package and for this great example of how to link it to use it with Rcpp!
I'm building an Rcpp package that uses some ICU features and I was wondering if there's a way of accessing ICU headers by linking my package to stringi, as shown in this example.
As far as I can tell, only stringi's API is being exposed, but I was wondering if there's a way of extending this so that other devs can benefit from the great effort you put towards configuring ICU properly on every platform.
Thanks!