gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/
Other
300 stars 45 forks source link

Building in ALT-REP to stringi #474

Open traversc opened 2 years ago

traversc commented 2 years ago

Have you considered implementing an ALT-REP string class? I think done properly, you'd see a large increase in performance across the board. There are many reasons why:

If there's interest, I'd be happy to develop and work on it.

To flesh it out a bit, I think you could use an ALT-REP class that's represented by standard STL structures:

std::vector<std::string>

You don't need to keep track of encoding, if you can assume UTF-8.

You'd probably want some global configuration parameter:

stri_use_alt_rep(bool)

You'd have to replace every interaction with R string memory with a conditional.

CHAR
SET_STRING_ELT
STRING_ELT
mkChar*
Rf_allocVector(STRSXP,...)

And replace any comparison of address for testing string equality (not sure if stringi does so).

There are probably things I'm forgetting and it's a lot of work, but I think clearly defined.

gagolews commented 2 years ago

I was actually thinking about giving stringi a major re-write for quite a long time. Now that the Windows-UCRT build of R assumes all strings are natively UTF-8, and the users are likely to transition to it in the future, it has started to make more sense.

And, yes, I am sure that ALTREP should be supported too, I could use your help in the future, thanks!

I took a look at your stringfish package – nice work! I was wondering what code did you use for generating the benchmarks summarised at https://github.com/traversc/stringfish/blob/master/vignettes/bench_v2.png ? I'd like to play with it a bit.

traversc commented 2 years ago

Here's the code for that plot: https://github.com/traversc/stringfish/blob/master/inst/extra_tests/benchmark_test.r

(I updated it a bit to make it easier to include stringi in the plot output)

Glad you're interested in ALTREP support. I'll fork stringi and add in ALT-REP to a few functions, as a proof of concept.

gagolews commented 2 years ago

Okay, nice, I'll be happy to take a look at the prototype

I see that stringfish uses PCRE. To be a bit fairer, I think you should be comparing the timings for gsub and friends with perl=TRUE set. Can you generate the benchmark results featuring a base vs stringi vs stringfish comparison and post them here? Thanks

traversc commented 2 years ago

Here's the plot setting perl = TRUE. Much less of a difference for regex substitution, but still lots of performance to be gained in various places. I believe the difference between the blue and yellow bars would be what you could expect to gain in stringi.

gagolews commented 2 years ago

So the most significant speed-up gain would be due to not using the R's CHARSXP cache? (for the 1-threaded version)

Another question: is your ALTREP-based stringfish framework compatible with read/saveRDS?

traversc commented 2 years ago

Another question: is your ALTREP-based stringfish framework compatible with read/saveRDS?

Yes, all ALTREP frameworks are inherently compatible wiht read/saveRDS.

Here's the prototype: https://github.com/traversc/stringi

Benchmark code: https://gist.github.com/traversc/f357c5f1a4b0368649849dd3d1f49d14

Dataset used for benchmark:

cd ~/Downloads
wget https://data.deepai.org/enwik8.zip
unzip enwik8.zip

Running the benchmark:

Rscript bench_stringi.R
# 2.616 sec elapsed

Rscript bench_stringi_with_alt_rep.R
# 1.412 sec elapsed

# stringi CRAN version, comment out stri_use_alt_rep(FALSE)
Rscript bench_stringi.R
# 2.562 sec elapsed

PS: I think it's important to run the benchmark in separate R sessions for a true comparison. I've found that even after proper garbage collection, there seems to be a big difference running a command multiple times in a row.

But even if you run the benchmark in the same R session, ALTREP is faster.

gagolews commented 2 years ago

Thanks for the working prototype. I will get back to you some time later this year when I hope I will be able to come up with a prototype of a re-write (probably something a'la stringi2 - that would be a separate project) of the package. I definitely do not want to make any major changes is the upstream version of stringi, where stability is of great importance.

traversc commented 2 years ago

Sounds good, looking forward to it!

eddelbuettel commented 1 year ago

Just coming in here waving :wave: -- @traversc and I were chatting about possible fast and lightweight (enough) containers for string vectors. I am currently working a bit arrow objects that have character vectors (in their encoding of contiguous vector plus a vector of offsets, ie c("the", "black", "cat") becomes "theblackcat" and c(0,3,8,11) ) which is quick to pass around -- and it would be nice to have something quick and light and powerful to work with it. Of course once we mod the vector the contiguous blob may no longer be viable but at least for read access before that a std::vector<string_view> should be easy and could be ALTREP backed. Anyway, food for thought. Happy to help, time permitting.

gagolews commented 1 year ago

Might be a good one!

My three Aussie cents:

Possible issue worth considering: R's string cache is bypassed... (can be a good thing)

Would be nice to agree on a common representation across many packages.

eddelbuettel commented 1 year ago

The c("the", "black", "cat") with c(0,3,8,11) ) is a given and cannot be changed. I do not know who started it but it seems common, it is how TileDB stores on disks / retrieves and also what arrow does. (There is one fine difference whether vector offsets is length n or n+1 as shown here. The latter is easier and preferred.

And no null termination anywhere :crying_cat_face: But that is what is out there and what would be most efficient to use.

Now, we could of course define another representation standard but that would start as an uphill battle with wind in our face :crying_cat_face:

traversc commented 1 year ago

Both @eddelbuettel and I have run into bottlenecks dealing with large amounts of string processing. Existing ALTREP frameworks (e.g. in vrooom, arrow, etc.) don't really help because materialization happens too often, e.g. whenever you use dplyr or data.table.

So an ALTREP "common representation across many packages" is very much the goal and would be huge for the R community :)

Figuring out the very best optimal representation does not need to be a bottleneck to getting started. We can hide the implementation behind a set of access and modify API functions and test out various implementations without too much work.

Like @eddelbuettel I would also have some time to help.

PS: I believe the Rstudio folks would also be interested (and hopefully supportive)

gagolews commented 1 year ago

What about doing this at the R, not just package level? Maybe we should ping Tomas Kalibera and ask what he thinks about it... I'm a big proponent of unity..

eddelbuettel commented 1 year ago

I always have Duncan Murdoch in the back of my ear: "if something can be done at the package level ..." It's just easier that way.

You raise a fair point. It may just make everything a tad harder to pull off.

gagolews commented 1 year ago

I agree.

We also need to think about how NA_character_s would be represented. NAs in the index vector (like c(0,3,NA,8,11))? But then it would slow down string extraction for "sparse" vectors (with many missing strings).

eddelbuettel commented 1 year ago

We probably have to do what Arrow and others do with is an extra (bitmap) vector to signal it. nanoarrow has helpers.

(It took me some time to come around to this as I actually truly madly deeply love how R has NA/NaA in ints and chars (and bools (!!) etc). But these days interop is likely as if not more important and if we want to do this for 'medium data at scale' we probably have to go with the times and have Arrow interop anyway.)

I'd have to double check but I think in the offsets vector it then simply repeat the last position implying length zero of the NA element. But when nullabillity is set (Arrow makes that optional) then there is an additional vector flagging this.

gagolews commented 1 year ago

A zero-length string with an additional NA-marker makes sense to me too. This will not paralyse other algorithms.

etiennebacher commented 7 months ago

Hello all, I don't have any knowledge in C/C++ or other low-level languages (just dabbling a bit in Rust) but I thought you might be interested in the string type in polars, since they rewrote it very recently and had important performance gains: https://pola.rs/posts/polars-string-type/

I hope this fits in this conversation, and sorry for the spam otherwise

gagolews commented 7 months ago

Thanks!