Lartu / ldpl

COBOL-like programming language that compiles to C++. With serious dinosaurs with neckties and briefcases 🦕💼
https://www.ldpl-lang.org/
Apache License 2.0

Speed up text character lookup #93

Closed · xvxx closed 5 years ago

xvxx commented 5 years ago

These are two little "optimizations" that, under some circumstances, make LDPL a lot faster when working with text.

I was trying to parse a big file using LDPL, one character at a time, but it was going kinda slow. So I started tracking down where the time was spent, and most of it was in two functions: join() and utf8_substr().

The first thing I did was run some benchmarks. It turned out a += b was a heck of a lot faster than join(a, b, a), so I made the compiler special-case any concat-style joins.
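
For illustration, here's a minimal, self-contained sketch of why the special case pays off. The join() below is a hypothetical stand-in for the runtime helper (the real signature may differ); the point is that join(a, b, a) rebuilds and copies the whole string on every call, while a += b appends in place.

```cpp
#include <chrono>
#include <iostream>
#include <string>

// Hypothetical stand-in for the runtime's join(): builds the
// concatenation in a temporary, then assigns it to the destination,
// so join(a, b, a) copies all of a on every call.
void join(const std::string &a, const std::string &b, std::string &result) {
    result = a + b;
}

int main() {
    const std::string chunk(64, 'x');
    const int iterations = 10000;
    using clock = std::chrono::steady_clock;

    std::string via_join;
    auto t0 = clock::now();
    for (int i = 0; i < iterations; ++i)
        join(via_join, chunk, via_join);  // O(len(a)) copy per call -> O(n^2) overall
    auto t1 = clock::now();

    std::string via_append;
    for (int i = 0; i < iterations; ++i)
        via_append += chunk;              // amortized in-place append -> O(n) overall
    auto t2 = clock::now();

    std::cout << "join(a, b, a): "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    std::cout << "a += b:        "
              << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}
```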

The second one is kinda nutty: every time you call GET CHARACTER AT - FROM - IN, the whole string has to be searched from the beginning and the positions of all its characters re-discovered, because utf8 characters have variable byte widths. That's kind of a necessary evil, since it's so nice that LDPL supports utf8! But it means that if you call get character at 1000 from $filetext in var, then get character at 1001 from $filetext in var, and so on in a loop, each call re-scans the string, walking through all 1000+ characters to find the one you asked for. It's like that movie Memento - everyone is always getting re-introduced! For big files, this gets really sluggish.
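
To make the cost concrete, here's roughly what a scan-from-the-start lookup has to do (a sketch, not LDPL's actual charat()/utf8_substr()):

```cpp
#include <string>

// Number of bytes in the UTF-8 sequence that starts with this byte.
// (Assumes valid UTF-8; we never land on a continuation byte because
// we always advance by whole characters.)
size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;         // 0xxxxxxx: ASCII
    if ((lead >> 5) == 0x06) return 2; // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3; // 1110xxxx
    return 4;                          // 11110xxx
}

// Return the character at character index pos, scanning from the start
// of the string on every call: O(pos) work per lookup, so looping over
// all n characters this way costs O(n^2).
std::string utf8_char_at(const std::string &s, size_t pos) {
    size_t byte = 0;
    for (size_t chr = 0; chr < pos && byte < s.size(); ++chr)
        byte += utf8_len(static_cast<unsigned char>(s[byte]));
    if (byte >= s.size()) return "";
    return s.substr(byte, utf8_len(static_cast<unsigned char>(s[byte])));
}
```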

So, I added a little utf8cache inside charat() that maps character positions to their utf8 code points the first time a string is searched. This means each get in the loop becomes basically a constant-time lookup instead of an O(n) rescan, so the whole loop drops from O(n²) to O(n). There might still be some issues with how I compute the string id, but it seems to work okay: all the tests pass, and all the examples work, as well as my projects.
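
For illustration, a minimal sketch of the caching idea, with made-up names. This version maps character indices to byte offsets (one way to make the lookup O(1); the real patch may store the code points themselves), and keys the cache by data pointer plus byte length as a stand-in for the string id:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// utf8_len() as in the previous sketch.
size_t utf8_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead >> 5) == 0x06) return 2;
    if ((lead >> 4) == 0x0E) return 3;
    return 4;
}

// Per-string cache of character index -> byte offset, built on the
// first lookup so later lookups on the same string cost O(1). The
// (data pointer, byte length) pair plays the role of the "string id"
// mentioned above - and is exactly the part that can go stale.
struct Utf8Cache {
    std::map<std::pair<const char *, size_t>, std::vector<size_t>> offsets;

    std::string char_at(const std::string &s, size_t pos) {
        auto key = std::make_pair(s.data(), s.size());
        auto it = offsets.find(key);
        if (it == offsets.end()) {
            // First lookup on this string: walk it once, recording
            // the byte offset where every character starts.
            std::vector<size_t> offs;
            for (size_t b = 0; b < s.size();
                 b += utf8_len(static_cast<unsigned char>(s[b])))
                offs.push_back(b);
            it = offsets.emplace(key, std::move(offs)).first;
        }
        const std::vector<size_t> &offs = it->second;
        if (pos >= offs.size()) return "";
        return s.substr(offs[pos],
                        utf8_len(static_cast<unsigned char>(s[offs[pos]])));
    }
};
```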

I started putting together a benchmark of the slowness, but I had to bail after 30 minutes:

LDPL 3.0.4 -- 10 runs

$ time ./slow-bin 
*** STARTING TEST ***************************************************
> performing 10 runs
^C

real    30m29.856s
user    30m26.815s
sys 0m1.417s

So I ran it again with just 1 run and it took a minute:

LDPL 3.0.4 -- 1 run

$ ./slow-bin
*** STARTING TEST ***************************************************
> performing 1 runs
> looped 146720 times (1 runs of 146720 chars)
> took 67 seconds
*************************************************** TEST COMPLETE ***

LDPL w/ character cache

Now here's the benchmark with this patch, starting out with 10 runs:

$ ./slow-bin 10
*** STARTING TEST ***************************************************
> performing 10 runs
> looped 3408240 times (10 runs of 340824 chars)
> took 1 seconds
*************************************************** TEST COMPLETE ***

So yes, much faster, but a crazy approach. And maybe it'll be slower for small strings? But it's made gild a lot faster, for one, which has been nice.

You can run the benchmark by cloning this gist and running slow.ldpl: https://gist.github.com/dvkt/443f5ddebe24e0fe406c051bee19cf3e

Any other ideas for how to speed up the lookup? It seems like most people use a big utf8 library, which I didn't want to bring in. The cache could be improved too, though.

xvxx commented 5 years ago

I've been playing with this a bit more, and it's too aggressive in how it caches. Even though all the programs I've tested work, it could cause weird bugs if you change a string's multibyte content around but the overall size of the string stays the same - like if you're building a text editor or something where text gets shuffled around.
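
A contrived illustration of that failure mode, using the Utf8Cache sketch from my earlier comment (this file must be saved as UTF-8):

```cpp
#include <iostream>
#include <string>

int main() {
    // "añb" is 4 bytes (ñ takes 2), so its cached character
    // offsets are {0, 1, 3}.
    Utf8Cache cache;
    std::string s = "añb";
    std::cout << cache.char_at(s, 2) << '\n';  // "b"; this warms the cache

    // Same byte length (4) and, on typical small-string-optimized
    // implementations, the same data pointer - so the stale offsets
    // {0, 1, 3} get reused:
    s = "xyzw";
    std::cout << cache.char_at(s, 2) << '\n';  // "w" (byte offset 3), not "z"
}
```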

Also, it overcomplicates the generated C++ with <lots,of,funky,foo>.

I'm going to try a different approach and update in a bit.