Closed · xvxx closed this 5 years ago
I've been playing with this a bit more and it's too aggressive in how it caches. Even though all the programs I've tested work, it could cause weird bugs if you change a string's multibyte content around but the overall size of the string doesn't change. Like if you're building a text editor or something where text gets shuffled around.
Also, it overcomplicates the generated C++ with `<lots,of,funky,foo>`.
I'm going to try a different approach and update in a bit.
These are two little "optimizations" that, under some circumstances, make LDPL a lot faster when working with text.
I was trying to parse a big file using LDPL, one character at a time, but it was going kinda slow. So I started tracking down where the time was spent, and it looked like it was mainly in two functions: `join()` and `utf8_substr()`.

The first thing I did was run some benchmarks. It looked like `a += b` was a heck of a lot faster than `join(a, b, a)`, so I made the compiler special-case any concat-style joins.

The second one is kinda nutty: every time you call
`GET CHARACTER AT - FROM - IN`, the whole string has to be searched from the beginning and the positions of all the characters have to be re-discovered, because of the variable byte widths of characters in UTF-8. Kind of a necessary evil, since it's so nice that LDPL supports UTF-8! But it means that if you call `get character at 1000 from $filetext in var`, then `get character at 1001 from $filetext in var`, and so on in a loop, each call re-scans the string from the start, walking through all 1000+ characters to find the one you asked for. It's like that movie Memento - everyone is always getting re-introduced! For big files, this can really get sluggish.

So, I added a little `utf8cache` inside `charat()` that maps character positions to their starting byte offsets the first time a search is done on a string. This means the `get` in the loop basically becomes a constant-time lookup, instead of an O(n) scan on every call. There might still be some issues with how I compute the string id, but it seems to work okay. All the tests pass and all the examples seem to work, as well as my projects.

I started getting a benchmark ready of the slowness, but I had to bail after 30 minutes:
LDPL 3.0.4 -- 10 runs

So I ran it again with just 1 run and it took a minute:

LDPL 3.0.4 -- 1 run

Now here's the benchmark with this patch, starting out with 10 runs:

LDPL w/ character cache
So yes, much faster, but a crazy approach. And maybe it'll be slower for small strings? But it's made gild a lot faster, for one, which has been nice.
You can run the benchmark by cloning this gist and running `slow.ldpl`: https://gist.github.com/dvkt/443f5ddebe24e0fe406c051bee19cf3e

Any other ideas for how to speed up the lookup? It seems like most people pull in a big utf8 library, which I didn't want to bring in. The cache could be improved too, though.