daveyarwood / djy

A library of character utility functions for Clojure
Eclipse Public License 1.0
37 stars 3 forks source link

Improve performance #2

Open daveyarwood opened 9 years ago

daveyarwood commented 9 years ago

This library's functions currently work for both BMP and supplementary characters by using a multimethod internally to determine whether the argument is a character, an integer or a string.

As Mikera brought up in the Google group discussion, this will probably cause significant performance issues, since we're relying on dynamic type inspection for each individual character processed.

Some careful thought is needed about how to make this library both easy to use and as fast as it needs to be for the performance-sensitive domains where it will likely be used (parsing, text analysis, data conversion).

Per Mikera, we would want to plan Clojure language changes to be aligned with this goal:

Examples of language changes that might help:

  • More analysis of argument types at call sites to use primitive functions / short-circuit multi-methods / eliminate type checks where possible
  • Suppression of the warnings that we currently get when replacing a core function (I'm thinking clojure.core/char here)
  • Compiler macros?

Ideas:

daveyarwood commented 9 years ago

See wiki for benchmarks... much improvement to be made.

solicode commented 9 years ago

I took a quick look at the source code, and I didn't see many potential gains to be had that you and Mikera haven't already identified.

I did notice one though. I would avoid using sets to check the existence of something when dealing with a small amount of items (punctuation?, mark?, symbol?, and separator? do this). For example:

(contains? #{Character/MATH_SYMBOL, Character/CURRENCY_SYMBOL,
             Character/MODIFIER_SYMBOL, Character/OTHER_SYMBOL}
  (Character/getType ^long (code-point-of ch)))

can be:

(let [t (Character/getType (code-point-of ch))]
  (or (= t Character/MATH_SYMBOL)
      (= t Character/CURRENCY_SYMBOL)
      (= t Character/MODIFIER_SYMBOL)
      (= t Character/OTHER_SYMBOL)))

I did a similar thing for a library I'm working on, and the performance improvement was about 100x. It's still not all that significant considering it's not a time consuming operation to begin with, but when dealing with tight loops which string processing often has, it could potentially add up.

I also have this macro that I'm using in my library:

(defmacro in? [coll val]
  (let [val-sym (gensym)]
    `(let [~val-sym ~val]
       (or ~@(for [x coll] `(= ~val-sym ~x))))))

So it's possible to change your contains? with in? and keep everything else the same.

Sorry, that's all I got. I can't find much else.

daveyarwood commented 9 years ago

Hey, thanks for taking a look! Every little bit of performance boost we can get will help. I'm using your in? macro now for v1.0.4.

solicode commented 9 years ago

So I was playing a bit with my experimental Japanese library, and I got some performance gains doing various things, but I'm not sure how helpful any of this will be to you. Sorry if it's not useful.

The first one is inlining (for my case, I got 2x by doing this). You see inlining used a lot in clojure.core, but the downside is that it's very ugly to write code in this style. It seems like you either have to duplicate logic, or use definline and let that attach the inline metadata for you. But even with definline, you still have to write your functions as if you're writing a macro, so there's that mental burden of quoting/unquoting and so on, which isn't a lot of fun.

I also got gains by creating non-seq versions of map/every?/some that have direct implementations for String. This worked especially well with the inlining, and I see a big improvement. It's dependent on the size of the input, but even with small strings, I'm seeing about a 10x improvement. But I'm not sure you want to add any String-specific functions in your library (being a char library and all).

And one last thing, regarding the implementation of char'. I guess it's a tradeoff: the off chance of catching an exception vs. using an if statement. It depends if you're optimizing for the average case or the worst case. The worst case being somebody dealing with a bunch of supplementary characters for input. For example, (char-range 135641 135740) throwing and catching 100 exceptions (if you realize the seq).

daveyarwood commented 9 years ago

Inlining is intriguing, and something that hadn't occurred to me. The exception-catching in char' is a good insight too. Would it actually be faster to use an if statement? We could write an if statement to check each number to see if it's > 65535 , which might have an added cost vs. using char and catching exceptions, but then again, number operations should be cheap, possibly cheaper than the cost of exception catching.

There might be something to the idea of writing string-specific versions of map/every?/some... even though this is a character library, the goal is to be able to process large strings of text quickly. So, it might make sense to have certain functions that expect strings as input and are optimized to deal with them more efficiently than a seq of characters.

If you're interested in hacking on this, pull requests are heartily accepted! Otherwise I'll give these ideas a shot the next time I get around to it.

(BTW, your Japanese library looks awesome!)

solicode commented 9 years ago

Right, the if statement is an added cost, and if you don't hit an exception, the try/catch approach should be slightly faster. But catching an exception is much more expensive than an if check. I believe it's on a completely different level. I don't have any numbers though, so I would have to confirm that and see just how big the difference is. It should be significant though.

It's also possible that with branch prediction, the if statement might not really factor in when you're dealing with strings with no supplementary characters. And when there are supplementary characters, it should be no contest. But again, it'll have to be confirmed. I don't mind looking into this. I'm curious what the results will be. :)

As for the other stuff, I'll see what I can do. I'm still very inexperienced when it comes to optimizing Clojure code, so I really have no clue what I'm doing. I'm learning as I go. I'll keep experimenting with my library, and if I find anything that could apply here too, I'll let you know or submit a PR.