armedbear / abcl

Armed Bear Common Lisp <git+https://github.com/armedbear/abcl/> <--> <svn+https://abcl.org/svn> Bridge

https://abcl.org#rdfs:seeAlso<https://gitlab.common-lisp.net/abcl/abcl>

Other

292 stars 30 forks source link

CLOS effective method cache doesn't work on specializers which are subtypes of `mop:specializer` #587

Closed PiotrWasik closed 1 year ago

PiotrWasik commented 1 year ago

abcl 1.9.1 performance - regexp operations 8 times slower than in abcl 1.8.0

To replicate (on Windows),

Have JDK17 (not tried others, the exact version does not seem to matter, I have the same issue at work computer), here I used OpenJDK 64-Bit Server VM Temurin-17+35 (build 17+35, mixed mode, sharing)
create the following directories: C:\dev\home where you will run cmd.exe C:\dev\usr\abcl-bin-1.8.0 (with abcl.jar in it) C:\dev\usr\abcl-bin-1.9.1 (with abcl.jar in it)
unzip sources to C:\dev\usr\abcl-bin-1.9.1\abcl-1.9.1-sources - they are used as content to scan for regexps.
put performance-testing.lisp, run-abcl-1.8.0.bat and run-abcl-1.9.1.bat from the attachments into c:/dev/home
put the .abclrc into your c:\Users\yourusername, it just initialises quicklisp

now cmd.exe, cd c:\dev\home run-abcl-1.9.1.bat (load "c:/dev/home/performance-testing") (quit) ;; this was just to get all dependencies downloaded, not sure why, but I had an impression I could not replicate it in the first run when I was downloading dependencies from quicklisp

clear your quicklisp compilation cache, in my case C:\Users\piotr\AppData\Local\cache\common-lisp

then again run-abcl-1.9.1.bat (load "c:/dev/home/performance-testing") (show-runtime) 164.31 seconds real time 245270175 cons cells NIL (quit)

run-abcl-1.8.0.bat (load "c:/dev/home/performance-testing") (show-runtime) 18.858 seconds real time 37650729 cons cells NIL (quit)

The content of the performance-testing.lisp file

(ql:quickload :cl-fad)
(ql:quickload :cl-ppcre)
(ql:quickload :arnesi)

(defun load-all-source-files ()
  (let (content-of-files)
    (cl-fad:walk-directory
     "c:/dev/usr/abcl-bin-1.9.1/abcl-1.9.1-sources/"
     (lambda (filename)
       (push (arnesi:read-string-from-file filename) content-of-files)))
    content-of-files))

(defun count-lines-with-regex (content)
  (length (remove-if-not (lambda (line) (ppcre:scan "defun|void" line)) (ppcre:split "\\n" content))))

(defun show-runtime ()
  (let ((content-of-files (load-all-source-files)))
    (time (mapcar #'count-lines-with-regex content-of-files)))
  nil)

testing-log.txt my-abclrc.txt run-abcl-1.8.0.bat.txt run-abcl-1.9.1.bat.txt performance-testing.lisp.txt

easye commented 1 year ago

I have confirmed that there seems to be a slowdown:

CL-USER> (show-runtime)
5.024 seconds real time
29878199 cons cells
NIL
CL-USER> (lisp-implementation-version)
"1.9.2-dev"
"OpenJDK_64-Bit_Server_VM-MacPorts-17.0.7+7"
"aarch64-Mac_OS_X-13.2.1"

CL-USER(2): (show-runtime)
1.803 seconds real time
2399949 cons cells
NIL
CL-USER(3): (lisp-implementation-version)
"1.8.0"
"OpenJDK_64-Bit_Server_VM-MacPorts-17.0.7+7"
"aarch64-Mac_OS_X-13.2.1"

Still not quite sure what is being tested, as there is a fair amount of i/o going on as well as in-memory string creation, not just regexp matching. Performing some basic profiling between 1.8.0 and 1.9.1 would be the next step to figure out what might be causing this.

PiotrWasik commented 1 year ago

Thanks for checking this. I will try to assist. Your slowdown is not as radical as mine. I/O is already excluded. in show-runtime we are only timing this part: (mapcar #'count-lines-with-regex content-of-files) which I agree does too much

I added the following

(defun show-runtime-2 ()
  (let ((content-of-files (load-all-source-files))
    (scanner (ppcre:create-scanner "\\n")))
    (time (mapcar (lambda (content) (ppcre:split scanner content)) content-of-files)))
  nil)

(defun show-runtime-3 ()
  (let ((content-of-files (load-all-source-files))
    (scanner (ppcre:create-scanner "defun|void"))
    (all-lines))
    (dolist (content content-of-files)
      (dolist (line (ppcre:split "\\n" content))
    (push line all-lines)))
    (time (remove-if-not (lambda (line) (ppcre:scan scanner line)) all-lines)))
  nil)

and now the tests are like this, with (show-runtime) being like yesterday

CL-USER(2): (lisp-implementation-version)
"1.8.0"
"OpenJDK_64-Bit_Server_VM-Eclipse_Adoptium-17+35"
"amd64-Windows_10-10.0"
CL-USER(3): (show-runtime)
18.98 seconds real time
37743479 cons cells
NIL
CL-USER(4): (show-runtime-2)
0.373 seconds real time
1764204 cons cells
NIL
CL-USER(5): (show-runtime-3)
2.803 seconds real time
296340 cons cells
NIL

CL-USER(2): (lisp-implementation-version)
"1.9.1"
"OpenJDK_64-Bit_Server_VM-Eclipse_Adoptium-17+35"
"amd64-Windows_10-10.0"
CL-USER(3): (show-runtime)
165.322 seconds real time
245877991 cons cells
NIL
CL-USER(4): (show-runtime-2)
4.64 seconds real time
14110440 cons cells
NIL
CL-USER(5): (show-runtime-3)
9.084 seconds real time
11575608 cons cells
NIL
CL-USER(6):

All 3 cause slowdown. split certainly creates new strings, scan - I don't think so (but it is complex internally).

CL-USER(8): (ppcre:scan "scan" "scan")
0
4
#()
#()

So I don't know what to check next, any suggestions please? Also note all 3 tests cons more in 1.9.1. How to force creating thousands of unique strings? I tried this:

CL-USER(6): (time (dotimes (i 1000000) (arnesi:random-string)))
2.243 seconds real time
3 cons cells
NIL
CL-USER(7): (lisp-implementation-version)
"1.8.0"
"OpenJDK_64-Bit_Server_VM-Eclipse_Adoptium-17+35"
"amd64-Windows_10-10.0"

CL-USER(10): (time (dotimes (i 1000000) (arnesi:random-string)))
4.092 seconds real time
3 cons cells
NIL
CL-USER(11): (lisp-implementation-version)
"1.9.1"
"OpenJDK_64-Bit_Server_VM-Eclipse_Adoptium-17+35"
"amd64-Windows_10-10.0"

1.9.1 is slower again but they both don't cons.

PiotrWasik commented 1 year ago

okay, it does not look like it is string allocation; this example does not allocate any string I think, the (time ...) body is only ppcre:do-scans and it is 13 times slower in 1.9.1 vs 1.8.0.

the performance-testing.lisp file

(ql:quickload :cl-fad)
(ql:quickload :cl-ppcre)
(ql:quickload :arnesi)

(defun load-all-source-files ()
  (let (content-of-files)
    (cl-fad:walk-directory
     "c:/dev/usr/abcl-bin-1.9.1/abcl-1.9.1-sources/"
     (lambda (filename)
       (push (arnesi:read-string-from-file filename) content-of-files)))
    content-of-files))

(defun show-runtime-10 ()
  (let ((content-of-files (load-all-source-files)))
    (time
     (dolist (content content-of-files)
      (cl-ppcre:do-scans (match-start match-end rs re "\\n" content))))))

(princ "test Version 10")
(princ (multiple-value-list (lisp-implementation-version)))

then when I (load "performance-testing"), I get these times

test Version 10(1.8.0 OpenJDK_64-Bit_Server_VM-Eclipse_Adoptium-17+35 amd64-Windows_10-10.0)
T
CL-USER(2): (show-runtime-10)
7.654 seconds real time
27784928 cons cells
NIL

test Version 10(1.9.1 OpenJDK_64-Bit_Server_VM-Eclipse_Adoptium-17+35 amd64-Windows_10-10.0)
T
CL-USER(2): (show-runtime-10)
100.976 seconds real time
158596124 cons cells
NIL

I tried (subseq ...) to split contents of files myself from precalculated positions of newlines, but its performance is very similar between both implementations.

I tried to look inside ppcre library and I it looks that for scanning it first parses "\n" into #\newline and passes it as (list :group #\newline) to the next function in https://github.com/edicl/cl-ppcre/blob/master/api.lisp line 68, then line 111 (convert parse-tree) will return an object of type str as its primary value so it will be bound to regex. it looks like then line 144 (create-matcher-aux regex #'identity) will jump to https://github.com/edicl/cl-ppcre/blob/master/closures.lisp line 192.... it is a complex library, is there a way to analyse what is happening without dissecting cl-ppcre? maybe a java profiler can help? what should I check next please?

easye commented 1 year ago

[…] it is a complex library, is there a way to analyse what is happening without dissecting cl-ppcre? maybe a java profiler can help? what should I check next please?

(Un)surprisingly, I am not particularly knowledgeable in profiling ABCL. Partly this stems from my general development approach in the course of accompanying the Armed Bear on its journey to ANSI conformance to prefer correctness over efficiency. Usually, unless something is really performing badly, I don't even really attempt to measure or gauge performance differences: as long as the implementation gets "more correct", I take the patch. Oddly, even after twelve years of reaching ANSI conformance, the majority of issues in ABCL involve fixing compatibility or correctness, in a (hopefully) ever asymptotically approaching manner.

It would be very useful to have more informational resources towards profiling ABCL. Here is an initial stab of what I know. We should collect these and (hopefully) subsequent note into some aspect of ABCL developer resources (perhaps under https://github.com/armedbear/abcl/tree/master/tools).

Bisect the offending change

If one has a good test that shows the problem, since ABCL doesn't actually get too many patches it might be easiest to use a Git bisect to find the exact commit which causes the performance degradation. The code in https://github.com/armedbear/abcl/blob/master/tools/check.lisp was written for the Mercurial bisect command but should be able to be ported trivially.

ABCL currently has the problem of not having tags in git for the various branches https://github.com/armedbear/abcl/issues/370. I usually use the release dates listed at https://abcl.org/trac/ to find these ranges.

Java Profiling with Netbeans

Since ABCL runs on the JVM, the "actual" performance is of course how that JVM is performing wrt. to its hosting operating system under the given task. There are many JVM profiling tools, but I only really know anything about the Netbeans IDE profiler https://netbeans.apache.org/kb/docs/java/profiler-intro.html. Netbeans might well be an minority choice as industry-standard practice in 2023 is probably to use either Oracle Java Flight Recorder or plugins for IntelliJ's IDE. Netbeans has the virtues of a) being free as in libre and money and b) being able to use the ABCL source tree as a Netbeans project due to the artifacts in https://github.com/armedbear/abcl/blob/master/nbproject/.

Since this instrumentation is happening in the JVM hosting ABCL, such profiling is the least likely to "disturb" whatever performance one is trying to measure.

Common Lisp Profiling

https://github.com/armedbear/abcl/blob/master/src/org/armedbear/lisp/profiler.lisp contains a simple Common Lisp based profiler that measures call and hot counts for method executions. There also seems to be support for recording call times, but I have never used this machinery.

--

Hope that helps. I will try to find time to look at this in Netbeans, but all my current ABCL time is being spent fixing the ANSI/Gray streams problems with the pretty printer to enable SLIME to work again https://github.com/armedbear/abcl/pull/586.

PiotrWasik commented 1 year ago

Thanks!

All 3 suggestions are helpful. I will probably start with bisecting commit history, and slowly make my way through it. My current hunch is that there are forms that incorrectly fall back to interpreted mode, I remember I saw improvements announcement around this area (possibly to overcome jvm opcodes limit per class). I don't get why Windows seems to be affected most though. It will take some time, so I will not be spamming this thread until I find something useful.

I totally agree with your priorities putting correctness and specs conformance first. ABCL is incredibly useful to navigate corporate applications that are mostly written in Java. The fact that I can download libraries from quicklisp and they almost always work is a testimony to how compliant ABCL is, and precisely this hassle free experience makes ABCL so useful. However because I reverted to 1.8.0 when I discovered real world tasks take over 400 seconds in 1.9.1 where they took say 120 seconds in 1.8.0, to avoid being stuck in 1.8.0 forever, I will keep looking into this issue. So thanks again for looking after the Bear :)

PiotrWasik commented 1 year ago

I believe this is the offending commit: d0a8264e 2023-02-17 Use ATOMIC-DEFGENERIC for MOP:SPECIALIZER discrimination. I takes 104.121 secs to execute and the previous commit e2d74c65 : 2023-02-09 : DEFMETHOD accepts the subtype of MOP:SPECIALIZER as a specializer takes 11.008 secs to execute.

I don't understand it but maybe your declaim has typo?

(declaim (notinline add-direct-ubclass remove-direct-subclass))

I generated commit.txt by executing

git log  --date=short --pretty=format:"%h : %ad : %s" > ../home/commits.txt

commits.txt

This file is for ABCL being tested. performance-testing.lisp.txt

I wrote bisect.lisp that is inspired by your check.lisp, it has to be run under something like sbcl, because it clears cache by removing every directory that starts with abcl-. It has a bug so it never stopped but it generated enough to pinpoint the commit. bisect.lisp.txt

in sbcl you execute:

(run-loop)

it creates the new slow run, the old fast run and starts bisecting.

These files are autogenerated: performance-log.txt performance-good-bad.lisp.txt

The performance-log.txt shows performance per commit, performance-good-bad.lisp shows pairs of currently discovered newest "good" and oldest "bad" commits - you can see it started looping at commit positions 39 and 40, so it should have stopped, but we probably pinpointed the issue.

easye commented 1 year ago

I don't understand it but maybe your declaim has typo?
(declaim (notinline add-direct-ubclass remove-direct-subclass))

This is definitely a typo.

Working on trying to fix this for abcl-1.9.2 <https://github.com/armedbear/abcl/issues/594>

easye commented 1 year ago

Confirmed that changes for <https://github.com/armedbear/abcl/issues/539> in allowing defmethod to specialize on subtypes of mop:specializer degrades performance.

The performance degradation is due to disabling the use effective method function ("emf") cache <https://github.com/armedbear/abcl/blob/master/src/org/armedbear/lisp/clos.lisp#L2576>.

To fix this we need to overhaul the cache <https://github.com/armedbear/abcl/blob/master/src/org/armedbear/lisp/EMFCache.java#L161>.

Frankly, I need to review the EMF Cache structure, and our MOP implementation to get a handle on this, so I need a little time here.

@alanruttenberg From what you understand, with a suitable amount of elbow grease, would it be possible to extend the Java-side caching here to cover mop:specializer subtypes, or is there something fundamental here that we can't cache?

alanruttenberg commented 1 year ago

I'm going to have to dig deeper to understand what is going on. Not sure I'll be able to until week after next.

easye commented 1 year ago

After re-reading Rhodes and Newton 2008, it seems that it would greatly improve our user experience of implementing discrimination based on mop:specializer subclasses by adding the following three generic functions to the MOP.

(defgeneric mop:parse-specializer-using-class (generic-function specializer-name)
  (:documentation
   "This generic function returns an instance of MOP:SPECIALIZER,
  representing the specializer named by SPECIALIZER-NAME in the
  context of GENERIC-FUNCTION")
  (:method ((gf standard-generic-function) (name t))
    (warn "unimplemented mop:parse-specializer-using-class")))

(defgeneric mop:unparse-specializer-using-class (generic-function specializer)
  (:documentation
   "This generic function returns the name of SPECIALIZER for generic
   functions with class the same as GENERIC-FUNCTION")
  (:method ((gf standard-generic-function) (name mop:specializer))
    (warn "unimplemented mop:unparse-specializer-using-class")))

(defgeneric mop:make-method-specializers-form (generic-function method)
  (:documentation
   "This function is called with (maybe uninitialized, as with the
   analogous arguments to mop:make-method-lambda) generic-function and
   method, and a list of specializer names (being the parameter
   specializer names from a defmethod form, or the symbol t if
   unsupplied), and returns a form that evaluates to a list of
   specializer objects in the lexical environment of the defmethod
   form.")
  (:method ((gf generic-function) (method standard-method))
    (warn "unimplemented mop:make-method-specializers-form")))

Their implementation obviates the need for requiring the reader macro as tested in @bohonghuang's motivating code <<file:t/mop-specializer>>

alanruttenberg commented 1 year ago

Could you try your tests with the following change:

diff --git a/src/org/armedbear/lisp/clos.lisp b/src/org/armedbear/lisp/clos.lisp
index 9b33cc08..b0fc9944 100644
--- a/src/org/armedbear/lisp/clos.lisp
+++ b/src/org/armedbear/lisp/clos.lisp
@@ -2573,7 +2573,7 @@ to ~S with argument list ~S."
                   (wrap-emfun-for-keyword-args-check gf emfun non-keyword-args
                                                      applicable-keywords)))
           ;;; Cache only understand classes and eql specializers
-          (when (eq (class-of gf) 'standard-generic-function)
+          (when (typep gf 'standard-generic-function)
             (cache-emf gf args emfun))
           (funcall emfun args))
         (apply #'no-applicable-method gf args))))

I believe that my previous check was wrong, causing there to be nothing added to the emf-cache.

alanruttenberg commented 1 year ago

This test, however, is wrong I realize. Bleh. I don't think I want to mess with the Java EMF code, so I want to exclude caching methods whose arguments are specializers other than eql. That test won't do it. To be continued. At least the test will verify that that line is the bottleneck.

alanruttenberg commented 1 year ago

Ok, this is a better check. It checks if any of the applicable method have a specializer other than eql-specializer or class and if so doesn't use the cache. This further slows down slow-method-lookup, but I haven't quantified how much. Clearly it would be better to extend the emf-cache to handle any specializer. I'll look at the java code again.

diff --git a/src/org/armedbear/lisp/clos.lisp b/src/org/armedbear/lisp/clos.lisp
index 9b33cc08..1af50659 100644
--- a/src/org/armedbear/lisp/clos.lisp
+++ b/src/org/armedbear/lisp/clos.lisp
@@ -2572,8 +2572,15 @@ to ~S with argument list ~S."
             (setf emfun
                   (wrap-emfun-for-keyword-args-check gf emfun non-keyword-args
                                                      applicable-keywords)))
-          ;;; Cache only understand classes and eql specializers
-          (when (eq (class-of gf) 'standard-generic-function)
+          ;; Cache only understand classes and eql specializers. Check
+          ;; the applicable methods and if any have a specializer that
+          ;; isn't an eql-specializer or class. don't cache
+          (unless (some (lambda(m) 
+                          (some 
+                           (lambda(x)
+                             (and (typep x 'specializer) (not (typep x 'eql-specializer)) (not (typep x 'class))))
+                           (method-specializers m)))
+                        applicable-methods)
             (cache-emf gf args emfun))
           (funcall emfun args))
         (apply #'no-applicable-method gf args))))

alanruttenberg commented 1 year ago

Ok, I've read the java. My understanding is that it can not handle arbitrary specializers. The cache key is list of types of arguments and a special eqlspecializer object. To compute the key when caching the args are traversed and first the arg is checked to see if it has ever been used as a specializer. If so it's considered an eql specializer otherwise the class is used. The list of specializers/classes is the key that will be looked up.

This leads to a surprising situation in which a given method might have several cache entries, some with an eql specializer in a position that was not eql specialized. Comment from the code is below:

   * <p>Consider:
   * <pre><tt>
   * (defgeneric f (a b))
   *
   * (defmethod f (a (b (eql 'symbol)))
   *   "T (EQL 'SYMBOL)")
   *
   * (defmethod f ((a symbol) (b (eql 'symbol)))
   *   "SYMBOL (EQL 'SYMBOL)")
   *
   * (f 12 'symbol)
   * => "T (EQL 'SYMBOL)"
   *
   * (f 'twelve 'symbol)
   * => "SYMBOL (EQL 'SYMBOL)"
   *
   * (f 'symbol 'symbol)
   * => "SYMBOL (EQL 'SYMBOL)"
   *
   * </tt></pre>
   *
   * After the two above calls <tt>cache</tt> will contain three keys:
   * <pre>
   * { class FIXNUM, EqlSpecialization('SYMBOL) }
   * { class SYMBOL, EqlSpecialization('SYMBOL) }
   * { EqlSpecialization('SYMBOL), EqlSpecialization('SYMBOL) }.
   * </pre>

With some other kind of specializer you would need to figure out what to use as hash key for arguments in that position. Whether there is any useful hash to compute really depends on the type of specializer and that knowledge would have to be exposed somehow to the cache functions.

So my current thinking is that the best we an do is the check in https://github.com/armedbear/abcl/issues/587#issuecomment-1597840833 .

Everything I'm saying should be checked, of course...

alanruttenberg commented 1 year ago

As another example suppose you have

(defmethod foo ((a number)) (print 'foo))

If you call (foo 1) you get 1 EMF entry (key = fixnum)
If you call (foo 1.5) you get another EMF entry (key = single-float)
If you call (foo 1/2) you get another EMF entry (key = ratio)

The multiple entries are so that on retrieval the cache lookup just as to look at the list of classes of the arguments to get a key used to look up the method.

So bottom line the EMF tries to do as little work as possible when doing a lookup, trading that ability for extra space in the cache and potentially multiple calls (one for each unique list of argument classes) to the slow dispatch routine and corresponding multiple entries in the cache.

A custom specializer can do arbitrary work to determine whether an argument falls under the specializer, work that has no place to be done in the cache code and even if you did so it's not clear that one could come up with something that can be part of a hash key.

easye commented 1 year ago

So bottom line the EMF tries to do as little work as possible when doing a lookup, trading that ability for extra space in the cache and potentially multiple calls (one for each unique list of argument classes) to the slow dispatch routine and corresponding multiple entries in the cache.

I don't think space is much of an issue in our EMFCache implementation at this point: avoiding the slow dispatch routines seems much more important.

A custom specializer can do arbitrary work to determine whether an argument falls under the specializer, work that has no place to be done in the cache code and even if you did so it's not clear that one could come up with something that can be part of a hash key.

I feared that the arbitrary behavior from custom specializer precludes a caching strategy, so thanks for providing more confirmation for that hunch.

Your refined test in <https://github.com/armedbear/abcl/issues/587#issuecomment-1597840833> makes more sense as well now. I will try applying that to see if the speed loss in cl-ppcre goes away. Thanks!

easye commented 1 year ago

I will try applying that to see if the speed loss in cl-ppcre goes away.

With the refined test for using the cache, the speed loss drops considerably under an order of a magnitude difference: we are still somewhat slower than abcl-1.8.0, but within an acceptable value to consider this issue addressed, at least for the time being.

If anyone has an idea how to come up with a better caching strategy we could implement, please pipe up.