edicl / flexi-streams

Flexible bivalent streams for Common Lisp
http://edicl.github.io/flexi-streams/

Poor performance? #50

Open mishoo opened 11 months ago

mishoo commented 11 months ago

Hi folks,

I'm working on a toy project that involves encoding/decoding some binary data, and I happened to notice that using flexi-streams' in-memory streams is much, much slower than reading/writing a binary file. Just wondering if there is any reason for this weirdness. Here is a simple test function (it writes about 50 MB to a file under /tmp/).

(ql:quickload '("trivial-utf-8" "flexi-streams"))

(defmacro mytime (title &body body)
  `(progn
     (format t "Timing: ~A~%" ,title)
     (time ,@body)))

(defun test-flexi-streams (&optional (filename "/tmp/flexistreams.bin"))
  (let* ((str "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789ăşţîâĂŞŢÎÂ")
         (size (trivial-utf-8:utf-8-byte-length str))
         (times 500000)
         (sequence nil)) ; will hold the octets written to the in-memory stream
    (mytime "write to built-in stream"
      (with-open-file (output filename :element-type '(unsigned-byte 8)
                                       :direction :output
                                       :if-does-not-exist :create
                                       :if-exists :supersede)
        (loop repeat times
              do (trivial-utf-8:write-utf-8-bytes str output))))

    (mytime "write to flexi in-memory stream"
      (let ((output (flexi-streams:make-in-memory-output-stream)))
        (loop repeat times
              do (trivial-utf-8:write-utf-8-bytes str output))
        (setf sequence (flexi-streams:get-output-stream-sequence output))))

    (mytime "read from built-in stream"
      (with-open-file (input filename :element-type '(unsigned-byte 8))
        (loop repeat times
              do (trivial-utf-8:read-utf-8-string input :byte-length size))))

    (mytime "read from in-memory stream"
      (let ((input (flexi-streams:make-in-memory-input-stream sequence)))
        (loop repeat times
              do (trivial-utf-8:read-utf-8-string input :byte-length size))))))

Results here look like this:

Timing: write to built-in stream
Evaluation took:
  0.336 seconds of real time
  0.334568 seconds of total run time (0.291399 user, 0.043169 system)
  99.70% CPU
  601,427,574 processor cycles
  0 bytes consed

Timing: write to flexi in-memory stream
Evaluation took:
  2.256 seconds of real time
  2.256795 seconds of total run time (2.256746 user, 0.000049 system)
  [ Real times consist of 0.003 seconds GC time, and 2.253 seconds non-GC time. ]
  [ Run times consist of 0.001 seconds GC time, and 2.256 seconds non-GC time. ]
  100.04% CPU
  4,058,053,884 processor cycles
  134,180,240 bytes consed

Timing: read from built-in stream
Evaluation took:
  0.566 seconds of real time
  0.567083 seconds of total run time (0.543763 user, 0.023320 system)
  [ Real times consist of 0.003 seconds GC time, and 0.563 seconds non-GC time. ]
  [ Run times consist of 0.001 seconds GC time, and 0.567 seconds non-GC time. ]
  100.18% CPU
  1,019,238,246 processor cycles
  448,008,624 bytes consed

Timing: read from in-memory stream
Evaluation took:
  3.040 seconds of real time
  3.033796 seconds of total run time (3.027243 user, 0.006553 system)
  [ Real times consist of 0.003 seconds GC time, and 3.037 seconds non-GC time. ]
  [ Run times consist of 0.001 seconds GC time, and 3.033 seconds non-GC time. ]
  99.80% CPU
  5,456,326,518 processor cycles
  447,991,600 bytes consed

I'm using SBCL 2.3.9.63-23c796775.

I'm not sure anything can be done about it, but if it can be improved, I guess a lot of projects will benefit. In my case I'm forced to ditch flexi-streams for now; the difference is so big that if I have the data in memory, I'm better off dumping it into a file and reading it back from there than going through in-memory-input-stream.

phoe commented 11 months ago

This is likely because the methods for vector streams always operate element-by-element:

https://github.com/edicl/flexi-streams/blob/74a1027311371a57258eba1bc908e050f5702277/in-memory.lisp#L235-L252

https://github.com/edicl/flexi-streams/blob/74a1027311371a57258eba1bc908e050f5702277/in-memory.lisp#L262-L270

It should be possible to optimize these with techniques like (setf (subseq ...) (subseq ...)) for the case of a null transformer.
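
For the output side, a rough sketch of what that could look like (untested; the internal accessors are the ones visible in in-memory.lisp, while the class name and the adjustable/fill-pointer assumptions about the underlying vector are mine):

;; Sketch of a bulk write path for the null-transformer case.
(defmethod trivial-gray-streams:stream-write-sequence
    ((stream flexi-streams::vector-output-stream) sequence start end &key)
  (if (flexi-streams::in-memory-stream-transformer stream)
      (call-next-method) ; transformer present: keep the per-octet path
      (let* ((vector (flexi-streams::vector-stream-vector stream))
             (index (flexi-streams::vector-stream-index stream))
             (new-end (+ index (- end start))))
        ;; Assumes the vector is adjustable with a fill pointer, as
        ;; created by make-output-vector: grow it once, then bulk-copy.
        (when (< (array-dimension vector 0) new-end)
          (adjust-array vector (max new-end (* 2 (array-dimension vector 0)))))
        (setf (fill-pointer vector) (max (fill-pointer vector) new-end))
        (replace vector sequence :start1 index :start2 start :end2 end)
        (setf (flexi-streams::vector-stream-index stream) new-end)
        sequence)))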

mishoo commented 11 months ago

@phoe that's clearly one case that can be optimized. But trivial-utf-8 only does ordinary write-byte and read-byte on the stream (it doesn't call read-sequence or write-sequence), so it's probably something else here.

phoe commented 11 months ago

Good catch - in that case, it might be generic dispatch overhead, since each byte means a single GF call. Are you required to use trivial-utf-8, or can you do something that can actually make use of write-sequence?

mishoo commented 11 months ago

trivial-utf-8 isn't a hard requirement. For this case I tried read-sequence followed by sb-ext:octets-to-string, and performance is marginally better, but still nowhere near the native stream. And it conses more.
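
Roughly this, for reference (a sketch; input and size are the variables from the test function above):

;; read-sequence the raw octets in one go, then decode with SBCL's built-in:
(let ((buffer (make-array size :element-type '(unsigned-byte 8))))
  (read-sequence buffer input)
  (sb-ext:octets-to-string buffer :external-format :utf-8))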

But in any case, I'm doing a lot of read-byte/write-byte on these streams, and 4 to 6 times slower than a native stream is plainly unusable. Somehow I want to believe the issue is something other than generic dispatch overhead. :)

phoe commented 11 months ago

> Somehow I want to believe the issue is something other than generic dispatch overhead. :)

Profile it, maybe? That way we'll see where most of the time is spent.

mishoo commented 11 months ago

Hmm, here's the result of a quick (sb-profile:profile "FLEXI-STREAMS") run (with the number of repetitions in the function above reduced from 500k to 10k):

measuring PROFILE overhead..done
  seconds  |     gc     | consed |   calls   |  sec/call  |  name  
--------------------------------------------------------
     0.112 |      0.000 |      0 | 1,640,000 |   0.000000 | FLEXI-STREAMS::TRANSFORM-OCTET
     0.066 |      0.000 |      0 | 1,640,000 |   0.000000 | FLEXI-STREAMS::CHECK-IF-OPEN
     0.012 |      0.000 |      0 |   820,000 |   0.000000 | (SETF FLEXI-STREAMS::VECTOR-STREAM-INDEX)
     0.011 |      0.000 |      0 | 1,640,001 |   0.000000 | FLEXI-STREAMS::VECTOR-STREAM-VECTOR
     0.010 |      0.000 |      0 | 1,640,000 |   0.000000 | FLEXI-STREAMS::VECTOR-STREAM-INDEX
     0.007 |      0.000 |      0 | 1,640,000 |   0.000000 | FLEXI-STREAMS::IN-MEMORY-STREAM-TRANSFORMER
     0.002 |      0.000 |      0 |   820,000 |   0.000000 | FLEXI-STREAMS::VECTOR-STREAM-END
     0.000 |      0.000 | 32,736 |         1 |   0.000035 | FLEXI-STREAMS:GET-OUTPUT-STREAM-SEQUENCE
     0.000 |      0.000 |      0 |         1 |   0.000029 | FLEXI-STREAMS:MAKE-IN-MEMORY-INPUT-STREAM
     0.000 |      0.000 |      0 |         2 |   0.000007 | FLEXI-STREAMS::MAKE-OUTPUT-VECTOR
     0.000 |      0.000 |      0 |         1 |   0.000011 | (SETF FLEXI-STREAMS::VECTOR-STREAM-VECTOR)
     0.000 |      0.000 |      0 |         1 |   0.000002 | FLEXI-STREAMS:MAKE-IN-MEMORY-OUTPUT-STREAM
--------------------------------------------------------
     0.219 |      0.000 | 32,736 | 9,840,007 |            | Total

estimated total profiling overhead: 12.50 seconds
overhead estimation parameters:
  6.18e-9s/call, 1.2706499e-6s total profiling, 6.0573797e-7s internal profiling

So yeah, as I suspected, the transformers are a big issue. Is anyone actually using this feature?

phoe commented 11 months ago

With the current behavior, transform-octet fetches the stream's transformer via in-memory-stream-transformer for every single byte. Perhaps that guards against some multithreading scenario (a transformer being modified mid-I/O operation?), but maybe the lookup could happen just once per operation. Each per-byte call incurs an extra GF dispatch.

This still won't fix the problem of needing to call it for every byte if you end up using write-byte rather than write-sequence, but it would mean that writing a 1000-byte sequence calls transform-octet and in-memory-stream-transformer only once, rather than 1000 times each.
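
Concretely, the write path could hoist the lookup out of the loop, something like this (same caveats as above: internal names, untested sketch):

(defmethod trivial-gray-streams:stream-write-sequence
    ((stream flexi-streams::vector-output-stream) sequence start end &key)
  ;; One in-memory-stream-transformer lookup per call, not one per octet.
  (let ((transformer (flexi-streams::in-memory-stream-transformer stream))
        (vector (flexi-streams::vector-stream-vector stream)))
    (loop for i from start below end
          for octet = (elt sequence i)
          do (vector-push-extend
              (if transformer (funcall transformer octet) octet)
              vector))
    (setf (flexi-streams::vector-stream-index stream) (fill-pointer vector))
    sequence))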

mishoo commented 11 months ago

Perhaps a better way would be to have separate stream classes that support transformers? It's a pity that an optional feature not everyone needs slows down basic operations for everyone.
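
Something like this, schematically (all names hypothetical, just to show the shape of the split):

;; Base class: no transformer slot at all, so the hot path stays lean.
(defclass plain-vector-output-stream
    (trivial-gray-streams:fundamental-binary-output-stream)
  ((buffer :initform (make-array 0 :element-type '(unsigned-byte 8)
                                   :adjustable t :fill-pointer 0)
           :reader stream-buffer)))

(defmethod trivial-gray-streams:stream-write-byte
    ((stream plain-vector-output-stream) byte)
  (vector-push-extend byte (stream-buffer stream))
  byte)

;; Opt-in subclass: only its users pay for the transformer funcall.
(defclass transforming-vector-output-stream (plain-vector-output-stream)
  ((transformer :initarg :transformer :reader stream-transformer)))

(defmethod trivial-gray-streams:stream-write-byte
    ((stream transforming-vector-output-stream) byte)
  (call-next-method stream (funcall (stream-transformer stream) byte)))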

For my particular case I'll probably ditch streams completely, and cook my own helpers to work with a binary sequence (adjustable, for output). But I suspect a lot of projects would benefit from improving flexi-streams; someone should really look into it.

mishoo commented 11 months ago

I think I'll rest my case here, and unfortunately it's bad news. I wrote my own memory stream. It's the simplest, dumbest code possible, and it is significantly faster than flexi's in-memory vector streams (write time: 1.660s vs 2.136s, read time: 2.253s vs 3.506s), but it's still nowhere near native stream performance (and the native stream even does actual I/O). Kinda defeats the purpose of using a memory stream.
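
The input side of it is essentially nothing more than this (a simplified sketch, not my actual code; names invented):

(defclass octet-input-stream
    (trivial-gray-streams:fundamental-binary-input-stream)
  ((data :initarg :data :reader stream-data)  ; the octet vector to read from
   (pos  :initform 0 :accessor stream-pos)))  ; current read position

(defmethod trivial-gray-streams:stream-read-byte ((stream octet-input-stream))
  (let ((data (stream-data stream))
        (pos  (stream-pos stream)))
    (if (< pos (length data))
        (progn (incf (stream-pos stream))
               (aref data pos))
        :eof))) ; Gray streams convention: return :eof at end of data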

I guess the sad, inescapable conclusion is that the slowness is inherent in the Gray Streams protocol itself (as @phoe suspected, generic dispatch overhead).

avodonosov commented 11 months ago

@mishoo, is this test really informative in a useful way?

Performance of some operation, like stream input or output, makes sense when compared with surrounding computation, or data processing. "What fraction of my application performance is spent in stream writing?" would be a more useful question, IMHO.

In this test there is nothing except invocations of the I/O functions. A single byte, or a small array passed by reference, travels through a series of function calls and is finally copied. Even compared to the function call frames themselves, this byte is a marginal amount of data.

It's like creating a dummy empty function with defun and a dummy empty method with defmethod, calling them millions of times, and concluding that a generic function call is 5 times slower than an ordinary function call. This tells nothing about how much my application's performance will change if I use the generic function.

> Kinda defeats the purpose of using a memory stream.

What is the purpose of a memory stream? If you mean some buffering, then the streams SBCL creates with open, with-open-file, etc., are already buffered.

If native streams were unbuffered, with every I/O operation resulting in a system call and a physical storage device access (kernel disk cache disabled), then accumulating data in a memory stream first would result in a huge speedup, I believe.

mishoo commented 11 months ago

@avodonosov Story time :) In short, I'm working on a binary encoding to replace JSON. I believe JSON is overkill for data interchange between programs — think about how much time is spent encoding it on one side, and decoding it on the other.

I'm just testing out ideas. I think I have some pretty good ones. But if I want the world to consider reading beyond the title of my spec, I need to prove that it is indeed faster than JSON libraries such as jsown. In trying to do that, I concluded that it's Simply Just Not Possible™ to even come close to its performance using any kind of stream layered on top of trivial-gray-streams for reading data from memory (and many times the payload will already be in memory), so I had to roll my own functions for reading from/writing to an in-memory sequence of bytes. The end.

Symbolics commented 11 months ago

Is extending simple-streams (assuming you're OK with SBCL) an option for you?


mishoo commented 11 months ago

@Symbolics yeah, SBCL is what I use, but I don't know about simple-streams. For now I'm happy with what I got — I won't be able to read/write directly using a Lisp stream, but it is already better than JSOWN.

I only filed this here because I initially thought FLEXI-STREAMS was the culprit for the slowness. I still believe it can be improved (e.g. by getting rid of the transformer), but a big part of it is due to dynamic dispatch, and that can't be fixed. :-\ Dunno, feel free to close the issue if you think it isn't worth the trouble.

BTW, this project isn't about Lisp. I hope this (or some similar) binary encoding becomes as ubiquitous as JSON, because we really need one. It just happens that CL is the language I like most to work with, so I'm writing my reference encoder/decoder in it; but I should write one in JS too, if I want anyone to take this seriously. :) Hope to publish something soon.

avodonosov commented 11 months ago

@mishoo, if jsown does not read from / write to a stream, doing so in your library probably wouldn't be a fair comparison. BTW, I noticed generic functions are used in jsown's code, so it's not free from generic dispatch either.

You of course know about MessagePack, protobufs, BSON and everything else?

Anyways, good luck with your approach.

avodonosov commented 11 months ago

An idea for how to use stream functions as the lowest common denominator for data access in a library, and still be fast.

If anyone knows a dispatch approach more performant than the Gray Streams protocol, they could create their own package, say cl-streams2, with functions like read-byte, write-byte, read-sequence, write-sequence.

By default those functions would delegate all work to their counterparts from the cl package.

But those functions could also use the more performant dispatch mechanism for pluggable stream implementations, invoking the custom implementation when called with a new custom stream type.

That way, code that performs all its I/O through the cl-streams2 package would work with standard CL streams, including Gray streams, and also with the new custom stream types that have minimal dispatch overhead.

And such a package could be a drop-in replacement for the standard cl stream functions; just use shadowing-import-from in the application's package definition to employ them.

The only doubt: if people repeat this trick many times, producing more and more packages like cl-streams3, my-stream, etc., then the libraries and applications utilizing those packages, while still working with standard CL streams, may bloat the ecosystem with too many custom-dispatch stream implementations.
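
A rough sketch of the shape I mean; the package name, the fast-stream struct, and its slots are all hypothetical:

(defpackage :cl-streams2
  (:use :cl)
  (:shadow #:read-byte #:write-byte)
  (:export #:read-byte #:write-byte #:fast-stream #:make-fast-stream))

(in-package :cl-streams2)

;; A custom stream is a plain struct holding closures, so "dispatch" is one
;; structure type test plus a funcall, with no CLOS involved on the hot path.
(defstruct fast-stream
  (read-fn  nil :type (or null function))
  (write-fn nil :type (or null function)))

(defun read-byte (stream &optional (eof-error-p t) eof-value)
  (if (fast-stream-p stream)
      (funcall (fast-stream-read-fn stream) eof-error-p eof-value)
      (cl:read-byte stream eof-error-p eof-value)))

(defun write-byte (byte stream)
  (if (fast-stream-p stream)
      (funcall (fast-stream-write-fn stream) byte)
      (cl:write-byte byte stream)))

Code that does (shadowing-import-from :cl-streams2 #:read-byte #:write-byte) then works unchanged with both standard CL streams (including Gray streams) and fast-streams.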

mishoo commented 11 months ago

> @mishoo, if jsown does not read from / write to a stream, doing so in your library probably wouldn't be a fair comparison. BTW, I noticed generic functions are used in jsown's code, so it's not free from generic dispatch either.

Indeed, jsown reads from a string and does various funny tricks to achieve that performance.

> You of course know about MessagePack, protobufs, BSON and everything else?

I only learned about CBOR yesterday :)) Looks like I've been living under a rock. It renders my efforts pretty moot, since CBOR is a standard. Still, it was fun, I learned things :) and my encoding might not be entirely without merit so I'll get to publish it eventually. Obligatory XKCD :) https://xkcd.com/927/

Back to the original issue, I also noticed this years ago (my site is served by Hunchentoot). I can't unlearn the fact that it could be improved like 4x… I'll try it myself sometime.