facil-io / cstl

facil.io C STL - a Server Toolbox Library for C, including JSON processing, hash maps, dynamic arrays, binary strings and more.
https://facil.io
MIT License

`examples/server` errors when run with more than one thread #17

Closed. michaellenaghan closed this issue 11 months ago.

michaellenaghan commented 1 year ago

Known issue?

Running this:

$ make examples/server
$ ./tmp/server -w 1 -t 2

and, in another session, this:

wrk -c 125 -d 30s -t 8 --latency http://localhost:3000

produces the errors shown below.

Known issue, or worth investigating?

===

$ ./tmp/server -w 1 -t 2
INFO:     
    Starting HTTP echo server example app.
    Engine: kqueue
    Workers: 1  (cluster mode)
    Threads: 1+2    (per worker)
    Press ^C to exit.
INFO:     (29772) spawning 1 workers.
INFO:     (29773) worker starting up.
INFO:     (29773) started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
INFO:     (29773) stopped listening @ 0.0.0.0:3000
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
FATAL:    (29773) lost connection with manager process, shutting down!
ERROR:    -1 leaks detected for fio
WARNING:  (fio_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fio_malloc) leaked block(s) for chunk 0x10a040100
ERROR:    (fio_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
ERROR:    5 leaks detected for fio_bstr_s
WARNING:  (fiobj_mem_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fiobj_mem_malloc) leaked block(s) for chunk 0x10ac40078
ERROR:    (fiobj_mem_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
WARNING:  abnormal worker exit detected
INFO:     (29775) worker starting up.
INFO:     (29775) started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
INFO:     (29775) stopped listening @ 0.0.0.0:3000
FATAL:    (29775) lost connection with manager process, shutting down!
ERROR:    2 leaks detected for fio
ERROR:    1 leaks detected for fio_stream_packet_s
WARNING:  (fio_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fio_malloc) leaked block(s) for chunk 0x10a040100
ERROR:    (fio_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
ERROR:    6 leaks detected for fio_bstr_s
WARNING:  (fiobj_mem_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fiobj_mem_malloc) leaked block(s) for chunk 0x10ac40078
ERROR:    (fiobj_mem_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
WARNING:  abnormal worker exit detected
INFO:     (29776) worker starting up.
INFO:     (29776) started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out
...
michaellenaghan commented 1 year ago

Stripping the echo code down to just this still produces errors:

...
#if HTTP_RESPONSE_ECHO /* write request to string to be sent as response */
  char *out = fio_bstr_write2(
      NULL,
      FIO_STRING_WRITE_STR2(" ", 1),
      FIO_STRING_WRITE_STR2(" ", 1),
      FIO_STRING_WRITE_STR2("\r\n", 2));
  fio_http_write(h,
                 .buf = out,
                 .len = fio_bstr_len(out),
                 .dealloc = (void (*)(void *))fio_bstr_free,
                 .copy = 0,
                 .finish = 1);
#else
...
michaellenaghan commented 1 year ago

Actually, this is enough to cause errors with more than one thread:

static void http_respond(fio_http_s *h) {
  fio_http_write(h, .buf = "Hello World!", .len = 12, .finish = 1);
}

I initially thought that non-echo didn't error, but that was wrong.

The problem is that I built non-echo like this:

make examples/server FLAGS="HTTP_RESPONSE_ECHO=0"

and that inadvertently overrode this line in the Makefile:

FLAGS:=FIO_LEAK_COUNTER

So to see the errors in non-echo you have to build it like this instead:

make examples/server FLAGS="FIO_LEAK_COUNTER HTTP_RESPONSE_ECHO=0"
boazsegev commented 1 year ago

Hi @michaellenaghan ,

Thanks for opening this issue!

Does this happen also on the pubsub_clustering branch (fixes a number of issues + adds encryption to pub/sub in preparation for clustering)?

I cannot reproduce on my Mac... May have already been fixed. I'll open Docker and switch branches to revisit the issue soon. I gotta catch a flight.

Cheers, B.

michaellenaghan commented 1 year ago

Does this happen also on the pubsub_clustering branch (fixes a number of issues + adds encryption to pub/sub in preparation for clustering)?

Yes, it does. Steps to reproduce:

$ make examples/server
<ctrl-c>
$ ./tmp/server -w 1 -t 2

and then, in another session:

$ wrk -c 125 -d 30s -t 6 --latency http://localhost:3000

You need concurrent load to trigger the errors.

===

As I've said many times, I'm very new to 0.8.0. Based on my still very, very limited understanding, this is what I thought when I realized that the error always happens in threaded http:

What I'd expect, if I went looking, was to find some defer calls in the http code. Maybe code that was written to be single-threaded became multi-threaded? Not sure.

michaellenaghan commented 1 year ago

Since you said you were having trouble reproducing this, here's more detail:

(pubsub_clustering) $  make clean

* Building single-file header: fio-stl.h
* Building documentation: fio-stl.md
* Detected the OpenSSL library, setting HAVE_OPENSSL
* Detected the zlib library, setting HAVE_ZLIB

(pubsub_clustering) $ make examples/server -n

* Building single-file header: fio-stl.h
* Building documentation: fio-stl.md
* Detected the OpenSSL library, setting HAVE_OPENSSL
* Detected the zlib library, setting HAVE_ZLIB
mkdir -p tmp tmp/tests tmp/examples  tmp/lib  tmp/src  tmp/tests 2> /dev/null
echo "* Set example flags (server)"
echo "* Compiling examples/server.c"
time cc -c examples/server.c -o tmp/examples/server.o -MT tmp/examples/server.o -MMD -MP -I/opt/local/libexec/openssl3/include -g -std=gnu11 -fpic -DFIO_LEAK_COUNTER -DNDEBUG -DNODEBUG -DHAVE_OPENSSL -DFIO_TLS_FOUND -DHAVE_ZLIB -Wshadow -Wall -Wextra -Wpedantic -Wno-missing-field-initializers -Wformat-security -I. -Ilib -Isrc -Itests  -O3
otool -dtVGX tmp/examples/server.o >> tmp/examples/server.s
echo "* Linking... (tmp/server)"
time cc -o tmp/server tmp/examples/server.o -O3 -L/opt/local/libexec/openssl3/lib -lssl -lcrypto -lpthread -lm -lz
echo "* Finished build (tmp/server)"
tmp/server

(pubsub_clustering) $ make examples/server

* Building single-file header: fio-stl.h
* Building documentation: fio-stl.md
* Detected the OpenSSL library, setting HAVE_OPENSSL
* Detected the zlib library, setting HAVE_ZLIB
* Set example flags (server)
* Compiling examples/server.c
        5.71 real         5.53 user         0.13 sys
* Linking... (tmp/server)
        0.07 real         0.10 user         0.03 sys
* Finished build (tmp/server)
INFO:     
    Starting HTTP echo server example app.
    Engine: kqueue
    Workers: 0  (single process)
    Threads: 1+0    (per worker)
    Press ^C to exit.
INFO:     16368 started listening @ 0.0.0.0:3000
<ctrl+c>
INFO:     Shutdown complete.
INFO:     16368 stopped listening @ 0.0.0.0:3000

(pubsub_clustering) $  ./tmp/server -w 1 -t 2

INFO:     
    Starting HTTP echo server example app.
    Engine: kqueue
    Workers: 1  (cluster mode)
    Threads: 1+2    (per worker)
    Press ^C to exit.
INFO:     16382 spawning 1 workers.
INFO:     16383 worker starting up.
INFO:     16383 started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
INFO:     16383 stopped listening @ 0.0.0.0:3000
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
ERROR:    1 leaks detected for fio___pubsub_message_parser_s
WARNING:  abnormal worker exit detected
INFO:     16385 worker starting up.
INFO:     16385 started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
ERROR:    1 leaks detected for fio___pubsub_message_parser_s
INFO:     16385 stopped listening @ 0.0.0.0:3000
ERROR:    -2 leaks detected for fio
WARNING:  (fio_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fio_malloc) leaked block(s) for chunk 0x10e840100
ERROR:    (fio_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
ERROR:    5 leaks detected for fio_bstr_s
WARNING:  (fiobj_mem_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fiobj_mem_malloc) leaked block(s) for chunk 0x10f440078
ERROR:    (fiobj_mem_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
WARNING:  abnormal worker exit detected
INFO:     16386 worker starting up.
INFO:     16386 started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
ERROR:    1 leaks detected for fio___pubsub_message_parser_s
INFO:     16386 stopped listening @ 0.0.0.0:3000
WARNING:  `write` called after `close` was called for IO.
ERROR:    1 leaks detected for fio
ERROR:    1 leaks detected for fio_stream_packet_s
WARNING:  (fio_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fio_malloc) leaked block(s) for chunk 0x10e840100
ERROR:    (fio_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
ERROR:    6 leaks detected for fio_bstr_s
WARNING:  (fiobj_mem_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fiobj_mem_malloc) leaked block(s) for chunk 0x10f440078
ERROR:    (fiobj_mem_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
WARNING:  abnormal worker exit detected
INFO:     16387 worker starting up.
INFO:     16387 started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
INFO:     16387 stopped listening @ 0.0.0.0:3000
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
ERROR:    1 leaks detected for fio___pubsub_message_parser_s
ERROR:    1 leaks detected for fio
ERROR:    1 leaks detected for fio_stream_packet_s
WARNING:  (fio_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fio_malloc) leaked block(s) for chunk 0x10e840100
ERROR:    (fio_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
ERROR:    6 leaks detected for fio_bstr_s
WARNING:  (fiobj_mem_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fiobj_mem_malloc) leaked block(s) for chunk 0x10f440078
ERROR:    (fiobj_mem_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
WARNING:  abnormal worker exit detected
INFO:     16388 worker starting up.
INFO:     16388 started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
ERROR:    1 leaks detected for fio___pubsub_message_parser_s
INFO:     16388 stopped listening @ 0.0.0.0:3000
ERROR:    1 leaks detected for fio
WARNING:  (fio_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fio_malloc) leaked block(s) for chunk 0x10e840100
ERROR:    (fio_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
ERROR:    3 leaks detected for fio_bstr_s
WARNING:  (fiobj_mem_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fiobj_mem_malloc) leaked block(s) for chunk 0x10f440078
ERROR:    (fiobj_mem_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
WARNING:  abnormal worker exit detected
INFO:     16389 worker starting up.
INFO:     16389 started listening @ 0.0.0.0:3000
FATAL:    fiobj_mem_free called more than fiobj_mem_malloc
FATAL:         errno(60): Operation timed out

ERROR:    120 leaks detected for fio___http_connection
ERROR:    1 leaks detected for fio___http_protocol
INFO:     16389 stopped listening @ 0.0.0.0:3000
ERROR:    1 leaks detected for fio_http
ERROR:    2 leaks detected for fio___http_hmap_destroy
ERROR:    1 leaks detected for fio___pubsub_message_parser_s
ERROR:    3 leaks detected for fio
WARNING:  (fio_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fio_malloc) leaked block(s) for chunk 0x10e840100
ERROR:    (fio_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
ERROR:    1 leaks detected for fio_string_default_allocations
ERROR:    6 leaks detected for fio_bstr_s
WARNING:  (fiobj_mem_malloc) blocks left after cleanup - memory leaks?
WARNING:  (fiobj_mem_malloc) leaked block(s) for chunk 0x10f440078
ERROR:    (fiobj_mem_malloc):
          Total memory chunks allocated after cleanup (POSSIBLE LEAKS): 1
WARNING:  abnormal worker exit detected
<ctrl+c>
INFO:     Shutdown complete.
INFO:     16382 stopped listening @ 0.0.0.0:3000

I used -n above to show the exact config that's being compiled. I'll try some other variations — e.g., no SSL — to see if that makes a difference.

Also:

(pubsub_clustering) $ cc --version

Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: x86_64-apple-darwin22.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

P.S. As I said before, you have to put the server under concurrent load to see the errors.

michaellenaghan commented 1 year ago

I get the same errors with:

$ make examples/server FIO_NO_TLS=1

which is, of course, not surprising.

Here's a less aggressive load:

$ wrk -c 25 -d 10s http://localhost:3000

It still produces errors. (If you look at the log output you'll see that the worker keeps crashing, and the master keeps re-spawning.)

Any other thoughts on what might be different?

michaellenaghan commented 1 year ago

Ah, some other details: I'm on an Intel Mac, and running macOS 13.5.2.

boazsegev commented 12 months ago

Update:

I managed to reproduce the issue (only on Linux, probably because of differences in the thread scheduler between Linux and MacOS).

Apparently the issue is in the memory allocator's double-free testing logic. It uses atomics to set new values, but then reads values from the atomic variables in a non-atomic way, resulting in a race condition when testing if free was called more times than malloc/realloc...

When bypassing the double-free test, the counters balance at the end of the run.

I'm still looking into this, but this is what I found so far.

FYI:

What I'd expect, if I went looking, was to find some defer calls in the http code. Maybe code that was written to be single-threaded became multi-threaded?

Exactly.

Even though the design is different and Server tasks all run on a single thread, the API used by both the FIO_SERVER and FIO_HTTP modules uses fio_srv_defer internally, avoiding race conditions with server-bound data. Where such internal magic couldn't be implemented (such as setting / reading fio_env data), locks were placed to make the API thread safe.

This will eventually be better detailed in the documentation, but for now it's possible to see the comments in the server API documentation, and the comments in some sections / functions.

michaellenaghan commented 12 months ago

OK, it sounds like you're talking about the code in fio___mem_block_free___()?

This may be a silly question, but the Makefile calls for c11. Have you considered using C11 atomics? At a higher level, they shift some responsibility to the compiler — e.g., all uses of an atomic var are automatically atomic. At a lower level they give you control over atomic semantics when more than one atomic var is involved, which seems like it might be the case in that routine. (I'm not sure, I only took a quick look.)

boazsegev commented 12 months ago

I did, but sometimes I had to compile this toolset where C11 wasn't available. Having it still work adds some joy. This is why I preferred compiler builtins over the C11 atomics (though, under the hood they route to the same compiler code).

As for the race condition, it would exist even when using C11 atomics: the values of the atomic variables can change in the gap between when each value is read and when the if statement tests them.

Anyway, the global leak detection macro helpers treat the atomic variables correctly (storing a temporary copy after the atomic operation). I'll switch to these macros and things should work fine.
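The pattern, roughly speaking (a simplified sketch for illustration, not the actual facil.io allocator code):

/* simplified sketch of the race being described - NOT the actual facil.io
 * allocator code, just an illustration of the pattern */
#include <stdio.h>
#include <stdlib.h>

static size_t malloc_calls; /* incremented atomically by the malloc hook */
static size_t free_calls;   /* incremented atomically by the free hook   */

/* racy test: both counters are updated atomically, but the comparison uses
 * two separate plain loads taken at different moments, so it can observe an
 * inconsistent snapshot and fire even though every free had a matching
 * malloc */
static void on_free_racy(void) {
  __atomic_add_fetch(&free_calls, 1, __ATOMIC_SEQ_CST);
  if (free_calls > malloc_calls) {
    fprintf(stderr, "FATAL: free called more than malloc\n");
    abort();
  }
}

/* safer pattern: test a temporary copy produced by the atomic operation
 * itself (here, a single balance counter: ++ on malloc, -- on free) */
static long balance;
static void on_free_fixed(void) {
  long remaining = __atomic_sub_fetch(&balance, 1, __ATOMIC_SEQ_CST);
  if (remaining < 0) {
    fprintf(stderr, "FATAL: free called more than malloc\n");
    abort();
  }
}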

boazsegev commented 12 months ago

I'm pushing an update to the pubsub_clustering branch. Please let me know if this resolves the issue.

michaellenaghan commented 12 months ago

No more errors!

Just to confirm my impression: the library itself doesn't really take advantage of threads, correct? Rather, threads create a work queue, and it's up to you to write code that uses the queue.

(In quick performance testing threads don't seem to have much impact. I think that's consistent with the model the docs describe?)

===

For the use case I have in mind, it would be helpful to have state-based "on startup thread" and "on shutdown thread" callbacks, with access to thread-specific udata.

In the absence of library support I'll use thread-local storage. The problem is that the thread-local storage will be a leaky abstraction; I'll need to know how the library implements threads on each platform in order to use the corresponding thread-local mechanism — for example, should I or should I not use pthreads on Windows?

Also, in the absence of library support I'll need to check thread-local storage on each call to see if it's been initialized, and initialize if it hasn't been.
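For example, a rough sketch of the fallback I'm describing, assuming the library's worker threads are pthreads on this platform (which is exactly the assumption that makes it leaky):

/* rough sketch of a lazily-initialized thread-local fallback - assumes the
 * worker threads are pthreads, which is the leaky part of the abstraction */
#include <pthread.h>
#include <stdlib.h>

typedef struct {
  void *interpreter; /* per-thread state, e.g. a Lua interpreter */
} per_thread_s;

static pthread_key_t TLS_KEY;
static pthread_once_t TLS_ONCE = PTHREAD_ONCE_INIT;

/* runs at thread exit, if (and only if) the platform honors destructors */
static void per_thread_free(void *p) { free(p); }

static void tls_key_create(void) {
  pthread_key_create(&TLS_KEY, per_thread_free);
}

/* every callback pays for this check, since there's no "on thread start"
 * hook in which to do the initialization exactly once */
static per_thread_s *per_thread_get(void) {
  pthread_once(&TLS_ONCE, tls_key_create);
  per_thread_s *p = pthread_getspecific(TLS_KEY);
  if (!p) {
    p = calloc(1, sizeof(*p));
    pthread_setspecific(TLS_KEY, p);
  }
  return p;
}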

Neither of those things is an actual problem.*** I'm just explaining in case you think the use case is interesting, and should be directly supported by the library.

Having said all that: I'm not sure that what I originally had in mind re: thread usage is a good fit for this particular library. It might have been for 0.7, but 0.8 is a rather different beast, and I'm still wrapping my head around it.

And, just to be clear: I have work to do before I even get to seriously thinking about whether or not to use threads for what I have in mind. At this point these are just musings.

===

***Well, "thread shutdown" may be a problem. I don't know if all platforms support destructors for thread-local storage. If they don't, then in the absence of a library-supported shutdown callback I risk leaking memory.

boazsegev commented 12 months ago

Hi @michaellenaghan ,

Use case?

I'm not sure I understand your use-case.

It sounds like you are describing the requirements of one possible solution rather than describing the use case itself.

Maybe there's a better way to utilize the new model and fit it to your needs (for example, using the thread-safe fio_env_set/fio_env_get with the IO object).

Could you please describe the use case itself?

Threads in HTTP

As for the threads - I haven't tested the performance implications of the new design quite yet, but the HTTP callback runs in the async threads.

There's of course the issue that the context switching overhead is probably as large as (if not larger than) the actual work that's performed by the HTTP callback in the example.

Also, because the example's work is so light, there's a lot of stress on the queue's lock, making things run slower rather than faster. In an actual app the stress on the queue should be lessened by less uniform task timing.

boazsegev commented 12 months ago

FYI:

You could use fio_env_set with a NULL IO object (the global environment) and use the value returned by fio_thread_current as the numerical type filter (which I should probably rename to something that isn't a possible keyword in other languages)... this can provide an alternative to thread-local storage.

I'm also pretty sure it's possible to add hooks to the threads starting / finishing in the fio_srv_async_s or fio_queue_s objects.

michaellenaghan commented 12 months ago

As for the threads - I haven't tested the performance implications of the new design quite yet, but the HTTP callback runs in the async threads.

In that case, my mental model was wrong.

Here are some numbers from 0.7 and 0.8, using wrk -c 125 -d 30s -t 4 --latency http://localhost:3000 in all tests.

0.7

0.8

As you said, you haven't tested performance yet, so this is just "possibly interesting." But, for what it's worth, I can see exactly what the numbers are showing me in the CPU performance graph: no matter how many threads I assign, 0.8 seems to use 1 core. (I don't mean that it actually uses one core, I mean that the line on the graph doesn't go higher than the equivalent of ~1 core.) So something is rate-limiting overall throughput. (0.7 has a more classic look: performance tails off as you saturate all cores.)

It sounds like you are describing the requirements of one possible solution rather than describing the use case itself.

Could you please describe the use case itself?

My goal, in part, is to create a reasonably high performance scripted web server, using Lua as the scripting language. I've experimented with that idea in Civetweb (https://github.com/civetweb/civetweb) and a Civetweb offshoot, libHTTP. They're fundamentally multi-threaded (and specifically not multi-process) web servers. In those servers I use one interpreter per thread, in an essentially shared-nothing model. It's not efficient, but it's certainly simple!

facil.io opens up a different not-efficient-but-certainly-simple model, at least on non-Windows platforms: multiple processes.

Multiple processes might be "good enough." I'm not sure yet. But in 0.7 I was gearing up to test the idea of supporting both multiple processes and multiple threads. I only just started to pull the test together when I stumbled across 0.8. I was initially thinking of shooting for the same thing in 0.8, multiple processes and threads — but at this point I'm not sure that it makes as much sense as it did in 0.7. My impression — possibly wrong, I'm less than a week into this! — is that 0.7 had a somewhat more shared-nothing threading approach than 0.8. 0.8 seems a bit more like libuv, expecting all I/O to happen on one thread and "potentially blocking work" to get handed to a thread pool. (In my case, the primary "potentially blocking work" is calls to SQLite.)

The thread pool idea works as long as there's a simple request-response cycle. In other words: as long as I can call into Lua with a request and get a complete response in one shot. If there's any "chatter" between facil and Lua — specifically, chatter that does additional I/O, causing facil to then call back into Lua — the thread pool approach gets… complicated.

(Just to be extra clear: Civetweb creates a lot of threads by default, and I create one interpreter per thread. "Chatter" between Civetweb and Lua doesn't matter as much because the entire conversation with the client involves exactly one interpreter, and there are an awful lot of interpreters. It's inefficient in terms of resources — if there's no I/O latency you just end up thrashing, and getting less overall throughput than you would have with far fewer threads — but it's very simple in terms of mental model.)

Maybe there's a better way to utilize the new model and fit it to your needs (for example, such as using the thread-safe fio_env_set/fio_env_get with the IO object).

I'll have to read up on that.

boazsegev commented 12 months ago

The use case you describe was one of the reasons I decided to switch 0.8.x to a single threaded IO model, while other actions may be performed in other threads.

The issue, for me, was that database calls and other blocking calls could block all threads, leaving the IO hanging.

So although it seems that there's a performance loss, the new model should provide stability (existing responses that are awaiting IO will not hang mid-transport due to a lack of an available thread).

If I understand you correctly, the following options fit your needs:

  1. Design your Lua framework similar to Node.js - single threaded and evented, breaking database calls to other threads and running a callback when they finish;
  2. One root Lua state per process and one Lua coroutine (or thread) per actual thread;

This would create a shared global state in the Lua interpreter for the different requests, but that's only a problem when using threads... which I understand to be your preference?

It's true that thread start-stop hooks could be used for this approach, but also a Lua state pool approach could be used, where you have a cache storage of possible Lua states and these get "loaned" to a routine and then returned to the pool. This disentangles the C threads from a Lua state, but behaves the same as far as the user is concerned. It also has the advantage that the pool can grow dynamically as the need may arise, allowing you to dynamically control the number of threads being used.

This is similar to how database connections are sometimes managed (when using blocking calls rather than events), for example by Ruby on Rails, where they have a database connection pool.

michaellenaghan commented 12 months ago

There isn't really performance loss. The results are consistent with what the documentation describes. My mental model was a little off. Now I understand my mistake.

In 0.8, all I/O takes place in one thread. So, assuming CPU-bound code, "no threads" represents the best-case scenario. Adding threads can only reduce performance. Again: assuming CPU-bound code. Once you introduce latency, the picture changes.

None of that is new to you. I'm just correcting my mental model. :-)

Design your Lua framework similar to Node.js - single threaded and evented, breaking database calls to other threads and running a callback when they finish;

That isn't really an option. Node was built that way from the start, so all libraries were built to take that into account. One solo developer can't reasonably expect to turn synchronous Lua libraries into asynchronous Lua libraries on a large scale.

One root Lua state per process and one Lua coroutine (or thread) per actual thread;

I think that probably is also not an option. Lua's coroutines are excellent — as long as you're only dealing with Lua code. Any cfuncs in the call stack have to manually create their own continuations at all potential suspension points (http://www.lua.org/manual/5.4/manual.html#4.5). Not impossible, but far from trivial — again, especially for a solo developer.
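Just to show what "manually create their own continuations" means in practice, here's a minimal Lua 5.4 sketch (db_query_start and db_query_result are hypothetical stand-ins for real async database code):

#include <lua.h>
#include <lauxlib.h>

/* hypothetical stand-ins for real async database code */
static void db_query_start(const char *sql) { (void)sql; /* ... */ }
static int db_query_result(lua_State *L) { lua_pushnil(L); return 1; }

/* the continuation: runs when the coroutine is resumed after the yield */
static int l_db_query_k(lua_State *L, int status, lua_KContext ctx) {
  (void)status;
  (void)ctx;
  return db_query_result(L); /* finish the call that was split in two */
}

/* the cfunc itself: everything after the suspension point has to live in
 * l_db_query_k, because this C stack frame is gone by the time the
 * coroutine is resumed */
static int l_db_query(lua_State *L) {
  db_query_start(luaL_checkstring(L, 1));
  return lua_yieldk(L, 0, 0, l_db_query_k);
}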

luv (https://github.com/luvit/luv) is a Lua wrapper around libuv, and luvit (https://github.com/luvit/luvit) is a web server written on top of luv: "Node.JS for the Lua Inventor." Performance is just OK — let's say ~12,500 RPS in my testing. But luvit assumes LuaJIT, and holy cow, LuaJIT (like many JITs) can really eat memory: getting those ~12,500 RPS consumes 2.5 GB of memory!

By contrast, my libHTTP approach — with an interpreter per thread — does ~85,000 RPS using ~30 MB of memory. Night and day.

That performance is more than good enough for what I have in mind. My concern is the quality of design and code. That's what made me keep looking for other options. The question is: is it better to build on top of a somewhat shaky foundation, and slowly improve it over time, or switch to something with a far stronger foundation? I'm hoping to answer that question by the end of this week.

It's true that thread start-stop hooks could be used for this approach, but also a Lua state pool approach could be used, where you have a cache storage of possible Lua states and these get "loaned" to a routine and then returned to the pool. This disentangles the C threads from a Lua state, but behaves the same as far as the user is concerned. It also has the advantage that the pool can grow dynamically as the need may arise, allowing you to dynamically control the number of threads being used.

Yes. At one point I pondered building directly on top of libuv using exactly that kind of approach. The problem was — I don't think this will shock you :-) — building a web server is a major project in itself. Now I think I have a vague idea of how to apply that approach to facil.

Tomorrow, just to close out this conversation, I'll explain the underlying motivation for all of this. Have to go now though.

boazsegev commented 12 months ago

Sadly, I know too little of Lua to help, but just to demonstrate my suggested approach:

/* a non-thread safe Lua state cache, usable from the main thread */
/* (assumes the facil.io STL header was already included above with the
 *  server / HTTP modules enabled, so FIO_INCLUDE_FILE is defined) */
#include <lua.h>
#include <lauxlib.h>

#define FIO_ARRAY_NAME my_lua_state_pool
#define FIO_ARRAY_TYPE lua_State *
#include FIO_INCLUDE_FILE

/* the global state variables */

static lua_State *ROOT_LUA_STATE;
static my_lua_state_pool_s LUA_STATE_POOL;
static fio_srv_async_s HTTP_THREAD_POOL;

/* initializing the Lua state, adding functions, loading scripts, etc' */

static void http_lua_root_state_init(void) {
  /* initialize root state (luaL_newstate provides a default allocator;
   * lua_newstate could be used to plug in a custom allocator instead)... */
  ROOT_LUA_STATE = luaL_newstate();
  FIO_ASSERT_ALLOC(ROOT_LUA_STATE);
  fio_state_callback_add(FIO_CALL_AT_EXIT, (void (*)(void *))lua_close,
                         ROOT_LUA_STATE);
}

/* pre/post HTTP handling */

static lua_State *http_lua_get_state(void) {
  lua_State *r = NULL;
  if (!my_lua_state_pool_pop(&LUA_STATE_POOL, &r) && r) {
    return r;
  }
  /* note: a fuller implementation should anchor the new coroutine state
   * (e.g., via luaL_ref) so the root state's GC can't collect it while
   * it's loaned out */
  r = lua_newthread(ROOT_LUA_STATE);
  FIO_ASSERT_ALLOC(r);
  return r;
}

static void lua_http_after_response_finished(void *s_, void *h_) {
  lua_State *s = (lua_State *)s_;
  fio_http_free(h_);
  if (!s)
    return;
  my_lua_state_pool_push(&LUA_STATE_POOL, s);
}

/* state pool cleanup */

static void http_lua_state_pool_destroy(void *ignr_) {
  /* states created with lua_newthread belong to ROOT_LUA_STATE and are
   * reclaimed by its garbage collector once unreferenced; calling lua_close
   * on one of them would actually close the whole root state, so the pool
   * only drops its references here */
  my_lua_state_pool_destroy(&LUA_STATE_POOL);
  (void)ignr_;
}

/* example Lua HTTP handler */
static void http_lua_respond(void *lua_state_, void *h_) {
  lua_State *s = (lua_State *)lua_state_;
  fio_http_s *h = (fio_http_s *)h_;
  /* TODO: move `h` to Lua state */
  /* TODO: run Lua script within `s` */
  fio_srv_defer(lua_http_after_response_finished, s, h);
}

/* example HTTP handler */
static void http_respond(fio_http_s *h) { /* in main thread */
  lua_State *s = http_lua_get_state();
  fio_srv_async(&HTTP_THREAD_POOL, http_lua_respond, (void *)s,
                (void *)fio_http_dup(h));
}

/* example HTTP listening logic - note that we do NOT use the thread pool yet */
static void start_my_example(void) {
  http_lua_root_state_init();
  fio_http_listen(NULL, .on_http = http_respond);
  fio_srv_async_init(&HTTP_THREAD_POOL, 5);
  /* MAYBE? reset the state pool when server is idle, start fresh. */
  fio_state_callback_add(FIO_CALL_ON_IDLE, http_lua_state_pool_destroy, NULL);
  fio_srv_start(3);
  http_lua_state_pool_destroy(NULL);
  /* ROOT_LUA_STATE is closed by the FIO_CALL_AT_EXIT callback registered in
   * http_lua_root_state_init(), so it isn't closed (a second time) here */
  ROOT_LUA_STATE = NULL;
}

Note that in this approach we initialize a Thread Pool, but we do not link the thread pool with the HTTP listening port (unlike the example code), so the HTTP handler starts out in the main server thread.

We use the thread pool only after collecting the new Lua state.

This might be a little simplistic and I may be out of my depth, but it should offer support for both forking and threads without using too much memory (I think)... performing cleanup on idle might be too often, but it's just a thought.

Good Luck.

michaellenaghan commented 11 months ago

A question…

There's a single, global sync queue in srv. Why isn't there also a single, global async queue in srv?

I'm having trouble coming up with a scenario where multiple async queues would make sense, since they'd all be competing (inefficiently) for the same CPU cores.

Am I missing something, some interesting possibility?

Now, it's possible to do this with the existing API; just create a single async queue and pass it to all relevant listeners. A single, global async queue would make the API a little simpler though. More importantly, it would lead to what I think is probably correct API usage. (Right now it's pretty easy to think that you should create an async queue per async listener. That would work, but I think it would be "suboptimal"? Unless I'm missing something. Thus the question.)

michaellenaghan commented 11 months ago

I just realized that the context of my comment might not be clear. :-)

I was thinking about this:

Note that in this approach we initialize a Thread Pool, but we do not link the thread pool with the HTTP listening port (unlike the example code), so the HTTP handler starts out in the main server thread.

It made me wonder why the example code creates an async queue and attaches it to the listener. And it made me think that anyone following that code would probably create an async queue for each listener, etc.

Having multiple listeners makes sense. For example, one might be http, and the other https.

Having some listeners be sync and others be async also makes sense. For example, an http listener that just redirects to https should be sync, since it's CPU bound and an async redirect would just waste CPU.

But even there: the example server kind of implies that a listener is always sync or async, and never both. The code you gave above is better, I think. You could imagine, for example, deciding on sync vs async based on the URL.

So that kind of comes back to my point: why not have a single, global async queue that's owned by srv? Why would anyone ever want more than one async queue?

boazsegev commented 11 months ago

Hi @michaellenaghan ,

I'm having trouble coming up with a scenario where multiple async queues would make sense, since they'd all be competing (inefficiently) for the same CPU cores.

I just wanted to leave things more flexible - for whatever use cases the future might bring.

The disentanglement provides opportunities for distinguishing between fast routes, blocking (IO bound) routes and task heavy (CPU bound) routes.

CPU bound threads might compete with each other, but IO bound threads less so... Maybe one thread pool is bound to one database and another thread pool to another...

You already pointed out the option of having an HTTP router routing database dependent requests to one queue and CPU heavier requests to another (and perhaps CPU light requests to the global async queue).

For example, the static file service (currently) runs off the global sync queue, so if the async queue is busy you will still be served static content. I figured similar things might happen with other services.

I hope this answers the question.

michaellenaghan commented 11 months ago

Thank you very much for your sample code. I hope to have something running by tomorrow, and it will lean heavily on your ideas.

There are three things that I particularly like:

I already had the first two points in mind, but the implementation I had in mind was both more complicated and less performant.

And the third point had occurred to me on a per-listener basis — but a per-request basis is, really, so much better. It's very easy to come up with good use cases.

===

There's one key way in which what I'm doing differs from your sample code. Your code assumed Lua-within-facil. What I'm actually doing is facil-within-Lua. Basically, I'm turning facil into a Lua package. Lua drives facil, rather than facil driving Lua.

Here's what I think the Lua code will look like:


lf = require "lfacil"

num_workers = 4
num_threads = 4

async_queue = lf.fio_srv_async_init(num_threads, {
    on_start = function()
        -- per-thread initialization goes here, e.g.:
        lf = require "lfacil"
    end,
    on_http = function(h)
        lf.fio_http_write(h, "Hello (Async) World!");
    end,
    on_finish = function()
        -- per-thread de-initialization goes here
    end,
})

lf.fio_http_listen("http://127.0.0.1:8088", {
    on_http = function(h)
        if lf.fio_http_path(h) == "/" then
            lf.fio_http_write(h, "Hello (Sync) World!");
        else
            async_queue.on_http(h)
        end
    end,
})

lf.fio_srv_start(num_workers)

You can see that it fairly closely mirrors your code, except that it's Lua rather than C.

boazsegev commented 11 months ago

FYI: added worker thread start / end fio_state_callback hooks.

michaellenaghan commented 11 months ago

The sync callback does ~175,000 RPS on my laptop (with three workers). By contrast, I got ~85,000 RPS with libHTTP and ~55,000 RPS with Civetweb.

The async callback, however, crashes. That's both Debug and Release builds, and both the main and pubsub_clustering branches.

It only crashes after the server has been under heavy load for a few moments. So far it looks like a classic multi-threading bug.

===

OK, by running with 0 workers under the debugger I was able to catch the point of failure:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x380201441a)
    frame #0: 0x0000000100014764 cheia`fio___task_ring_push(r=0x0000003802014416, task=fio_queue_task_s @ 0x00007ff7bfefc540) at fio-stl.h:15735:7
   15732    
   15733    FIO_IFUNC int fio___task_ring_push(fio___task_ring_s *r,
   15734                                       fio_queue_task_s task) {
-> 15735      if (r->dir && r->r == r->w)
   15736        return -1;
   15737      r->buf[r->w] = task;
   15738      ++(r->w);

It's pointing to this line:

https://github.com/facil-io/cstl/blob/05c74e0987c280f8b93ff95d2273d0e43c812563/fio-stl.h#L15735

fio___task_ring_push was called here:

https://github.com/facil-io/cstl/blob/05c74e0987c280f8b93ff95d2273d0e43c812563/fio-stl.h#L15780

I'll keep poking around for a bit, but I have to stop soon.

michaellenaghan commented 11 months ago

I forgot to mention one thing: I'm running with 0 workers and 1 thread. So the threading error is between the main thread and the worker thread, not between worker threads.

michaellenaghan commented 11 months ago

Hmmm. If I define FIO_USE_THREAD_MUTEX I get different behaviour; sometimes this:

ERROR:    Couldn't lock mutex @ /Users/.../fio-stl.h:15779 - error (35): Resource temporarily unavailable
fish: Job 1, './bld/debug/opt/cheia/cheia tmp…' terminated by signal SIGSEGV (Address boundary error)

and sometimes this:

ERROR:    Couldn't lock mutex @ /Users/.../fio-stl.h:15779 - error (0): Undefined error: 0
ERROR:    Couldn't release mutex @ /Users/.../fio-stl.h:15806 - error (0): Undefined error: 0
boazsegev commented 11 months ago

I found one thing that could be related - a missing unlock... but that should only happen if it was impossible to spawn a new thread, a situation which I doubt... so I guess I have yet to find the cause for this new issue.

michaellenaghan commented 11 months ago

OK, it's working; there was one small error in facil and one big error in my code.

With 3 workers and one async thread per worker, I'm getting ~157,000 RPS — an excellent result that definitely exceeded my expectations.

(The numbers mean nothing in absolute terms; they're just relative to other tests that I've tried to run under similar circumstances.)

The workers are using more memory than I expected — between 10 and 35 MB at the end of the test. Those aren't big numbers, but bigger than I expected. I'll look into that at some point.

===

I'll send a pull request for the small error in facil, but I have to ask: is there any chance that pubsub_clustering will get merged into main soonish? At this point different fixes are going into different branches.

===

Having looked at the async queue code, a question:

There's no back pressure. If the arrival rate exceeds the processing rate the async queue will grow without bound.

Here's one possible approach:

All of that can happen without touching facil code. (Well, with one exception: the library would have to provide thread-safe access to the queue count. Right now I can get it, but only by reaching into the implementation.) But maybe facil should work that way out of the box?

michaellenaghan commented 11 months ago

I did stress testing, using different configurations and different tools. No errors, no issues. (The "unexpected memory usage" was due to a consistent cache miss. Now memory usage is exactly what I expected.)

Shouldn't the async queue be bounded?

I'll turn that into a separate issue and close this one.

Thanks for the help!

boazsegev commented 11 months ago

With 3 workers and one async thread per worker, I'm getting ~157,000 RPS

Sounds great congrats 🎉

The workers are using more memory than I expected — between 10 and 35 MB at the end of the test...

Are you using the facil.io memory allocators there?

If so, each allocator may cache quite a lot of memory depending on how you set it up. The design is supposed to separate long term allocations from short term allocations, minimizing memory fragmentation - but I haven't optimized it all that much just yet.

Each allocation group will cache about 8 MB × 8 cache slots and use about 256 KB for each of its 26 arenas ... that's about 70 MB per allocator, or 140 MB for both allocators (that's, of course, once you saturate the allocators and they are in full use).
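Spelling that arithmetic out:

8 MB  ×  8 cache slots  =  64 MB
256 KB × 26 arenas      ≈  6.5 MB
per allocator           ≈ 70.5 MB  (about 70 MB)
× 2 allocators          ≈  141 MB  (about 140 MB)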

You can try using the system's memory allocators, or jemalloc, but I think optimizing the facil.io allocators should give better results.

Shouldn't the async queue be bounded?

I don't think it's safe to have the async queue add back-pressure.

IMHO, placing pressure on the async queue should be allowed, and security concerns about this issue should be handled upstream.

For example, HTTP connections allow only a single HTTP request to be handled at a time (per client), preventing any client from request-flooding. The WebSocket logic also pauses and resumes the incoming IO in an attempt to prevent a fast client from flooding the server.

This doesn't guarantee protection against queue flooding, as that also depends on the rest of the code - but it does prevent request/message flooding.

IMHO, the most likely outcome of a bounded queue would be that unscheduled tasks get performed immediately, which might cause some tasks to run on the wrong thread or create deadlocks. Having a memory-hungrier queue (or having the worker process crash) seems much safer.

boazsegev commented 11 months ago

FYI: the pubsub_clustering branch was merged and ... keep me posted about the error you found in the facil.io code base.

michaellenaghan commented 11 months ago

Probably easier to just show you:

https://github.com/facil-io/cstl/blob/c4d5a426c62d38897e7412373857fddbfe55172d/fio-stl/102%20queue.h#L469-L499

Line 493 should be moved to just before line 498. The unsynchronized access to q->mem could cause the wrong decision to be made about freeing the task ring.

In my opinion lines 472 and 473 should also be deleted — or q->count should become atomic. The unsynchronized access could cause the function to return a "NULL task" when there are in fact tasks.
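To illustrate the kind of race I mean, here's a paraphrased, self-contained sketch of the pattern (not the actual fio-stl code; the real field names and ring-buffer logic differ):

#include <pthread.h>
#include <stddef.h>

typedef struct {
  pthread_mutex_t lock; /* assume it was initialized with pthread_mutex_init */
  size_t count;         /* number of queued tasks */
  void *mem;            /* stand-in for the spilled ring-buffer segment */
} sketch_queue_s;

static void *sketch_queue_pop(sketch_queue_s *q) {
  void *task = NULL;
  /* unsynchronized read: a concurrent push may be adding a task right now,
   * so this early return can report "empty" while tasks actually exist */
  if (!q->count)
    return NULL;
  pthread_mutex_lock(&q->lock);
  if (q->count) {
    task = q->mem; /* stand-in for "take the next task from the ring" */
    --q->count;
  }
  /* any decision about freeing the ring segment (q->mem) also has to be
   * made here, while the lock is still held; moving it after the unlock
   * lets two threads act on a stale view of the same segment */
  pthread_mutex_unlock(&q->lock);
  return task;
}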