facebookarchive / libphenom

An eventing framework for building high performance and high scalability systems in C.
http://facebook.github.io/libphenom
Apache License 2.0

Proper way to close a connection from the server side? #68

Closed: tobz closed this 10 years ago

tobz commented 10 years ago

Hey y'all!

I've been trying to build a grossly oversimplified static web server using libphenom. I basically took the echo server example as a starting point and began modifying it. My code is here: https://gist.github.com/anonymous/5f51f56b87a2f6844d03

I'm hitting a problem where, once I've finished copying the file to the socket stream, I want to disconnect immediately. Right now, my teardown code is borrowed from roughly what the echo server does when it hits an actual error:

ph_sock_enable(sock, false);                // stop dispatching events for this sock
ph_sock_shutdown(sock, PH_SOCK_SHUT_RDWR);  // shut down both directions
ph_mem_free(mt_parser, state->parser);      // release the per-request parser and state
ph_mem_free(mt_state, state);
ph_sock_free(sock);                         // release the sock itself

What I'm seeing is that, after a few requests, or if I do a big burst of them all at once, there's a failed assertion about a pending free on the socket:

1407245254.540 panic: sched/1 assertion job->epoch_entry.function == NULL failed: job is pending free at corelib/nbio/common.c:486
1407245254.540 panic: sched/1 Fatal error detected at:
1407245254.540 panic: sched/1 /usr/lib64/libphenom.so.0(ph_log_stacktrace+0x29) [0x7ffff7db45d9]
1407245254.544 debug: phenom:sched/0 fd=17 setting mask=1 timeout={60,0}
1407245254.540 panic: sched/1 /usr/lib64/libphenom.so.0(ph_panic+0xbe) [0x7ffff7db46ee]
1407245254.540 panic: sched/1 /usr/lib64/libphenom.so.0(+0x1482f) [0x7ffff7db682f]
1407245254.540 panic: sched/1 /usr/lib64/libphenom.so.0(ph_job_set_nbio_timeout_in+0x73) [0x7ffff7db6993]
1407245254.540 panic: sched/1 /usr/lib64/libphenom.so.0(ph_nbio_emitter_run+0xe0) [0x7ffff7db7730]
1407245254.540 panic: sched/1 /usr/lib64/libphenom.so.0(+0x14c51) [0x7ffff7db6c51]
1407245254.540 panic: sched/1 /usr/lib64/libphenom.so.0(+0x1a070) [0x7ffff7dbc070]
1407245254.540 panic: sched/1 /lib64/libpthread.so.0() [0x334da079d1]
1407245254.540 panic: sched/1 /lib64/libc.so.6(clone+0x6d) [0x334d2e8b6d]

The code to disconnect the socket is queued up in a separate job to try to let the NBIO stack do its thing, but I feel like I'm missing something here.

Is there something I'm missing that would let me shut down cleanly without this race condition? Or is my approach entirely wrong?

wez commented 10 years ago

We chatted about this on IRC; I'm just leaving some summary info here in case others stumble across this issue and wonder about the resolution.

The assert is most likely triggered by the unconditional disconnect job that your code schedules 50ms out. The recommended approach is to move to a WRITING state once you know that you are waiting for the write buffer in the sock to drain.
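
For concreteness, the states referenced below could live in a small per-connection enum next to your existing parser state. This is just an illustrative sketch; the names match the fragments that follow and are not part of libphenom:

typedef enum {
  READ_REQUEST,  // still reading/parsing the request
  WRITE_QUEUED,  // the file read job has queued the response
  WRITING        // waiting for the sock's write buffer to drain
} conn_state;

typedef struct {
  conn_state state;
  // ... parser and any other per-request data
} request_state;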

You can check whether it has drained using logic like this in your sock callback function:

if (state == WRITING && ph_bufq_len(sock->wbuf) == 0) {
  // We have drained our write buffer
}

If you want to detect when you can safely tear down the sock while other dependent/related jobs might be in flight, particularly if you are using ph_sock_wakeup to chain them together, you will want logic like this:

if (state == WRITING && ph_bufq_len(sock->wbuf) == 0) {
  // We have drained our write buffer, is it safe to shutdown?
  if (!ph_job_has_pending_wakeup(&sock->job)) {
    // We know that we have no other work in flight and that we are done
    // sending all of our data, so we can tear things down
    ph_sock_shutdown(sock, PH_SOCK_SHUT_RDWR);
    ph_sock_enable(sock, false);
    ph_sock_free(sock);
    return;
  }
}

Regarding this particular example of sending a payload to the client: I recommend avoiding additional timers; you don't really need them here. The workload is begging to run in a thread pool instead, so that the local file I/O doesn't block the NBIO scheduler threads and drag down overall service throughput.

Here's how I would approach this use case; in your sock dispatcher function, the logic would be something like this:

if (state == READ_REQUEST) {
   if (read_and_parse_request(...)) {
      // we know enough to dispatch it
      file_read_job = ph_job_alloc(&file_read_def);
      file_read_job->data = {whatever state we need};
      // disable the sock events while we wait for the file io stuff to complete
      ph_sock_enable(sock, false);
      ph_job_set_pool(file_read_job, file_read_pool);
      return;
   }
   // need more data
   return;
}
if (state == WRITE_QUEUED) {
   // we're waking up after the file read job is done.  Let's make sure
   // we're properly enabled
   ph_sock_enable(sock, true);
   state = WRITING;
}

if (state == WRITING) {
   // Stuff I pointed out above
}
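
To make the overall shape explicit, here is roughly how those fragments might hang together; this is only a sketch, assuming the sock callback signature used by the bundled echo example (request_processor is a hypothetical name, and the error/timeout branch mirrors the echo server's teardown):

static void request_processor(ph_sock_t *sock, ph_iomask_t why, void *m_arg) {
  request_state *state = m_arg;

  // On error or timeout, tear down right away, echo-server style
  if (why & (PH_IOMASK_ERR | PH_IOMASK_TIME)) {
    ph_sock_shutdown(sock, PH_SOCK_SHUT_RDWR);
    ph_sock_enable(sock, false);
    ph_sock_free(sock);
    return;
  }

  if (state->state == READ_REQUEST) {
    // read_and_parse_request(...) and hand off to the thread pool,
    // as sketched above
    return;
  }

  if (state->state == WRITE_QUEUED) {
    // we were woken by the file read job; re-enable sock events
    ph_sock_enable(sock, true);
    state->state = WRITING;
  }

  if (state->state == WRITING && ph_bufq_len(sock->wbuf) == 0 &&
      !ph_job_has_pending_wakeup(&sock->job)) {
    // drained, and nothing else in flight: safe to tear down
    ph_sock_shutdown(sock, PH_SOCK_SHUT_RDWR);
    ph_sock_enable(sock, false);
    ph_sock_free(sock);
  }
}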

The file_read_def should be a thread pool job definition; you should pre-create the thread pool during application initialization:

struct ph_job_def file_read_def = {
   do_file_read,
   ...
};

// These are application defaults for the queue size and number of threads,
// they can be overridden at deploy time using the json configuration file
file_read_pool = ph_thread_pool_define("fileread", 10240, 8);

And in do_file_read:

static void do_file_read(ph_job_t *job, ph_iomask_t why, void *data) {
   request_state *reqstate = data;

   ph_stream_t *f = ph_stm_file_open(.....);
   ph_stm_printf(reqstate->sock->stream, "stuff");

   ph_stm_close(f);

   reqstate->state = WRITE_QUEUED;

   ph_sock_wakeup(reqstate->sock);

   // we're done with this file read job
   ph_job_free(job);
}
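
Since the original question was about copying a file out to the client, the middle of do_file_read would more realistically loop over the file with ph_stm_read/ph_stm_write instead of printf'ing a literal. A hedged sketch (reqstate->filename is a hypothetical field holding the parsed request path, O_RDONLY comes from <fcntl.h>, and error handling is elided):

static void do_file_read(ph_job_t *job, ph_iomask_t why, void *data) {
  request_state *reqstate = data;
  char buf[8192];
  uint64_t nread, nwrote;

  ph_stream_t *f = ph_stm_file_open(reqstate->filename, O_RDONLY, 0);
  while (ph_stm_read(f, buf, sizeof(buf), &nread) && nread > 0) {
    // these writes land in the sock's write buffer; the WRITING state
    // above waits for that buffer to drain before tearing the sock down
    ph_stm_write(reqstate->sock->stream, buf, nread, &nwrote);
  }
  ph_stm_close(f);

  reqstate->state = WRITE_QUEUED;
  ph_sock_wakeup(reqstate->sock);

  // we're done with this file read job
  ph_job_free(job);
}
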
tobz commented 10 years ago

Also, for anyone who stumbles here: this is my resulting code after changing the original approach to be more state-machine-like and using a thread pool instead of a timer: https://gist.github.com/tobz/95fa46cf36e50d1cdda2

It properly disconnects now and supports thousands of requests per second without any double frees or segfaults or what have you. :)