LifeboatLLC / Experimental

0 stars 4 forks source link

Infinite Loop on library close due to improper error stack cleanup in threads #13

Open mattjala opened 1 week ago

mattjala commented 1 week ago

If an HDF5 error is emitted in a second thread, then at library close time an infinite loop will occur. H5E_term_package() will repeatedly try and fail to clean up its open identifiers, and an infinite loop will occur.

More specifically, if an error record is placed onto the thread-local error stack in the new thread, the error class and its major and minor message associated with that record will never have their reference count in H5I properly decremented. When they should be cleaned up during library terminations, H5E_term_package() invokes H5I_clear_type() with force=false, so the left over references prevent them from being released.

The root cause is that the end of the thread doesn't trigger any kind of cleanup of the thread-local error stack. I don't think the error stack can be cleaned up properly after thread exit since error stacks aren't stored in H5I (as part of the multi-threading changes, to prevent potential exposure between threads).

This issue can go unnoticed if another API routine is used after the error is generated. In that case, the call to H5E_clear_stack() made by H5_API_ENTER will clear the entries and prevent them from sticking around during library termination.

Build flags used:

--enable-multithread --enable-tests --enable-shared --disable-hl --disable-static --enable-build-mode=debug

Minimal reproduction script:

#include "hdf5.h"
#include <pthread.h>

void *generate_hdf5_error(void *arg) {
  H5Fget_obj_count(H5I_INVALID_HID, H5F_OBJ_ALL);

  return NULL;
}

int main() {
  pthread_t thread;
  void *thread_return = NULL;

  /* Open library and prevent close */
  hid_t file_id = H5Fcreate("err_test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

  pthread_create(&thread, NULL, generate_hdf5_error, (void *)NULL);
  pthread_join(thread, &thread_return);

  H5Fclose(file_id);

}

Output:

HDF5-DIAG: Error detected in HDF5 (1.14.2) thread 1:
  #000: H5F.c line 246 in H5Fget_obj_count(): not a file id
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5: infinite loop closing library
      L,T_top,P,P,Z,FD,VL,VL,PL,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E,E
Aborted (core dumped)
mattjala commented 1 week ago

This issue arises whenever threadsafety or multithreading is turned on, in the experimental or main branch, because both options enable thread-local error stacks instead of the global error stack.

The changes to error handling in this commit mitigate the problem somewhat, because they remove the reference counting on the library-internal error classes and error messages. With that change, this test program doesn't crash on close, but does leak the memory used for the description of the error record in the stack. However, the underlying issue still exists; if a user registered their own error class or major/minor error messages, and those were pushed onto the error stack, then the infinite close would still occur.

The memory for the error stack itself is not leaked, because the top-level buffer for the error stack is freed by H5TS__key_destructor(). This routine just frees all the buffers associated with an active thread-local key, so it doesn't do any object-specific cleanup. And because the value of the key is set to NULL before the destructor is invoked, there's no safe way to check at key destruction whether the provided key_val is the error stack, so we can't do any error-stack specific cleanup.

pthread_cleanup_push() sets up a thread close handler which could in principle handle a problem like this, but it has to be paired with an invocation of pthread_cleanup_pop() at the name nesting level. So unless the library is the one in control of spawning the thread in the first place, this doesn't seem like it can help.

There is a potential hacky workaround. In order to deal with the fact that there's no set order in which the destructors set up by pthread_key_create() will operate on their keys, we could create two separate thread local keys referring to the error stack, and do something like this:

/* These two point to the same underlying threadlocal error stack */
H5_errstk_key_1_g;
H5_errstk_key_2_g;

...

H5TS__key_destructor(void *key_val) {
    H5E_t *err_stk = NULL;

    /* If this is not the error stack, do regular free */
    // if both key_1 and key_2 still exist, or neither exist, then key_val is not the error stack
    if (key_1 and key_2 both NULL or both non-NULL) {
        free(key_val);
    } else {
        /* This key destructor is acting on key1 or key2, and pthreads already released one of them. */
        if (NULL == pthread_getspecific(H5_errstk_key_1_g)) {
            /* This destructor was invoked on key 1 */
            err_stk = (H5E_t*) pthread_getspecific(H5_errstk_key_2_g);

            /* Prevent the later key_destructor invocation on the other errstk key from repeating this */
            pthread_setspecific(H5_errstk_key_2_g, NULL);

        } else if (NULL == pthread_getspecific(H5_errstk_key_2_g)) {
            /* This destructor is invoked on key 2 */ 
            err_stk = (H5E_t*) pthread_getspecific(H5_errstk_key_1_g);

            /* Prevent the later key_destructor invocation on the other errstk key from repeating this */
            pthread_setspecific(H5_errstk_key_1_g, NULL);
        }

        /* Release all individual records */
        H5E_clear_stack(err_stk);

        /* Free top-level stack buffer */
        free(err_stk);
    } 

}