TRIQS / tprf

TPRF: The Two-Particle Response Function tool box for TRIQS
https://triqs.github.io/tprf

Exception inside OpenMP loops cause terminate #44

Open hmenke opened 10 months ago

hmenke commented 10 months ago

Description

Sometimes we make mistakes, which lead TRIQS to throw exceptions. For example, trying to Fourier transform a Green's function that has only a single frequency produces something like this:

Example Python script

```python
import numpy as np
from triqs.gf import MeshImFreq
from triqs_tprf.tight_binding import TBLattice

t = 1.0
H = TBLattice(
    units = [(1, 0, 0), (0, 1, 0)],
    hopping = {
        # nearest neighbour hopping -t
        ( 0,+1): -t * np.eye(2),
        ( 0,-1): -t * np.eye(2),
        (+1, 0): -t * np.eye(2),
        (-1, 0): -t * np.eye(2),
    },
    orbital_positions = [(0,0,0)]*2,
    orbital_names = ['up', 'do'],
)

kmesh = H.get_kmesh(n_k=(32, 32, 1))
e_k = H.fourier(kmesh)

from triqs_tprf.lattice import lattice_dyson_g0_wk, fourier_wk_to_wr, fourier_wr_to_tr

iw_mesh = MeshImFreq(beta=10.0, S='Fermion', n_max=1)
g0_wk = lattice_dyson_g0_wk(mu=0.0, e_k=e_k, mesh=iw_mesh)
g0_wr = fourier_wk_to_wr(g0_wk)
g0_tr = fourier_wr_to_tr(g0_wr)
```

libc++abi: terminating due to uncaught exception of type triqs::runtime_error: Triqs runtime error
    at /usr/include/triqs/./mesh/./tail_fitter.hpp : 157

Insufficient data points for least square procedure
Exception was thrown on node 
Aborted (core dumped)

However, as you can see, this is not a Python exception: an unhandled C++ exception has caused the entire process to abort. This is quite annoying when prototyping in a Jupyter notebook, because every time it happens the entire Jupyter kernel dies.

After some digging I found that this happens because exceptions are not allowed to leave OpenMP parallel regions. From the OpenMP specification:

A throw executed inside a parallel region must cause execution to resume within the same parallel region, and the same thread that threw the exception must catch it.

Steps to Reproduce

Catching an exception outside the parallel region in which it was thrown does not work: the exception escapes the region uncaught, std::terminate is invoked, and abort() is called.

#include <iostream>
#include <stdexcept>

void do_stuff(int i) {
    if (i == 5) {
        throw std::out_of_range("oops");
    }
}

int main() {
    try {
        #pragma omp parallel for
        for (int i = 0; i < 10; ++i) {
            do_stuff(i);
        }
    } catch (std::exception const &e) {
        std::cout << "Exception occurred: " << e.what() << "\n";
    }
}

One possibility would be to equip every parallel region with a std::exception_ptr that stores the last exception caught inside the region and rethrows it once the region has ended. This does not cover the case of multiple (possibly different) exceptions being thrown on different threads, but I also don't see a straightforward way to convert a stack of C++ exceptions into a Python exception.

#include <exception>
#include <iostream>
#include <stdexcept>

void do_stuff(int i) {
    if (i == 5) {
        throw std::out_of_range("oops");
    }
}

int main() {
    try {
        std::exception_ptr eptr;
        #pragma omp parallel for
        for (int i = 0; i < 10; ++i) {
            try {
                do_stuff(i);
            } catch (...) {
                #pragma omp critical
                eptr = std::current_exception();
            }
        }
        if (eptr) {
            std::rethrow_exception(eptr);
        }
    } catch (std::exception const &e) {
        std::cout << "Exception occurred: " << e.what() << "\n";
    }
}
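
The limitation noted above, that only one exception survives, could be worked around by collecting every failure and raising a single combined exception. The following is only a sketch under stated assumptions: `parallel_for_collect` and `run_example` are hypothetical names (nothing here is part of TRIQS or TPRF), and the original exception types are deliberately flattened into one `std::runtime_error` whose message lists the failures in iteration order.

```cpp
#include <algorithm>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: runs body(i) for i in [0, n) in an OpenMP parallel loop.
// Every exception is caught on the thread that threw it, and its message is
// recorded together with the iteration index. After the loop, one
// std::runtime_error listing all failures (sorted by index) is thrown.
template <typename Body>
void parallel_for_collect(int n, Body body) {
    std::vector<std::pair<int, std::string>> failures;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        try {
            body(i);
        } catch (std::exception const &e) {
            // Serialize access to the shared vector.
            #pragma omp critical(collect_failures)
            failures.emplace_back(i, e.what());
        } catch (...) {
            #pragma omp critical(collect_failures)
            failures.emplace_back(i, "unknown exception");
        }
    }
    if (failures.empty()) return;
    // Threads finish in arbitrary order, so sort to get a stable message.
    std::sort(failures.begin(), failures.end());
    std::string msg;
    for (auto const &f : failures) {
        msg += "iteration " + std::to_string(f.first) + ": " + f.second + "\n";
    }
    throw std::runtime_error(msg);
}

// Example use: two iterations fail with different exception types.
// Returns the combined message, or an empty string if nothing was thrown.
std::string run_example() {
    try {
        parallel_for_collect(10, [](int i) {
            if (i == 3) throw std::out_of_range("oops");
            if (i == 7) throw std::invalid_argument("bad input");
        });
    } catch (std::runtime_error const &e) {
        return e.what();
    }
    return "";
}
```

The trade-off is that the caller can no longer distinguish exception types; if that matters, storing `std::exception_ptr` values instead of messages would preserve them, at the cost of having to decide which one to rethrow.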

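When individual loop iterations take a long time, performance in the exceptional case can be improved further by using OpenMP cancellation points, so that the remaining iterations are skipped once an exception has been caught. However, this requires that the user enables cancellation in the environment, e.g. by exporting OMP_CANCELLATION=true (the OpenMP specification defines the values true and false; whether 1 is also accepted depends on the runtime).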

#include <chrono>
#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

void do_stuff(int i) {
    using namespace std::chrono_literals;
    std::this_thread::sleep_for(i*10ms);
    if (i == 5) {
        throw std::out_of_range("oops");
    }
}

int main() {
    try {
        std::exception_ptr eptr;
        #pragma omp parallel for
        for (int i = 0; i < 100; ++i) {
            try {
                do_stuff(i);
            } catch (...) {
                #pragma omp critical
                eptr = std::current_exception();
                #pragma omp cancel for
            }
            #pragma omp cancellation point for
        }
        if (eptr) {
            std::rethrow_exception(eptr);
        }
    } catch (std::exception const &e) {
        std::cout << "Exception occurred: " << e.what() << "\n";
    }
}
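
Because the cancel construct is simply ignored when cancellation is disabled, it can be useful to check the setting at run time. A minimal sketch, assuming an OpenMP 4.0 or later runtime: `omp_get_cancellation` is the standard OpenMP runtime routine, while `cancellation_enabled` and `describe_cancellation` are hypothetical helper names introduced here for illustration.

```cpp
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: reports whether the OpenMP runtime will honour
// `#pragma omp cancel`. Returns false when compiled without OpenMP.
bool cancellation_enabled() {
#ifdef _OPENMP
    return omp_get_cancellation() != 0;
#else
    return false;
#endif
}

// Human-readable summary, e.g. for a one-time warning at start-up.
std::string describe_cancellation() {
    return cancellation_enabled()
        ? "OpenMP cancellation is enabled"
        : "OpenMP cancellation is disabled; set OMP_CANCELLATION=true to enable it";
}
```

With cancellation disabled, the loop above still produces the correct exception; it just runs all remaining iterations before rethrowing.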

Expected behavior: Get a Python exception

Actual behavior: Unhandled C++ exception causes abort()

Versions

$ python3 -c "from triqs_tprf.version import *; show_version(); show_git_hash();"

You are using triqs_tprf version 3.2.0

You are using triqs_tprf git hash ce36521536d8b7acdcb693fe2d0d15135ecb16fd based on triqs git hash e1fa5dd2c8984e334574163f6323e956a49ffbd5

$ grep VERSION= /etc/os-release 
VERSION="20.04.4 LTS (Focal Fossa)"
