AVSLab / basilisk

Astrodynamics simulation framework
https://hanspeterschaub.info/basilisk
ISC License

Spice makes BSK simulations not thread safe #220

Open bbercoviciUspace opened 1 year ago

bbercoviciUspace commented 1 year ago

Describe the bug I ran a parallel simulation featuring multiple instances of SimulationBaseClass, each with its own spiceInterface module. I expected all simulations to complete without issue, but instead the multiprocessing.imap call driving the individual simulation executors came to a stall, with occasional error messages popping up in the terminal.

The data kernel de430.bsp even got corrupted in the process: its sha1sum turned out different from the one shipped with Basilisk.

To reproduce Steps to reproduce the behavior:

  1. Add module 'spiceInterface' to a 'SimulationBaseClass' instance so that its Reset() method gets called
  2. Spawn multiple instances of the simulation through multiprocessing.imap or multiprocessing.pool
  3. Wait long enough for errors to show up
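
For reference, the driver shape that triggers the stall looks roughly like this. The `run_one_sim` worker is a hypothetical stand-in (not Basilisk code) for a function that builds a SimulationBaseClass with a spiceInterface module and executes it:

```python
import multiprocessing

def run_one_sim(run_index):
    # Hypothetical stand-in: in the real reproduction this would build a
    # SimulationBaseClass instance with a spiceInterface module (whose
    # Reset() furnishes the SPICE kernels) and run it to completion.
    return run_index

if __name__ == "__main__":
    with multiprocessing.Pool(processes=4) as pool:
        # With the real worker, enough iterations eventually stall here
        # and can leave de430.bsp corrupted on disk.
        for result in pool.imap(run_one_sim, range(100)):
            pass
```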

Expected behavior Basilisk is advertised as capable of running MC simulations, but I don't see how the current architecture overcomes SPICE's documented lack of thread safety. I would expect parallelized simulations to complete without issue, but the execution stall and the kernel data corruption show that something is wrong.

Screenshots See https://github.com/AVSLab/basilisk/discussions/203


Additional context The issue seems to be heavily tied to SPICE's library initialization, where the data kernels are 'furnished'. It might be enough to spawn one instance of SimulationBaseClass outside of the parallel section and call its spiceInterface's Reset() before the parallel processing takes place. If steps are then taken to inhibit the spiceInterface's loadSpiceData() function inside the parallel section, the simultaneous 'furnishing' of the data kernels should be avoided.

Additional caution may be needed regarding the unloading of the data kernels.
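
This pre-furnishing workaround could look roughly like the sketch below. Everything here is a stand-in (a module-level flag instead of real SPICE state), and it assumes the POSIX 'fork' start method so children inherit the parent's loaded state; whether inherited CSPICE state is actually valid after fork is exactly what would need testing:

```python
import multiprocessing

KERNELS_FURNISHED = False  # stand-in for SpiceInterface's loaded-kernel state

def furnish_kernels_once():
    # Stand-in for creating one SimulationBaseClass outside the parallel
    # section and calling spiceInterface's Reset(), so the kernels are
    # 'furnished' exactly once before any workers exist.
    global KERNELS_FURNISHED
    KERNELS_FURNISHED = True

def run_one_sim(run_index):
    # With the 'fork' start method the child inherits the parent's state,
    # so loadSpiceData() could be skipped here instead of re-furnishing.
    assert KERNELS_FURNISHED
    return run_index

if __name__ == "__main__":
    furnish_kernels_once()
    ctx = multiprocessing.get_context("fork")  # POSIX only
    with ctx.Pool(processes=4) as pool:
        pool.map(run_one_sim, range(8))
```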

juan-g-bonilla commented 1 year ago

A possible solution is to ensure furnsh_c and unload_c are protected by a std::mutex that prevents different threads from calling them at the same time. While this does not guarantee thread safety (SPICE makes no such guarantees in general), it might make things safe enough for our purposes.

This might slow down parallel simulations, as every simulation has to wait for the rest to load their kernels. However, we can exploit the static/global nature of SPICE to load kernels only once, instead of for every simulation:

std::mutex SpiceInterface::kernel_manipulation_mutex;
std::map<std::string, int> SpiceInterface::kernel_reference_counter;

int SpiceInterface::loadSpiceKernel(char *kernelName, const char *dataPath)
{
    std::string filepath = ...;

    // RAII lock: released automatically, even if loading throws
    std::lock_guard<std::mutex> guard(SpiceInterface::kernel_manipulation_mutex);
    SpiceInterface::kernel_reference_counter.try_emplace(filepath, 0);
    if (SpiceInterface::kernel_reference_counter.at(filepath) <= 0)
    {
        // actually load kernel
    }
    SpiceInterface::kernel_reference_counter[filepath]++;
    return 0;
}

int SpiceInterface::unloadSpiceKernel(char *kernelName, const char *dataPath)
{
    std::string filepath = ...;

    std::lock_guard<std::mutex> guard(SpiceInterface::kernel_manipulation_mutex);
    SpiceInterface::kernel_reference_counter.try_emplace(filepath, 0);
    SpiceInterface::kernel_reference_counter[filepath]--;
    if (SpiceInterface::kernel_reference_counter.at(filepath) <= 0)
    {
        // actually unload kernel
    }
    return 0;
}

The above code makes SpiceInterface keep a static count of how many times each kernel has been requested. The kernel is only loaded the first time it is needed, and unloaded once every simulation has called unloadSpiceKernel. Assuming that loading and unloading each happen only once, all remaining operations are O(1), so passing through the lock should not meaningfully slow down the threads.

Possible break points:

I haven't run any tests; this is just brainstorming, but I could take a stab at the implementation.

patkenneally commented 1 year ago

Hi @bbercoviciUspace, I'll try to replicate this behavior. I have some suspicions as to the cause. If you have additional ideas to reduce the surface area of the investigation, do share them here.

Hi @juan-g-bonilla 👋 As you point out, a mutex is a technique for thread control. I'm excited that we're "putting our heads together" on this! However, I think we need to understand the problem and its specific cause before prototyping a specific fix. Just trying to save you potentially wasted time in case the solution is not in the realm of a mutex 🙂

atharris-work commented 3 months ago

Having also encountered this, I'm 99% sure Ben is right about the issue's source: SPICE instances in different threads crash when they simultaneously attempt to load kernels during initialization. Running a large number of short sims with pool.map inside a while loop should replicate the error, at least on Ubuntu 22.04. It could also be a function of random simulation end times; I don't know if there's some OS-level deconfliction that could happen with simultaneous reads rather than near-simultaneous ones.

It would be nice to have a thread-safe solution for planet orientations/positions. Nyx Space's ANISE (https://github.com/nyx-space/anise) seems like it could be one solution (also written by an AVS alum). You can always work around the core thread safety issue by maintaining a separate copy of the kernel files on disk for each thread, but that becomes impractical rather quickly, especially for large kernels like de430.bsp.
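
The copy-per-worker workaround mentioned above could look roughly like this sketch (the function name and directory layout are hypothetical, and as noted it scales poorly with kernel size):

```python
import os
import shutil
import tempfile

def private_kernel_copy(kernel_path, worker_id):
    # Workaround sketch: give each worker its own copy of the kernel file,
    # so no two SPICE instances ever touch the same file on disk.
    workdir = os.path.join(tempfile.gettempdir(), f"bsk_worker_{worker_id}")
    os.makedirs(workdir, exist_ok=True)
    private_path = os.path.join(workdir, os.path.basename(kernel_path))
    if not os.path.exists(private_path):
        shutil.copy2(kernel_path, private_path)
    return private_path  # furnish this path instead of the shared one
```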

juan-g-bonilla commented 3 months ago

@atharris-work thank you for bringing this issue up again and for your investigation. We have encountered similar problems when working with SPICE at my lab at JPL, but haven't implemented a solution since it happens rarely for us. We have brainstormed that a locking mechanism around the kernel loading operation would be a simple yet effective fix: either a single lock for all files (simpler implementation, faster for batch loading multiple files) or a lock per file (less contention). To complicate things further, this error also appears across multiple processes, so a cross-process locking mechanism might be necessary for a full fix. We would be interested in whatever solution is implemented for Basilisk, so please add me as a reviewer if you write a PR.
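
A cross-process lock could be sketched with an advisory file lock, for example (POSIX-only, and the sentinel-file path is a made-up example):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def cross_process_kernel_lock(lock_path):
    # Sketch of a cross-process lock: every process takes an exclusive
    # flock() on a sentinel file before furnishing or unloading kernels,
    # so no two processes manipulate SPICE's file state at the same time.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no one else holds it
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Intended usage (kernel furnishing elided):
#     with cross_process_kernel_lock("/tmp/bsk_spice.lock"):
#         ...  # furnish or unload kernels here
```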

As for changing the library from SPICE to something else: I think it might be a bit radical for solving such a rare issue if a simpler fix is available. A C++ thread-safe SPICE implementation should be released by NAIF at some point in the future anyway, so we might wait until then.