bbercoviciUspace opened 1 year ago
A possible solution is to ensure `furnsh_c` and `unload_c` are protected by a `std::mutex`, which prevents different threads from calling them at the same time. While this does not guarantee thread safety (SPICE makes no thread-safety guarantees in general), it might be safe enough for our purposes.
This could slow down parallel simulations, since every simulation would have to wait for the rest to load their kernels. However, we can exploit the static/global nature of SPICE to load each kernel only once, instead of once per simulation:
```cpp
std::mutex SpiceInterface::kernel_manipulation_mutex;
std::map<std::string, int> SpiceInterface::kernel_reference_counter;

int SpiceInterface::loadSpiceKernel(char *kernelName, const char *dataPath)
{
    std::string filepath = ...;
    // lock_guard releases the mutex on every return path
    std::lock_guard<std::mutex> guard(SpiceInterface::kernel_manipulation_mutex);
    SpiceInterface::kernel_reference_counter.try_emplace(filepath, 0);
    if (SpiceInterface::kernel_reference_counter.at(filepath) <= 0)
    {
        // actually load kernel (furnsh_c)
    }
    SpiceInterface::kernel_reference_counter[filepath]++;
    return 0;
}

int SpiceInterface::unloadSpiceKernel(char *kernelName, const char *dataPath)
{
    std::string filepath = ...;
    std::lock_guard<std::mutex> guard(SpiceInterface::kernel_manipulation_mutex);
    SpiceInterface::kernel_reference_counter.try_emplace(filepath, 0);
    SpiceInterface::kernel_reference_counter[filepath]--;
    if (SpiceInterface::kernel_reference_counter.at(filepath) <= 0)
    {
        // actually unload kernel (unload_c)
    }
    return 0;
}
```
The above code makes `SpiceInterface` keep a static counter of how many times each kernel has been requested. A kernel is loaded only the first time it is needed, and it is unloaded once every simulation has called its `unloadSpiceKernel`. Assuming that loading/unloading each happen only once, the remaining operations are O(1), so going through the lock should not noticeably slow down the threads.
Possible break points:
- Calling `unloadSpiceKernel` more than once, which would de-synchronize the counter from the actual kernel needs and unload the kernel too early. This will cause the simulation to fail.
- Calling `loadSpiceKernel` without ever calling `unloadSpiceKernel`, which would leave kernels loaded despite them not being necessary anymore. This will not cause the simulation to fail, but the kernels will not unload and might cause (mild?) memory issues.

I haven't run any tests, this is just brainstorming, but I could take a stab at the implementation.
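Both break points come down to unbalanced load/unload calls. A minimal sketch of how a context manager could guarantee balance, expressed in Python since Basilisk simulations are driven from there; the `SpiceKernelPool` class and its load/unload hooks are illustrative stand-ins, not Basilisk or SPICE API:

```python
import threading
from contextlib import contextmanager

class SpiceKernelPool:
    """Reference-counted kernel bookkeeping guarded by one lock."""
    def __init__(self):
        self._lock = threading.Lock()
        self._refs = {}      # filepath -> reference count
        self.loaded = set()  # stand-in for kernels furnished in SPICE

    def load(self, filepath):
        with self._lock:
            self._refs[filepath] = self._refs.get(filepath, 0) + 1
            if self._refs[filepath] == 1:
                self.loaded.add(filepath)      # would call furnsh_c here

    def unload(self, filepath):
        with self._lock:
            self._refs[filepath] -= 1
            if self._refs[filepath] <= 0:
                self.loaded.discard(filepath)  # would call unload_c here
                del self._refs[filepath]

    @contextmanager
    def kernel(self, filepath):
        """Balanced load/unload: unload runs exactly once, even on error."""
        self.load(filepath)
        try:
            yield
        finally:
            self.unload(filepath)

pool = SpiceKernelPool()
with pool.kernel("de430.bsp"):
    assert "de430.bsp" in pool.loaded
assert "de430.bsp" not in pool.loaded
```

Since the unload is tied to scope exit, neither a forgotten nor a duplicated `unloadSpiceKernel` call is possible from user code.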
Hi @bbercoviciUspace , I'll try to replicate this behavior. I have some suspicions as to the cause. If you have additional ideas to reduce the surface area of investigation, do share them here.
Hi @juan-g-bonilla 👋 As you point out, a mutex is a technique for thread control. I'm excited that we're "putting our heads together" on this! However, I think we need to understand the problem and its specific cause before prototyping a specific fix. Just trying to save you potentially wasted time here in the case that the solution is not in the realm of a mutex 🙂
Having also encountered this, I'm 99% sure Ben is right about the issue's source (SPICE in different threads crashing when they simultaneously attempt to load kernels during initialization). Running a large number of short sims with `pool.map` inside a `while` loop should replicate the error, at least on Ubuntu 22.04. It could also be a function of random simulation end times - I don't know if there's some OS-level deconfliction that could happen when you have simultaneous reads rather than near-simultaneous ones.
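For reference, the replication recipe above would look roughly like this. `run_short_sim` is a placeholder: in practice it would construct a `SimulationBaseClass` with a `spiceInterface` module and run it, and with the real module the concurrent kernel loads during `Reset()` are what eventually crash a worker:

```python
import multiprocessing as mp

def run_short_sim(seed):
    # placeholder for a short Basilisk simulation; the real version
    # would furnish and unload SPICE kernels on every call
    return seed

if __name__ == "__main__":
    results = []
    attempts = 0
    while attempts < 3:  # in practice, loop until a worker crashes
        with mp.Pool(4) as pool:
            results = pool.map(run_short_sim, range(8))
        attempts += 1
    print(results)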
It would be nice to have a thread-safe solution for planet orientations/positions. Nyx Space's ANISE (https://github.com/nyx-space/anise) seems like it could be one solution (also written by an AVS alum). You can always work around the core thread safety issue by maintaining a separate copy of the kernel files on disk for each thread, but that becomes impractical rather quickly, especially for large kernels like `de430.bsp`.
@atharris-work thank you for bringing up this issue again and for your investigation. We have encountered similar problems when working with SPICE at my lab at JPL, but haven't implemented a solution since it happens rarely for us. We have brainstormed that a locking mechanism around the kernel loading operation would be a simple yet effective fix, either the same lock for all files (simpler implementation, faster for batch loading multiple files) or a lock per file (less contention). To complicate things further, this error also appears with multiple processes, so a cross-process locking mechanism might be necessary for a full fix. We would be interested in whatever solution is implemented for Basilisk, so please do add me as a reviewer if you write a PR.
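One way to get the cross-process lock mentioned above on POSIX systems is an advisory `flock()` on a sidecar lock file. This is just a sketch of the idea, not an existing Basilisk mechanism, and the lock-file path is an assumption:

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def cross_process_lock(lock_path):
    """Exclusive advisory lock shared by all processes on this host."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# usage sketch: serialize kernel loading across processes
with cross_process_lock("/tmp/spice_kernels.lock"):
    pass  # furnsh_c (or spiceypy.furnsh) calls would go here
```

A per-file variant would simply derive `lock_path` from the kernel filename, trading simplicity for less contention, as described above.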
As for changing the library from SPICE to something else: I think it might be a bit radical for solving such a rare issue if a simpler fix is available. A C++ thread-safe SPICE implementation should be released by NAIF at some point in the future anyway, so we might wait until then.
**Describe the bug** I ran a parallel simulation featuring multiple instances of `SimulationBaseClass`, each of them featuring a `spiceInterface` module. I was expecting all simulations to complete without hurdle, but ended up seeing the `multiprocessing.imap` holding the individual simulation executors come to a stall, with occasional error messages popping up in the terminal. The data kernel `de430.bsp` even got corrupted in the process, as its `sha1sum` turned out different from the one shipped with Basilisk.
**To reproduce** Steps to reproduce the behavior:
1. set up multiple simulations whose `spiceInterface` `Reset()` method gets called
2. run them in parallel via `multiprocessing.imap` or `multiprocessing.pool`
**Expected behavior** Basilisk is advertised as capable of running MC simulations, but I don't see how the current architecture overcomes SPICE's advertised non-thread-safe nature. I would expect parallelized simulations to complete without hurdle, but the execution stall as well as the kernel data corruption go to show there is a snag.
**Screenshots** See https://github.com/AVSLab/basilisk/discussions/203
**Additional context** The issue with SPICE seems to be heavily tied to the library initialization where the data kernels are 'furnished'. It might just do the trick to spawn one instance of `SimulationBaseClass` outside of the parallel section and have a `spiceInterface`'s `Reset()` function be called inside this specific instance, before the parallel processing takes place: if steps are taken to inhibit the `spiceInterface`'s `loadSpiceData()` functions inside the parallel processing, I think the issue of the simultaneous 'furnishing' of the data kernels will be alleviated. Maybe there's additional caution to be taken regarding the unloading of the data kernels.