raymondEhlers opened this issue 1 year ago
`performForkChildInitialize` in the stack trace makes me suspicious that this is something to do with parsl's not-very-clean/safe use of fork and threads.
Does the stack trace from SIGABRT look the same when using `spawn`?
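If it helps, here's a minimal sketch of forcing `spawn` on an HTEX config (assuming the `start_method` parameter that's on parsl master; the label and provider details are just placeholders):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_local",
            provider=LocalProvider(),
            # Forking a multi-threaded process is what macOS's objc runtime
            # aborts on; "spawn" starts workers via a fresh exec instead.
            start_method="spawn",
        )
    ]
)
```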
I'd expect `abort_with_reason` (also in the stack trace) to give some human-readable reason somewhere, because the signature looks like this:

```c
void abort_with_reason(uint32_t reason_namespace, uint64_t reason_code, const char *reason_string, uint64_t reason_flags) __attribute__((noreturn));
```
That message should also get stored/logged somewhere by `CRSetCrashLogMessage`, according to Apple's source code for that bit of objc.

But... I don't know where that reason gets logged/stored in a place that's accessible to you, because I've never worked with this bit of objc before. From your report, it looks like it's not the stderr of the worker process.
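For background, my working suspicion is the general macOS restriction rather than anything parsl-specific: if I'm reading the objc source right, the runtime deliberately aborts the fork()ed child of a multi-threaded process when a class's `+initialize` would have to run on the child side, and `performForkChildInitialize` in your trace looks like the function that enforces that. A toy sketch of the hazardous pattern (not parsl code, and it only actually aborts if the child ends up triggering objc initialization, e.g. via some macOS framework):

```python
import multiprocessing
import threading

def background() -> None:
    # Any live thread is enough to make the parent "multi-threaded".
    threading.Event().wait()

def child_task() -> None:
    # If anything here (directly, or via an imported C/C++/objc library)
    # triggers objc class initialization that didn't complete before the
    # fork, the child is aborted with "Namespace OBJC, Code 1".
    pass

if __name__ == "__main__":
    threading.Thread(target=background, daemon=True).start()
    ctx = multiprocessing.get_context("fork")  # "spawn" re-execs and avoids this
    p = ctx.Process(target=child_task)
    p.start()
    p.join()
```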
Can you paste in the full, complete output of the stack trace? (Preferably using `spawn`, because that's the cleanest for parsl, I think.)
Thanks for your response! I'm at a conference this week, and I'm having some trouble immediately catching the segfault. I will have to come back to this, but I will collect the info as soon as possible.
In the meantime, I found the whole stack trace in my terminal scrollback (I think it was from using `fork`, but I'm not 100% sure at this point; I believe `fork` and `spawn` produced the same trace, but I'll need to check):
One stray question: for debugging, I believe I only have one core assigned, but I see three `process_worker_pool.py` processes. Is that expected? (i.e. 3 processes for 1 task/core. edit: maybe it's 1 process per core + 2 overhead?) It makes attaching the debugger a bit more involved, so it would be nice if I could attach to just one. Thanks!
edit: I suppose this is a separate bug report, but on the parsl master, I've seen the logs blow up when it can't record the IO resources. They grow to multiple GBs in a few minutes:
I tried again and was able to catch the segfault. The trace when using `spawn` is below:
As far as I can tell, this seems to be identical to what I posted before with `fork`. I found more in Console, which has the following information:
```
System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Reason:    Namespace OBJC, Code 1

Application Specific Information:
*** multi-threaded process forked ***
crashed on child side of fork pre-exec

Kernel Triage:
VM - pmap_enter failed with resource shortage
VM - pmap_enter failed with resource shortage
VM - pmap_enter failed with resource shortage
VM - pmap_enter failed with resource shortage
VM - pmap_enter failed with resource shortage
```
<And then the stack trace that I copied above is here>
Unfortunately, I wasn't able to find this definition by searching around, but based on what you found above, I guess this should be enough to track it down? This may be it: https://github.com/showxu/objc4/blob/b73f5d4700db192ffdc91b5ead36f3ddf8bfe174/objc4/runtime/objc-internal.h#L54. If so, it doesn't seem especially helpful :-(
**Is your feature request related to a problem? Please describe.**
I've used parsl successfully on a number of clusters with Slurm (thanks!) using HTEX + `SlurmProvider`. I recently tried to use some of this code to run some smaller tasks on my M1 MacBook (using HTEX + `LocalProvider`), and when running a particular set of tasks that work with HTEX + `SlurmProvider`, the worker(s) on my mac are always lost (i.e. `WorkerLost: Task failure due to loss of worker 0 on host ...`). Since it's a python app, I can't retrieve the app log. The worker logs just say that the worker died while running a task. (This is somewhere between a question, a feature request, and a bug report.)

**Describe the solution you'd like**
Some additional documentation on techniques for debugging lost workers and/or python apps when you need the logs would be extremely helpful. (I did get a bit of info from the submit script stderr, which suggests a fork issue - see below.) My usual strategy for debugging a python app is just to comment out the `python_app` wrapper and run it directly on my local node (sketched below). However, in this case the app works correctly when run that way, so this doesn't tell me anything; additional suggestions on techniques would be great!

**Describe alternatives you've considered**
I've looked through everything that I know of in the runinfo. The worker logs just say that the worker received the task, and then the worker died. There's some info from the stderr, but without further context on what the app was doing at that point, it's hard to determine more.
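For illustration, the comment-out-the-wrapper technique mentioned above looks like this (the task name and body here are hypothetical stand-ins for my real code):

```python
from parsl import python_app

# @python_app  # commented out so the task runs in the current process
def my_task(n: int) -> int:
    # ... the real (much more involved) task body goes here ...
    return n * n

# With the decorator disabled, this runs locally and synchronously, so
# print statements, pdb, and stack traces are all directly available.
print(my_task(3))
```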
**Additional context**
Everything appears fine in the runinfo logs until the worker dies, except for the submit script stderr, which has this info:
This seems to suggest some issue with forking, so I grabbed the parsl master and tried switching the `start_method` to `spawn` and `thread`. This error disappeared from the stderr, but both methods still suffered from the same issue of the worker dying.

I managed to attach `lldb` to the `process_worker_pool.py` workers, which indicates a `signal SIGABRT`. The end of the trace is below. It may be overly focused on my case, but I include it here for completeness.

Unfortunately, I haven't been able to find an easy reproducer (my code is far too involved for one), which is why I'm not filing this as a bug report (plus, although I suspect forking within parsl to be the issue, I can't rule out other things definitively). I've found that simple tasks work fine, but tasks which involve loading C++ code fail (mainly through lazily loaded ROOT, which in the past has always been enough to avoid issues when loading in this manner). edit: using `ThreadPoolExecutor` works, so that suggests it's something HTEX is doing? (My workload is CPU limited, so this isn't really a viable workaround as far as I can tell.)
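For reference, the thread-based configuration I used for that check was essentially this minimal sketch:

```python
from parsl.config import Config
from parsl.executors.threads import ThreadPoolExecutor

# Tasks run as threads inside the submitting process, so nothing forks;
# this avoids the crash, but CPU-bound tasks then contend for the GIL.
config = Config(
    executors=[ThreadPoolExecutor(max_threads=4, label="local_threads")]
)
```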
I understand that you're unlikely to be able to solve this directly, but any suggestions on how to further understand this issue or work around it would be greatly appreciated! Thanks in advance!
**Versions:**
- parsl: I've tried with 6d1f9160d487b3265f6e9d65ebb357837a437c30, as well as 56491bc2d7909191348ebdaf9330d3f2d06845b3 (current master)
- macOS: 12.5.1 Monterey
- python: 3.10.7 via conda