Closed: Justintime50 closed this issue 1 year ago
I thought I fixed this via https://github.com/Justintime50/harvey/commit/53c097c326b4a308c423755263bb36f9f7633d5a#diff-56e74ba8155db5498d52b15ac4600497a1f11c57a2e91c9af11f6ed5b7137b08R111, but it appears that builds can still get stuck randomly. I've traced the code and am unsure why this is happening, because the builds should continue to run just fine.
Aha, FINALLY found where the problem was occurring, not necessarily what the problem is:
Pulling Harvey config for justintime50/os-scripting from webhook...
!!! uWSGI process 76997 got Segmentation Fault !!!
*** backtrace of 76997 ***
0 uwsgi 0x0000000102a2a09c uwsgi_backtrace + 52
1 uwsgi 0x0000000102a2a5b0 uwsgi_segfault + 56
2 libsystem_platform.dylib 0x00000001843802a4 _sigtramp + 56
3 libdispatch.dylib 0x00000001841df900 _dispatch_apply_with_attr_f + 1096
4 libdispatch.dylib 0x00000001841dfb48 dispatch_apply + 108
5 CoreFoundation 0x0000000184557eb4 __103-[CFPrefsSearchListSource synchronouslySendSystemMessage:andUserMessage:andDirectMessage:replyHandler:]_block_invoke.52 + 132
6 CoreFoundation 0x00000001843e7a40 CFPREFERENCES_IS_WAITING_FOR_SYSTEM_AND_USER_CFPREFSDS + 100
7 CoreFoundation 0x00000001845570e4 -[CFPrefsSearchListSource synchronouslySendSystemMessage:andUserMessage:andDirectMessage:replyHandler:] + 232
8 CoreFoundation 0x00000001843e6160 -[CFPrefsSearchListSource alreadylocked_generationCountFromListOfSources:count:] + 232
9 CoreFoundation 0x00000001843e5e6c -[CFPrefsSearchListSource alreadylocked_getDictionary:] + 468
10 CoreFoundation 0x00000001843e59f0 -[CFPrefsSearchListSource alreadylocked_copyValueForKey:] + 172
11 CoreFoundation 0x00000001843e5924 -[CFPrefsSource copyValueForKey:] + 52
12 CoreFoundation 0x00000001843e58d8 __76-[_CFXPreferences copyAppValueForKey:identifier:container:configurationURL:]_block_invoke + 32
13 CoreFoundation 0x00000001843ddf8c __108-[_CFXPreferences(SearchListAdditions) withSearchListForIdentifier:container:cloudConfigurationURL:perform:]_block_invoke + 376
14 CoreFoundation 0x0000000184558764 -[_CFXPreferences withSearchListForIdentifier:container:cloudConfigurationURL:perform:] + 384
15 CoreFoundation 0x00000001843dd860 -[_CFXPreferences copyAppValueForKey:identifier:container:configurationURL:] + 168
16 CoreFoundation 0x00000001843dd77c _CFPreferencesCopyAppValueWithContainerAndConfiguration + 112
17 SystemConfiguration 0x0000000184fab8ec SCDynamicStoreCopyProxiesWithOptions + 180
18 _scproxy.cpython-310-darwin.so 0x0000000103703aa0 get_proxies + 28
19 Python 0x000000010301cba8 cfunction_vectorcall_NOARGS + 96
20 Python 0x00000001030c4cf8 call_function + 128
21 Python 0x00000001030c2538 _PyEval_EvalFrameDefault + 43144
22 Python 0x00000001030b6a5c _PyEval_Vector + 376
23 Python 0x00000001030c4cf8 call_function + 128
24 Python 0x00000001030c2538 _PyEval_EvalFrameDefault + 43144
25 Python 0x00000001030b6a5c _PyEval_Vector + 376
26 Python 0x00000001030c4cf8 call_function + 128
27 Python 0x00000001030c2538 _PyEval_EvalFrameDefault + 43144
28 Python 0x00000001030b6a5c _PyEval_Vector + 376
29 Python 0x0000000102fcaeac _PyObject_FastCallDictTstate + 96
30 Python 0x0000000103040abc slot_tp_init + 196
31 Python 0x0000000103038a8c type_call + 288
32 Python 0x0000000102fcac44 _PyObject_MakeTpCall + 136
33 Python 0x00000001030c4d88 call_function + 272
34 Python 0x00000001030c2538 _PyEval_EvalFrameDefault + 43144
35 Python 0x00000001030b6a5c _PyEval_Vector + 376
36 Python 0x00000001030c4cf8 call_function + 128
37 Python 0x00000001030c2538 _PyEval_EvalFrameDefault + 43144
38 Python 0x00000001030b6a5c _PyEval_Vector + 376
39 Python 0x00000001030c4cf8 call_function + 128
40 Python 0x00000001030c25c0 _PyEval_EvalFrameDefault + 43280
41 Python 0x00000001030b6a5c _PyEval_Vector + 376
42 Python 0x0000000102fcdeb0 method_vectorcall + 124
43 Python 0x00000001030c4cf8 call_function + 128
44 Python 0x00000001030c25c0 _PyEval_EvalFrameDefault + 43280
45 Python 0x00000001030b6a5c _PyEval_Vector + 376
46 Python 0x0000000102fcdeb0 method_vectorcall + 124
47 Python 0x00000001030c4cf8 call_function + 128
48 Python 0x00000001030c25c0 _PyEval_EvalFrameDefault + 43280
49 Python 0x00000001030b6a5c _PyEval_Vector + 376
50 Python 0x0000000102fcdeb0 method_vectorcall + 124
51 Python 0x00000001030c4cf8 call_function + 128
52 Python 0x00000001030c25c0 _PyEval_EvalFrameDefault + 43280
53 Python 0x00000001030b6a5c _PyEval_Vector + 376
54 Python 0x0000000102fcdeb0 method_vectorcall + 124
55 Python 0x00000001030c4cf8 call_function + 128
56 Python 0x00000001030c25c0 _PyEval_EvalFrameDefault + 43280
57 Python 0x00000001030b6a5c _PyEval_Vector + 376
58 Python 0x0000000102fcdeb0 method_vectorcall + 124
59 Python 0x00000001030c4cf8 call_function + 128
60 Python 0x00000001030c25c0 _PyEval_EvalFrameDefault + 43280
61 Python 0x00000001030b6a5c _PyEval_Vector + 376
62 Python 0x00000001030c4cf8 call_function + 128
63 Python 0x00000001030c2510 _PyEval_EvalFrameDefault + 43104
*** end of backtrace ***
After an initial look, a pattern seemed to emerge: the RSS memory right before each segfault was 22 MB. This led me to believe that the process can't grow beyond that limit for some reason. For now, I've added an extra process so there are always at least 3 in the mix and set reload-on-rss to 22 to see if this helps. This means processes get restarted very frequently, which isn't ideal; however, it may be just the workaround we need for now, and we can productionize it more down the road. Time will tell if this actually did the trick.
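For reference, the workaround described above might look something like this in a uwsgi.ini (a sketch; the actual Harvey config may differ, and the values are taken from the description above):

```ini
; Workaround sketch: keep at least 3 workers in the mix and recycle any
; worker whose RSS grows past ~22 MB (the size observed right before
; each segfault). reload-on-rss takes a value in megabytes.
[uwsgi]
processes = 3
reload-on-rss = 22
```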
I was able to find the solution to the problem; it appears to be specific to macOS, which is where I'm running uWSGI. Per https://bugs.python.org/issue30385 and https://github.com/unbit/uwsgi/issues/1722, it was suggested to add os.environ["no_proxy"] = "*"
to the app. This removes the reliance on the macOS CFPreferences machinery,
which is ultimately what was causing the segfault. This may not be a perfect solution; however, it works for my use case: the app has now run for 5 days uninterrupted, when previously it couldn't make it 24 hours without segfaulting, which shows promise, and I have no need for a proxy.
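Concretely, the fix is a one-liner that has to run before anything triggers proxy autodetection. A minimal sketch of where it fits in an app's entry point (the surrounding app code is illustrative, not Harvey's actual module layout):

```python
import os

# Setting no_proxy to "*" before any HTTP-making code runs prevents
# urllib's proxy autodetection (_scproxy on macOS) from calling into
# SCDynamicStoreCopyProxiesWithOptions / CFPreferences, which is not
# fork-safe and is what segfaulted the forked uWSGI workers above.
os.environ["no_proxy"] = "*"

# ...the rest of the app's imports and startup would follow here,
# e.g. creating the WSGI app object that uWSGI serves.
```

Note that the environment variable must be set before the first request is made (ideally at the very top of the module uWSGI loads), since the proxy lookup happens lazily on first use.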
There is some race condition where deployments get stuck after pulling in changes but before deploying. Subprocess operations have a timeout set on them, but it appears the timeout is either not taking effect or not exiting correctly afterward. I have multiple deployments that started but never finished and never errored out; they are just stuck in "in-progress".
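For a deployment not to hang forever, the timeout has to both fire and be handled so the deployment gets marked as failed. A hypothetical sketch of the pattern (run_step and the command are illustrative, not Harvey's actual code):

```python
import subprocess

def run_step(command, timeout_seconds=30):
    """Run one deployment step, always returning a terminal outcome.

    If the subprocess exceeds its timeout, subprocess.run raises
    TimeoutExpired (after killing the child). If that exception is
    swallowed without updating the deployment's status, the deployment
    stays "in-progress" forever, which matches the stuck behavior above.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
        return ("succeeded" if completed.returncode == 0 else "failed", completed.stdout)
    except subprocess.TimeoutExpired:
        # The critical part: translate the timeout into a terminal state
        # instead of letting the deployment hang in "in-progress".
        return ("failed", "timed out")

status, output = run_step(["echo", "pulled changes"], timeout_seconds=5)
```

The key design point is that every code path out of the subprocess call, including the timeout path, must resolve to a terminal status.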