etj closed this issue 2 years ago
As soon as a CKAN instance has been relaunched, a command line of this kind is run by hand in order to set up and allocate as many DB connections as possible in the pool:
for i in $(seq 1 30) ; do curl http://ckan-vm.westeurope.cloudapp.azure.com:5000/ >/dev/null & curl http://ckan-vm.westeurope.cloudapp.azure.com:5000/en/dataset/ > /dev/null & curl http://ckan-vm.westeurope.cloudapp.azure.com:5000/en/organization/ > /dev/null & done
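The same warm-up could also be scripted so that it waits for every request to finish and reports how many failed. This is only a minimal sketch, assuming the same host and paths as the one-liner above; the names `warm_up`, `http_status` and the `fetch` parameter are hypothetical and not part of CKAN or of the existing tooling.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Same host and paths as the curl one-liner.
BASE_URL = "http://ckan-vm.westeurope.cloudapp.azure.com:5000"
PATHS = ["/", "/en/dataset/", "/en/organization/"]


def http_status(url, timeout=30):
    """Fetch a URL, discard the body, and return the HTTP status code."""
    with urlopen(url, timeout=timeout) as response:
        response.read()
        return response.status


def warm_up(base_url=BASE_URL, paths=PATHS, rounds=30, fetch=http_status):
    """Send rounds * len(paths) concurrent requests; return (ok, failed)."""
    urls = [base_url + path for _ in range(rounds) for path in paths]

    def attempt(url):
        try:
            return 200 <= fetch(url) < 400
        except Exception:  # connection errors, timeouts, HTTP error responses
            return False

    # One worker per request so that all requests are in flight together,
    # mirroring the backgrounded curl calls.
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        results = list(pool.map(attempt, urls))
    ok = sum(results)
    return ok, len(results) - ok
```

Calling `warm_up()` with the defaults issues the same 90 requests as the shell loop. Whether this actually fills the DB connection pool depends on the server configuration, exactly as with the curl version.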
Here is a thread dump of a properly running CKAN: gdb_ckan_38_running_after90.log
and a quick recap of the lines where each thread is waiting:
cat gdb_ckan_38_running_after90.log | grep '^>' | sort
>291 self._sleep(self.interval)
>291 self._sleep(self.interval)
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>300 gotit = waiter.acquire(True, timeout)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>415 fd_event_list = self._selector.poll(timeout)
Here is a dump of a frozen CKAN, 20210322_gdb_ckan.txt:
>229 if not set(mode) <= {"r", "w", "b"}:
>291 self._sleep(self.interval)
>291 self._sleep(self.interval)
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>300 gotit = waiter.acquire(True, timeout)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
Another frozen dump, gdb_ckan_30.txt:
>291 self._sleep(self.interval)
>291 self._sleep(self.interval)
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>296 waiter.acquire()
>300 gotit = waiter.acquire(True, timeout)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314 event_buffer = os.read(self._inotify_fd, event_buffer_size)
>589 return self._sock.recv_into(b)
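To make the differences between a running and a frozen dump easier to spot, the `>` lines can be counted and compared instead of read by eye. This is a rough sketch, assuming the dump files are plain text in the format shown above; the helper names `waiting_lines` and `compare` are hypothetical.

```python
from collections import Counter


def waiting_lines(dump_text):
    """Count the lines marked with '>' (where each thread is waiting)."""
    return Counter(
        line[1:].strip()
        for line in dump_text.splitlines()
        if line.startswith(">")
    )


def compare(running_text, frozen_text):
    """Return (only_in_running, only_in_frozen) as Counters.

    Counter subtraction keeps only positive counts, so each result
    lists the waiting lines that are more frequent on that side.
    """
    running = waiting_lines(running_text)
    frozen = waiting_lines(frozen_text)
    return running - frozen, frozen - running
```

Applied to the recaps posted here, the running dump has one thread in `self._selector.poll(timeout)` that does not appear in either frozen dump, while the frozen dumps instead show one thread at `return self._sock.recv_into(b)` (gdb_ckan_30.txt) or at `if not set(mode) <= {"r", "w", "b"}:` (20210322_gdb_ckan.txt). That may be a starting point for further investigation, though the recap alone does not tell which thread is responsible for the freeze.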
Can we close this one? Let's check that it's monitored.
Monitoring in progress https://github.com/geosolutions-it/DevOps/issues/819
I think we can close the issue @etj
Sometimes CKAN becomes unresponsive. A monitoring/restart script has already been implemented, but we need to find out how to prevent this freeze.
In the dockerized environment we have already installed gdb and python3-dbg.
In docker there is a minimal gdb script to help with this issue:
A typical pstree looks like this:
so all the involved threads can be dumped when attaching to the master PID (38 in this case).
A sample dump can be created using the aforementioned gdb script with:
The gdb script can be improved if needed.