geosolutions-it / C195-azure-workspace

Investigate CKAN freezes #15

Closed etj closed 2 years ago

etj commented 3 years ago

Sometimes CKAN becomes unresponsive. A monitoring/restart script has already been implemented, but we need to find out how to prevent this freeze.

In the dockerized environment we have already installed gdb and python3-dbg.

The container includes a minimal gdb script to help with this issue:

root@d159dcf85475:/# cat gdb.commands
bt
py-bt
py-list
thread apply all py-list
quit

A typical pstree looks like this:

root@d159dcf85475:/# pstree -p
ckan-run.sh(1)---ckan(25)---python3(38)-+-{python3}(43)
                                        |-{python3}(44)
                                        |-{python3}(45)
                                        |-{python3}(46)
                                        |-{python3}(47)
                                        |-{python3}(48)
                                        |-{python3}(49)
                                        |-{python3}(50)
                                        |-{python3}(51)
                                        |-{python3}(52)
                                        |-{python3}(53)
                                        |-{python3}(54)
                                        |-{python3}(55)
                                        `-{python3}(56)
root@d159dcf85475:/#

All the threads belong to the main python3 process, so they can all be dumped by attaching to its PID (38 in this case).

A sample dump can be created using the aforementioned gdb script with:

gdb -x gdb.commands /usr/lib/ckan/venv/bin/python3 38  > /var/lib/ckan/20210322_gdb_ckan.txt

The gdb script can be improved if needed.
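
The manual command above could be wrapped in a small helper so that every dump gets a dated file name. This is only a sketch under a few assumptions: the function names `dump_path` and `dump_ckan` are hypothetical, the output directory defaults to the `/var/lib/ckan` path used above, and gdb's `-batch` option is added so that it runs non-interactively and exits once the commands are done.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the existing image.

# Build the output path for a dump: <dir>/<YYYYMMDD>_gdb_ckan_<pid>.txt
dump_path() {
    pid="$1"
    dir="${2:-/var/lib/ckan}"
    printf '%s/%s_gdb_ckan_%s.txt\n' "$dir" "$(date +%Y%m%d)" "$pid"
}

# Attach gdb to the given PID, run gdb.commands and save the output.
# Prints the path of the file that was written.
dump_ckan() {
    pid="$1"
    out="$(dump_path "$pid" "${2:-/var/lib/ckan}")"
    gdb -batch -x gdb.commands /usr/lib/ckan/venv/bin/python3 "$pid" > "$out" 2>&1
    printf '%s\n' "$out"
}
```

With these functions sourced in the container, `dump_ckan 38` would correspond to the command shown above, assuming it is run from the directory that contains `gdb.commands`.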

etj commented 3 years ago

As soon as a CKAN instance has been relaunched, a command like the following is run by hand to set up and allocate as many DB connections as possible in the pool:

for i in $(seq 1 30); do
  curl http://ckan-vm.westeurope.cloudapp.azure.com:5000/ > /dev/null &
  curl http://ckan-vm.westeurope.cloudapp.azure.com:5000/en/dataset/ > /dev/null &
  curl http://ckan-vm.westeurope.cloudapp.azure.com:5000/en/organization/ > /dev/null &
done
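
The same warm-up could be kept as a reusable function rather than typed by hand. The following is a rough sketch, not something already present in the setup: `warmup_urls` and `warm_up` are hypothetical names, and the final `wait` is an addition so that the call returns only after all the background requests have completed.

```shell
#!/bin/sh
# Hypothetical helper: print the three pages requested by the warm-up loop.
warmup_urls() {
    base="${1%/}"          # drop a trailing slash, if any
    printf '%s/\n' "$base"
    printf '%s/en/dataset/\n' "$base"
    printf '%s/en/organization/\n' "$base"
}

# Fire <rounds> parallel requests per page (default 30), then wait for them.
warm_up() {
    base="$1"
    rounds="${2:-30}"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for url in $(warmup_urls "$base"); do
            curl -s -o /dev/null "$url" &
        done
        i=$((i + 1))
    done
    wait    # block until every background curl has finished
}
```

For the instance above this would be invoked as `warm_up http://ckan-vm.westeurope.cloudapp.azure.com:5000`.
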
etj commented 3 years ago

Here is a thread dump of a properly running CKAN: gdb_ckan_38_running_after90.log

And here is a quick recap of the lines where each thread is waiting:

cat gdb_ckan_38_running_after90.log | grep '^>' | sort
>291                    self._sleep(self.interval)
>291                    self._sleep(self.interval)
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>300                        gotit = waiter.acquire(True, timeout)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>415                fd_event_list = self._selector.poll(timeout)
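
Since several threads wait on the same line, counting the duplicates makes the recap easier to compare across dumps. A minimal sketch, assuming the dump marks the current line of each thread with a leading `>` as in the listing above; `waiting_summary` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count how many threads are stopped on each line.
# Runs of whitespace are collapsed so that identical lines group together.
waiting_summary() {
    grep '^>' "$1" \
        | sed 's/[[:space:]]\{1,\}/ /g' \
        | sort \
        | uniq -c \
        | sed 's/^ *//'     # strip the padding uniq puts before the count
}
```

Each output line has the form `<count> ><line number> <source line>`, for example `waiting_summary gdb_ckan_38_running_after90.log`.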

Here is a dump of a frozen CKAN (20210322_gdb_ckan.txt):

>229            if not set(mode) <= {"r", "w", "b"}:
>291                    self._sleep(self.interval)
>291                    self._sleep(self.interval)
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>300                        gotit = waiter.acquire(True, timeout)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)

Another frozen dump (gdb_ckan_30.txt):

>291                    self._sleep(self.interval)
>291                    self._sleep(self.interval)
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>296                    waiter.acquire()
>300                        gotit = waiter.acquire(True, timeout)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>314                    event_buffer = os.read(self._inotify_fd, event_buffer_size)
>589                    return self._sock.recv_into(b)
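
In the listings above the running dump and the frozen dumps differ by only one line each (`>415` in the running one, `>229` and `>589` in the frozen ones). To spot such differences without comparing by eye, something like the following could be used. It is a sketch only: `current_lines` and `waiting_diff` are hypothetical names, and duplicates are dropped, so a change in the number of threads on the same line would not show up.

```shell
#!/bin/sh
# Hypothetical helper: normalised, de-duplicated current lines of a dump.
current_lines() {
    grep '^>' "$1" | sed 's/[[:space:]]\{1,\}/ /g' | LC_ALL=C sort -u
}

# Print the lines that appear in only one of the two dumps.
# $1 = dump of a running CKAN, $2 = dump of a frozen CKAN
waiting_diff() {
    a="$(mktemp)"
    b="$(mktemp)"
    current_lines "$1" > "$a"
    current_lines "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/running only: /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/frozen only: /'
    rm -f "$a" "$b"
}
```

Usage would be `waiting_diff gdb_ckan_38_running_after90.log gdb_ckan_30.txt`.
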
randomorder commented 2 years ago

Can we close this one? Let's check that it's monitored.

randomorder commented 2 years ago

Monitoring in progress https://github.com/geosolutions-it/DevOps/issues/819

I think we can close the issue @etj