A process P1 in gds_rundown() gets the ftok semaphore and then the access semaphore, and can then
decide to do a wcs_flu(), which grabs crit. An online freeze process P2 (MUPIP FREEZE -ON
-NOAUTORELEASE) can sneak in concurrently and freeze the database file just before P1 gets crit.
In that case, P1 sleep-loops indefinitely waiting for the database to unfreeze
(WAIT_FOR_REGION_TO_UNCHILL macro in wcs_flu()), and any MUPIP FREEZE -OFF command (which would
clear the online freeze) also hangs, waiting for the ftok semaphore that P1 holds. The result is
a deadlock.
This is now fixed by checking, after grabbing crit in wcs_flu(), whether the database is frozen
online. If it is, and the caller of wcs_flu() is gds_rundown() (indicated by WCSFLU_RET_IF_OFRZ),
wcs_flu() does not flush the db. Instead it does a jnl flush (which at least flushes the journal
updates this process did) and returns to gds_rundown(), which proceeds with halting this process.
That releases the ftok lock, which lets the MUPIP FREEZE -OFF command proceed, thereby fixing
the deadlock.
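
The control flow of the fix can be sketched roughly as below. This is a simplified illustration,
not the actual wcs_flu() code: region_sketch, flu_status, wcs_flu_sketch() and the numeric value
of WCSFLU_RET_IF_OFRZ are hypothetical stand-ins, treating WCSFLU_RET_IF_OFRZ as an option bit is
an assumption, and the indefinite wait that other callers would do is represented by a return
code instead of a sleep loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* WCSFLU_RET_IF_OFRZ is named in the commit message; treating it as an
 * option bit with this value is an assumption made for the sketch. */
#define WCSFLU_NONE        0x0u
#define WCSFLU_RET_IF_OFRZ 0x1u

/* Hypothetical stand-in for the per-region shared state. */
typedef struct {
    bool crit_held;      /* does this process currently hold crit? */
    bool frozen_online;  /* set by MUPIP FREEZE -ON -NOAUTORELEASE */
    int  db_flushes;     /* number of database flushes performed */
    int  jnl_flushes;    /* number of journal flushes performed */
} region_sketch;

typedef enum {
    FLU_DONE,            /* database was flushed normally */
    FLU_SKIPPED_FROZEN,  /* frozen online: journal flushed only, returned early */
    FLU_WOULD_WAIT       /* frozen online: a real caller would wait for unfreeze */
} flu_status;

flu_status wcs_flu_sketch(region_sketch *reg, unsigned int options)
{
    assert(reg != NULL);
    reg->crit_held = true;                    /* grab crit */
    if (reg->frozen_online) {
        reg->crit_held = false;               /* release crit on both frozen paths */
        if (options & WCSFLU_RET_IF_OFRZ) {
            /* Caller is the rundown path: skip the db flush, flush the
             * journal only, and return so the caller can go on to release
             * its ftok lock. */
            reg->jnl_flushes++;
            return FLU_SKIPPED_FROZEN;
        }
        /* Other callers: this is where the unfreeze wait would happen. */
        return FLU_WOULD_WAIT;
    }
    reg->db_flushes++;                        /* normal case: flush the database */
    reg->crit_held = false;                   /* release crit */
    return FLU_DONE;
}
```

In this sketch a rundown-style caller would pass WCSFLU_RET_IF_OFRZ and treat FLU_SKIPPED_FROZEN
as permission to continue halting, which is the step that lets MUPIP FREEZE -OFF get the ftok
semaphore.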