Thanks for reaching out, Antoine.
Do you see any crashes or errors in the logs?
QuickJS can build indexes a bit faster, but that could mean putting more pressure on the disk or other resources. Do you see increased CPU, disk I/O (perhaps an I/O quota being hit?), or memory usage during that time?
If your design documents or your documents are large, it's sometimes possible to hit the maximum stack size in the JS engine. For QuickJS that can be adjusted with memory_limit_bytes, though in that case I'd expect to see repeated crashes in the logs with a segfault or a memory limit error.
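For reference, a minimal sketch of what bumping that setting might look like in local.ini — I'm assuming it lives under a [quickjs] section (check the docs for your exact version), and the value here is purely illustrative:

```ini
; local.ini — illustrative value only
[quickjs]
; raise the engine memory limit for large docs / design docs (512 MiB here)
memory_limit_bytes = 536870912
```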
Before killing the process, try to gather some stats if you can remsh in:
erlang:process_info(Pid).                        % general info about the stuck process
recon:proc_window(reductions, 3, 5000).          % top 3 processes by work done over a 5s window
recon:proc_window(message_queue_len, 3, 5000).   % top 3 by message queue length over a 5s window
recon:proc_window(memory, 3, 5000).              % top 3 by memory usage over a 5s window
Another option: if you see that any node has reached os_process_limit with the number of couchjs processes, you could consider increasing that a bit.
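That limit lives in the [query_server_config] section; a sketch, with an illustrative value (the default is 100):

```ini
; local.ini
[query_server_config]
; max number of couchjs OS processes per node
os_process_limit = 200
```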
Description
On a cluster (3.4.2) of 6 nodes with a fairly large number of databases (~1500, each between 1 GB and 150 GB), we have recently added or modified about 20 design docs per database (roughly 30,000 design docs in total). We have set up Ken with a concurrency of 5 to drive the indexing.
About every 10 minutes, we see one indexer process that stops being updated. It basically stays stuck forever (we let a few linger for 24 hours). Killing all couchjs_mainjs processes has no impact on the stuck indexer. The only way to get rid of it is to issue exit(Pid, kill). in remsh, where Pid is the pid field from /_active_tasks, not indexer_pid.
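For clarity, the kill we issue looks like this in remsh (the pid string below is a placeholder; the real one is copied from the pid field of GET /_active_tasks):

```erlang
%% remsh on the node that owns the stuck indexer
Pid = list_to_pid("<0.1234.0>").  % placeholder pid from /_active_tasks
exit(Pid, kill).
```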
Steps to Reproduce
Create a lot of databases with a lot of data, add a few design documents per database, and start Ken.
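A rough sketch of the setup step, with placeholder names, credentials, and a trivial view (the real databases are large and the design docs more complex):

```sh
# create databases and put a minimal design doc in each (placeholders throughout)
for i in $(seq 1 1500); do
  curl -s -X PUT "http://admin:pass@localhost:5984/db_$i"
  curl -s -X PUT "http://admin:pass@localhost:5984/db_$i/_design/d1" \
       -H 'Content-Type: application/json' \
       -d '{"views":{"v1":{"map":"function(doc){ emit(doc._id, null); }"}}}'
done
```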
Expected Behaviour
The indexers shouldn't get stuck.
Your Environment
Additional Context
We can give you access to the infrastructure that causes the problem to happen if needed.