Closed jblom closed 2 years ago
related PR (far from finished): https://github.com/beeldengeluid/dane-asr-worker/pull/68/files
Update: the code in the PR has been merged and tested successfully in the following ways:
Now the final and most important test is how it performs on the long items from the "high prio" workflow (i.e. the use-case-100-high-prio-items workflow, which used to fail right at the first item).
Shortly, I will rerun the first item of batch use-case-100-high-prio-items_0 and hopefully we'll see it succeed soon.
Update: I retried one of the items in the use-case-100-high-prio-items_0 batch and left it running overnight, but it still seems to be running. I noticed that the stdout of Kaldi is not logged (it is logged at debug level, while the running code is configured to show info).
I will now redeploy with proper logging and try again with another item. For this run I did not notice any zombie processes or dmesg errors, so that seems fine. Without Kaldi logging I really have no idea why it is running so long.
Update: Ah, I figured it out. It fails because of this RabbitMQ exception (already fixed by @gb-beng in the new RabbitMQ setup):
(406, 'PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out.
Timeout value used: 1800000 ms.
This timeout value can be configured, see consumers doc guide to learn more')
This exception causes the worker to crash, after which k8s restarts it and the same ASR job is started again (now in the same container).
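For reference, the timeout in that message can be read back as minutes. This is a minimal sketch, assuming the exception text has the shape quoted above; `timeout_minutes` is a hypothetical helper and not part of the worker.

```python
import re
from typing import Optional

# Exception text as quoted in this thread, joined onto one line.
MESSAGE = (
    "(406, 'PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. "
    "Timeout value used: 1800000 ms. "
    "This timeout value can be configured, see consumers doc guide to learn more')"
)


def timeout_minutes(message: str) -> Optional[float]:
    """Return the ack timeout in minutes, or None if the text does not match."""
    match = re.search(r"Timeout value used: (\d+) ms", message)
    if match is None:
        return None
    # The message reports milliseconds; 60_000 ms make one minute.
    return int(match.group(1)) / 60_000


print(timeout_minutes(MESSAGE))  # 30.0
```

So a job that keeps its message unacknowledged for much longer than roughly 30 minutes can be expected to hit this error with that setting.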
Will configure the proper timeout in the old cluster and retry
Update: applied the same consumer_timeout setting used in the new k8s cluster via a new configmap for the rabbitmq pod.
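As a rough illustration of the setting: consumer_timeout is expressed in milliseconds. The sketch below renders a line in the rabbitmq.conf key/value style. The helper name and the 12 hour value are assumptions for illustration only; the value actually applied is not stated here, and the configmap may carry the setting in a different form.

```python
def consumer_timeout_line(hours: float) -> str:
    """Render a rabbitmq.conf style consumer_timeout line for a duration in hours."""
    # Hours -> milliseconds, rounded to a whole number of milliseconds.
    milliseconds = round(hours * 60 * 60 * 1000)
    return f"consumer_timeout = {milliseconds}"


# 0.5 hours reproduces the 1800000 ms reported in the exception.
print(consumer_timeout_line(0.5))  # consumer_timeout = 1800000
# Hypothetical larger value, chosen only as an example.
print(consumer_timeout_line(12))   # consumer_timeout = 43200000
```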
Now we wait to see if the same item does finish in time
The new consumer_timeout setting worked fine. This DANE document (with a 4 hour audio file) was processed successfully:
DANE doc ID = da831ecf70383a868ce46de3766bff0e5543868b
FILE = 2102203260336511231__2022032602RA1-RCR3000MVDF.mp3
Now to test the single export of this item, so that it is also available in the MS:
After manually triggering the ASR for the first batch (proc_batch_id = use-case-100-high-prio-items_0) in the high prio workflow, I started the workflow again and the export went fine. In this program you can find the first two carriers of that first batch: https://mediasuite-test.rdlabs.beeldengeluid.nl/tool/resource-viewer?id=2102203260336511231&cid=daan-catalogue-aggr&st=2102203260336511231
Closing
Make sure the async mode implemented in the dane-asr-worker works. This is to avoid timeouts.
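To make the idea concrete, one common shape for an async mode is submit-and-poll: start the long job, then check its status at intervals instead of holding a single long blocking call. The sketch below is generic and is not the dane-asr-worker implementation; `wait_for_job`, the status strings and the `get_status` callable are all hypothetical.

```python
import time
from typing import Callable

# Hypothetical terminal states; the real ASR service may use other names.
TERMINAL_STATES = ("done", "failed")


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    max_wait: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal state or max_wait elapses."""
    deadline = clock() + max_wait
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {max_wait} seconds")
        # Short waits between checks replace one long blocking request.
        sleep(poll_interval)
```

A caller would pass in a function that asks the ASR service for the job status. Note that polling on its own does not change the broker's acknowledgement timeout, so that setting would still have to fit the longest expected job.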