Closed demetris-manikas closed 1 month ago
The implementation for the getEvent
is not clear at all to me since it is called from two places.
The one from workflow
is straight forward but how should the dbos_executor
generate the timeoutFunctionID ?
Suggestions?
Also I don't see what changes the debug_workflow
needs. There is no timeout actually in use there
The implementation for the
getEvent
is not clear at all to me since it is called from two places. The one fromworkflow
is straight forward but how should thedbos_executor
generate the timeoutFunctionID ? Suggestions?
We only need to make sure the sleep is durable for getEvent
if it's called from workflow
. When it's called from dbos_executor
the function ID is always undefined
and nothing is persisted in the database. So if the implementation would be similar to recv
, but will fall back to use a normal non-durable sleepms
if the function ID is undefined.
Also I don't see what changes the
debug_workflow
needs. There is no timeout actually in use there
In debug_workflow
, we still need to increment the function ID twice. Otherwise, downstream functions in debug mode will have their function IDs offset by 1.
While trying to implement the proposed solution to getEvent
the system breaks from the generation of the extra functionID.
Any ideas on why is it so?
While trying to implement the proposed solution to
getEvent
the system breaks from the generation of the extra functionID. Any ideas on why is it so?
Can you provide more information on this?
I think it's also better to put the getEvent
implementation in a separate PR.
OK about the getEvent. This should be ready then. One last question. Reading the code I have the feeling that if the operation times out then no database access should occur. Is that so? If yes it could be implemented by rejecting the Promise after the sleep call with a custom OperationTimedOutError and by adding a catch in the Promice.race to return null if the error is an OperationTimedOutError instead of hitting the db.
OK about the getEvent. This should be ready then. One last question. Reading the code I have the feeling that if the operation times out then no database access should occur. Is that so? If yes it could be implemented by rejecting the Promise after the sleep call with a custom OperationTimedOutError and by adding a catch in the Promice.race to return null if the error is an OperationTimedOutError instead of hitting the db.
If the operation times out we will still need to record the output as null
. The reason is that we want to make sure the behavior is deterministic when we retry or debug workflows. If the first execution returned a null
then the following retries will see the recorded output in the database and skip it.
Changes of getEvent
reverted.
The output can still be recorded as null avoiding execution of the following query when the operation has timeout.
const finalRecvRows = (await client.query<notifications>(
`WITH oldest_entry AS (
SELECT destination_uuid, topic, message, created_at_epoch_ms
FROM ${DBOSExecutor.systemDBSchemaName}.notifications
WHERE destination_uuid = $1
AND topic = $2
ORDER BY created_at_epoch_ms ASC
LIMIT 1
)
DELETE FROM ${DBOSExecutor.systemDBSchemaName}.notifications
USING oldest_entry
WHERE notifications.destination_uuid = oldest_entry.destination_uuid
AND notifications.topic = oldest_entry.topic
AND notifications.created_at_epoch_ms = oldest_entry.created_at_epoch_ms
RETURNING notifications.*;`,
[workflowUUID, topic])).rows;
but I guess it is not big of a deal. If you don't wish to implement it for any reason no problem on my side. It is a minor improvement and maybe not worth of the extra complexity in the code. Thanks a lot for your guidance.
I will add the tests during the next days. It's getting late here..
Changes of
getEvent
reverted. The output can still be recorded as null avoiding execution of the following query when the operation has timeout. but I guess it is not big of a deal. If you don't wish to implement it for any reason no problem on my side. It is a minor improvement and maybe not worth of the extra complexity in the code. Thanks a lot for your guidance.
I see what you mean, but I think let's not add the extra complexity here.
Test added. Also changed the expected values.
1500 is arbitrary.
2000 is the sleep value passed so test against it.
Finally a 1000ms for something that should return almost immediately seems too much so I changed it to 200 which in turn is again big (on my PC tests pass even with a value of 10ms) so maybe 50 would be a proper value.
This PR aims to address #451. I only implemented durable timeout in the
recv
method for now so as to get your feedback before implementing the functionality in thegetEvent
method.