dbos-inc / dbos-transact-ts

The TypeScript framework for backends that scale
https://docs.dbos.dev
MIT License
346 stars 22 forks source link

implement durable timeout in system database recv #552

Closed demetris-manikas closed 1 month ago

demetris-manikas commented 1 month ago

This PR aims to address #451. I only implemented durable timeout in the recv method for now so as to get your feedback before implementing the functionality in the getEvent method.

demetris-manikas commented 1 month ago

The implementation for the getEvent is not clear at all to me since it is called from two places. The one from workflow is straight forward but how should the dbos_executor generate the timeoutFunctionID ? Suggestions?

demetris-manikas commented 1 month ago

Also I don't see what changes the debug_workflow needs. There is no timeout actually in use there

qianl15 commented 1 month ago

The implementation for the getEvent is not clear at all to me since it is called from two places. The one from workflow is straight forward but how should the dbos_executor generate the timeoutFunctionID ? Suggestions?

We only need to make sure the sleep is durable for getEvent if it's called from workflow. When it's called from dbos_executor the function ID is always undefined and nothing is persisted in the database. So if the implementation would be similar to recv, but will fall back to use a normal non-durable sleepms if the function ID is undefined.

qianl15 commented 1 month ago

Also I don't see what changes the debug_workflow needs. There is no timeout actually in use there

In debug_workflow, we still need to increment the function ID twice. Otherwise, downstream functions in debug mode will have their function IDs offset by 1.

demetris-manikas commented 1 month ago

While trying to implement the proposed solution to getEvent the system breaks from the generation of the extra functionID. Any ideas on why is it so?

qianl15 commented 1 month ago

While trying to implement the proposed solution to getEvent the system breaks from the generation of the extra functionID. Any ideas on why is it so?

Can you provide more information on this? I think it's also better to put the getEvent implementation in a separate PR.

demetris-manikas commented 1 month ago

OK about the getEvent. This should be ready then. One last question. Reading the code I have the feeling that if the operation times out then no database access should occur. Is that so? If yes it could be implemented by rejecting the Promise after the sleep call with a custom OperationTimedOutError and by adding a catch in the Promice.race to return null if the error is an OperationTimedOutError instead of hitting the db.

qianl15 commented 1 month ago

OK about the getEvent. This should be ready then. One last question. Reading the code I have the feeling that if the operation times out then no database access should occur. Is that so? If yes it could be implemented by rejecting the Promise after the sleep call with a custom OperationTimedOutError and by adding a catch in the Promice.race to return null if the error is an OperationTimedOutError instead of hitting the db.

If the operation times out we will still need to record the output as null. The reason is that we want to make sure the behavior is deterministic when we retry or debug workflows. If the first execution returned a null then the following retries will see the recorded output in the database and skip it.

demetris-manikas commented 1 month ago

Changes of getEvent reverted. The output can still be recorded as null avoiding execution of the following query when the operation has timeout.

const finalRecvRows = (await client.query<notifications>(
      `WITH oldest_entry AS (
        SELECT destination_uuid, topic, message, created_at_epoch_ms
        FROM ${DBOSExecutor.systemDBSchemaName}.notifications
        WHERE destination_uuid = $1
          AND topic = $2
        ORDER BY created_at_epoch_ms ASC
        LIMIT 1
      )

      DELETE FROM ${DBOSExecutor.systemDBSchemaName}.notifications
      USING oldest_entry
      WHERE notifications.destination_uuid = oldest_entry.destination_uuid
        AND notifications.topic = oldest_entry.topic
        AND notifications.created_at_epoch_ms = oldest_entry.created_at_epoch_ms
      RETURNING notifications.*;`,
      [workflowUUID, topic])).rows;

but I guess it is not big of a deal. If you don't wish to implement it for any reason no problem on my side. It is a minor improvement and maybe not worth of the extra complexity in the code. Thanks a lot for your guidance.

demetris-manikas commented 1 month ago

I will add the tests during the next days. It's getting late here..

qianl15 commented 1 month ago

Changes of getEvent reverted. The output can still be recorded as null avoiding execution of the following query when the operation has timeout. but I guess it is not big of a deal. If you don't wish to implement it for any reason no problem on my side. It is a minor improvement and maybe not worth of the extra complexity in the code. Thanks a lot for your guidance.

I see what you mean, but I think let's not add the extra complexity here.

demetris-manikas commented 1 month ago

Test added. Also changed the expected values.

1500 is arbitrary.

2000 is the sleep value passed so test against it.

Finally a 1000ms for something that should return almost immediately seems too much so I changed it to 200 which in turn is again big (on my PC tests pass even with a value of 10ms) so maybe 50 would be a proper value.