ballerina-platform / ballerina-lang

The Ballerina Programming Language
https://ballerina.io/
Apache License 2.0

[Task]: Extending Ballerina's Transaction Support to Include Transaction Recovery #42031

Open dsplayerX opened 10 months ago

dsplayerX commented 10 months ago

Description

Ballerina has no native support for recovery in distributed transactions. Recovery is currently available only for database transactions, through the Atomikos library's transaction manager; transactional microservices and other XA resources are not covered. The goal of this task is to extend Ballerina's transaction support with native recovery for distributed transactions, following the XA specification, which removes the need for the Atomikos library. Recovery is meant to mitigate the risks of network failures, resource manager issues, and application errors, and so preserve data consistency, fault tolerance, and overall application reliability in distributed transactions.

Describe your task(s)

[Phase 1] Recovery for Direct XA Resource Transactions

[Phase 2] Coordinator-Participant Recovery

Related area

-> Compilation

Related issue(s) (optional)

No response

Suggested label(s) (optional)

No response

Suggested assignee(s) (optional)

No response

dsplayerX commented 10 months ago

Changes and New Additions

Recovery Pass

Update 23/01/2024

dsplayerX commented 9 months ago

Recovery Process

The recovery process starts by retrieving unfinished transactions from each XAResource using xa_recover(). This returns a list of XIDs (transaction identifiers) for transactions that were prepared on that specific resource but never completed. For each XID, we look up the corresponding log record to find the decision (commit/abort) that the coordinator previously made for that transaction, and then commit or abort the transaction on the resource accordingly. If a resource reports a mixed or hazard outcome, the user is warned, and those transactions have to be handled manually.
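The recovery pass described above can be sketched in Java against the standard `javax.transaction.xa` interfaces, where `XAResource.recover` is the counterpart of xa_recover(). This is only a minimal sketch, not the Ballerina runtime's implementation: the class `RecoveryPassSketch`, the `Decision` enum, the `logKey` helper, and the in-memory `decisionLog` map are all hypothetical names introduced for illustration. It also assumes that log records are keyed by the global transaction ID, that a single `recover` call returns the resource's whole list, and that a transaction with no log record is left untouched and reported.

```java
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Map;

// Illustrative only: all names here are hypothetical, not Ballerina runtime APIs.
class RecoveryPassSketch {

    enum Decision { COMMIT, ABORT }

    // Assumption: log records are keyed by the Base64-encoded global transaction ID.
    static String logKey(Xid xid) {
        return Base64.getEncoder().encodeToString(xid.getGlobalTransactionId());
    }

    // Runs one recovery pass against a single resource. Returns the log keys of
    // transactions that could not be resolved automatically.
    static List<String> recover(XAResource resource, Map<String, Decision> decisionLog)
            throws XAException {
        List<String> unresolved = new ArrayList<>();
        // Passing both flags starts and ends the scan in one call (assumes the
        // resource returns its whole list at once).
        Xid[] inDoubt = resource.recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
        if (inDoubt == null) {
            return unresolved;
        }
        for (Xid xid : inDoubt) {
            String key = logKey(xid);
            Decision decision = decisionLog.get(key);
            if (decision == null) {
                // Assumption: with no logged decision, leave the branch alone and report it.
                unresolved.add(key);
                continue;
            }
            try {
                if (decision == Decision.COMMIT) {
                    resource.commit(xid, false); // false: two-phase commit of a prepared branch
                } else {
                    resource.rollback(xid);
                }
            } catch (XAException e) {
                if (e.errorCode == XAException.XA_HEURMIX
                        || e.errorCode == XAException.XA_HEURHAZ) {
                    unresolved.add(key); // mixed/hazard outcome: needs manual handling
                } else {
                    throw e;
                }
            }
        }
        return unresolved;
    }
}
```

A caller would run `RecoveryPassSketch.recover(resource, decisionLog)` once per registered resource and surface the returned keys to the user as warnings.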

Update 11/02/2024

dsplayerX commented 9 months ago

As discussed, retrieving prepared transactions from the database and matching them with corresponding log records to act based on the coordinator's decision was deemed unnecessary overhead.

Instead, we'll broadcast the coordinator's decision (commit/abort) to all resources. Resources with no active or failed transaction for that XID will respond with XAER_INVAL or XAER_NOTA, indicating that the XID is no longer known to the resource and that the transaction has already terminated through a concurrent commit or rollback. This approach streamlines the process and avoids unnecessary calls.
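A rough Java sketch of this broadcast approach, again using the standard `javax.transaction.xa` types, might look like the following. The class `DecisionBroadcastSketch`, the `Outcome` enum, and the `broadcast` method are hypothetical names used only for illustration and do not reflect the actual implementation. The sketch also assumes that any error other than XAER_NOTA or XAER_INVAL should be flagged for follow-up rather than retried.

```java
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: all names here are hypothetical, not Ballerina runtime APIs.
class DecisionBroadcastSketch {

    enum Outcome { APPLIED, ALREADY_TERMINATED, NEEDS_ATTENTION }

    // Sends the logged decision for one XID to every resource and records,
    // in the same order as the input list, how each resource responded.
    static List<Outcome> broadcast(Xid xid, boolean commit, List<XAResource> resources) {
        List<Outcome> outcomes = new ArrayList<>();
        for (XAResource resource : resources) {
            try {
                if (commit) {
                    resource.commit(xid, false); // false: two-phase commit of a prepared branch
                } else {
                    resource.rollback(xid);
                }
                outcomes.add(Outcome.APPLIED);
            } catch (XAException e) {
                // XAER_NOTA / XAER_INVAL: the resource no longer knows this XID,
                // so the branch is treated as already completed.
                boolean unknownXid = e.errorCode == XAException.XAER_NOTA
                        || e.errorCode == XAException.XAER_INVAL;
                outcomes.add(unknownXid ? Outcome.ALREADY_TERMINATED : Outcome.NEEDS_ATTENTION);
            }
        }
        return outcomes;
    }
}
```

Compared with the scan-based pass, this variant never calls `recover` on a resource; it relies entirely on the coordinator's log to decide which XIDs to replay.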

Update 12/02/2024