aai-institute / lakefs-spec

An fsspec implementation for the lakeFS project.
http://lakefs-spec.org/
Apache License 2.0

Allow usage of ephemeral branches for transactions #253

Closed AdrianoKF closed 5 months ago

AdrianoKF commented 5 months ago

What is the motivation and/or use case?

The current implementation of transactions provides only a weak atomicity guarantee: it guards against exceptions raised in the suite (body) of the transaction, but not against errors raised by the versioning operations themselves when the context manager exits. This can leave partial results visible in the repository with no easy way to roll them back, even manually, since, for example, commits cannot simply be discarded but must be reverted in a new commit.

Instead, we should make it easier to follow versioning best practices by providing an easy way to perform the versioning operations in a transaction on an ephemeral branch (which is automatically merged upon successful completion of the transaction).

How can we implement this feature?

I built a hacky prototype a while ago that adds a bit of state to the transaction.

Since the fsspec architecture does not allow us to pass constructor arguments to the transaction, I instead made the transaction callable so that it can be parametrized when creating the context manager:

with fs.transaction(repo="...", base_branch="main", automerge=True) as tx:
  # ... any versioning ops ...
  tx.commit(tx.repo, tx.branch, "did something")

Internally, the implementation reuses the versioning ops queue: it prepends the creation of the ephemeral branch as the first operation in __enter__ and appends the merge into the base branch in __exit__.
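
To make the queue mechanics concrete, here is a minimal, self-contained sketch of the idea. It is not the prototype's code: FakeClient, EphemeralTransaction, and the "transaction-" branch naming are hypothetical names invented for illustration, and the fake client only records calls instead of talking to lakeFS.

```python
import uuid
from collections import deque


class FakeClient:
    """Hypothetical stand-in for a lakeFS client that only records calls."""

    def __init__(self):
        self.calls = []

    def create_branch(self, repo, name, source):
        self.calls.append(("create_branch", repo, name, source))

    def commit(self, repo, branch, message):
        self.calls.append(("commit", repo, branch, message))

    def merge(self, repo, source, into):
        self.calls.append(("merge", repo, source, into))


class EphemeralTransaction:
    """Hypothetical callable transaction with a deferred versioning ops queue."""

    def __init__(self, client):
        self.client = client
        self.ops = deque()
        self.repo = self.base_branch = self.branch = None
        self.automerge = False

    def __call__(self, repo, base_branch="main", automerge=True):
        # Parametrize the existing instance, since the constructor is out of reach.
        self.repo, self.base_branch, self.automerge = repo, base_branch, automerge
        return self

    def __enter__(self):
        self.ops.clear()
        if self.repo is not None:  # ephemeral mode is opt-in
            self.branch = f"transaction-{uuid.uuid4().hex[:8]}"
            repo, branch, base = self.repo, self.branch, self.base_branch
            # Branch creation is placed at the front of the queue.
            self.ops.appendleft(lambda: self.client.create_branch(repo, branch, base))
        return self

    def commit(self, repo, branch, message):
        # Deferred: nothing is sent until the context manager exits.
        self.ops.append(lambda: self.client.commit(repo, branch, message))

    def __exit__(self, exc_type, exc, tb):
        try:
            if exc_type is not None:
                self.ops.clear()  # error in the suite: drop all queued ops
                return False
            if self.branch is not None and self.automerge:
                repo, branch, base = self.repo, self.branch, self.base_branch
                # The merge into the base branch is the last queued operation.
                self.ops.append(lambda: self.client.merge(repo, branch, base))
            while self.ops:
                self.ops.popleft()()
            return False
        finally:
            # Reset the parametrization so a later plain `with tx:` is unaffected.
            self.repo = self.base_branch = self.branch = None
            self.automerge = False


client = FakeClient()
tx = EphemeralTransaction(client)
with tx(repo="my-repo", base_branch="main") as t:
    ephemeral = t.branch
    t.commit(t.repo, t.branch, "did something")
```

In this sketch the merge is always the last queued operation, so if an earlier operation raises while the queue is flushed, the exception propagates before the merge runs and the base branch is left untouched.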

Client API considerations

The above design would be fully backward compatible: usage of the ephemeral branches is fully optional, and existing code would not change its behavior.

On the other hand, it can be quite verbose when client code has multiple transactions, since the same options (repo, base branch, ...) need to be passed everywhere. Instead, these could also be turned into constructor arguments on the filesystem itself (and in fact, there already is source_branch, which is currently used for the create_branch_ok functionality, but semantically also fits the merge target for the ephemeral branches).

Having the state stored (in either the transaction or the FS instance) would allow us to simplify the tx versioning ops by making the repo and branch arguments optional. When they are not passed, the operations would simply default to the active branch in the transaction:

with fs.transaction(repo="...", base_branch="main") as tx:
  # ... other versioning ops ...
  tx.commit(message="did something")  # no repo/branch necessary, defaults to the ephemeral branch
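
As a rough illustration of the defaulting behavior, the following sketch resolves repo and branch from state held on the transaction when they are not passed. TxState and its log attribute are hypothetical and stand in for the real transaction object; no lakeFS calls are made.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TxState:
    """Hypothetical holder for the active repo and branch of a transaction."""

    repo: str
    branch: str
    log: list = field(default_factory=list)

    def commit(self, message: str, repo: Optional[str] = None, branch: Optional[str] = None):
        # Fall back to the transaction's own repo/branch when none is given.
        repo = self.repo if repo is None else repo
        branch = self.branch if branch is None else branch
        self.log.append((repo, branch, message))
        return repo, branch


state = TxState(repo="my-repo", branch="transaction-1234")
implicit = state.commit(message="did something")
explicit = state.commit(message="hotfix", branch="main")
```

One design consequence worth noting: once repo and branch are optional, message has to come first or the arguments have to be keyword-only, so existing positional calls of the form commit(repo, branch, message) would need some care to stay compatible.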

To be even more concise, at the cost of some ambiguity, the fs operations that work on rpaths could be modified to accept relative rpaths: basically the same as an ordinary rpath, but without the {repo}/{branch}/ prefix. This would deviate from the current correspondence between rpaths and lakeFS URIs (with or without the scheme) and would create an ambiguity: the (absolute) rpath data/main/1.txt (in a repo data with a branch main) could not be distinguished from a relative rpath with the same name that refers to a file data/main/1.txt in the current repo and branch.
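
The ambiguity can be shown with two small helper functions. Both as_absolute and as_relative are hypothetical and only split strings; they are not part of lakefs-spec.

```python
def as_absolute(rpath):
    """Read the path as {repo}/{branch}/{resource}."""
    repo, branch, resource = rpath.split("/", 2)
    return repo, branch, resource


def as_relative(rpath, current_repo, current_branch):
    """Read the whole path as a resource in the current repo and branch."""
    return current_repo, current_branch, rpath


path = "data/main/1.txt"
absolute = as_absolute(path)
relative = as_relative(path, current_repo="data", current_branch="main")
# Same input string, same repo and branch, but two different resources.
print(absolute)  # ('data', 'main', '1.txt')
print(relative)  # ('data', 'main', 'data/main/1.txt')
```

Nothing in the string itself says which reading is intended, so any scheme that accepts both forms would need an extra signal (for example, a flag or a distinct prefix) to tell them apart.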

Maciej818 commented 5 months ago

This ticket fixes #259.