aai-institute / lakefs-spec

An fsspec implementation for the lakeFS project.
http://lakefs-spec.org/
Apache License 2.0
37 stars 4 forks source link

Idempotency of demo notebook #189

Closed maxmynter closed 7 months ago

maxmynter commented 7 months ago

Closes https://github.com/aai-institute/lakefs-spec/issues/185 Closes https://github.com/aai-institute/lakefs-spec/issues/191

Replaces https://github.com/aai-institute/lakefs-spec/pull/186 to build on Jupytext integration

codecov[bot] commented 7 months ago

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (bcb7008) 87.36% compared to head (eb5b29c) 88.04%. Report is 2 commits behind head on main.

Files Patch % Lines
src/lakefs_spec/client_helpers.py 87.50% 0 Missing and 1 partial :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #189 +/- ## ========================================== + Coverage 87.36% 88.04% +0.67% ========================================== Files 7 7 Lines 475 485 +10 Branches 67 70 +3 ========================================== + Hits 415 427 +12 + Misses 34 31 -3 - Partials 26 27 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

maxmynter commented 7 months ago

W.r.t. to tests, I am unsure on how to test the list_branches client helper as we basically only return the lakeFS-API response. Plus, if it breaks, it would also break the delete_branch test.

Thoughts on this?

AdrianoKF commented 7 months ago

W.r.t. to tests, I am unsure on how to test the list_branches client helper as we basically only return the lakeFS-API response. Plus, if it breaks, it would also break the delete_branch test.

Thoughts on this?

If you want to make sure we're calling the right lakeFS API with the expected params, you could mock the API client, other than that, not much we can test there.

maxmynter commented 7 months ago

@nicholasjng I am not sure that I understand your Review comment above, so correct my understanding when it is wrong.

But isn't reproducibility given by specifying commit SHAs? And tags are a human understandable aide. E.g. lakeFS gives the example of dev:jane-before-v2.3-merge.

Then it would be the individual data scientists responsibility to choose a unique and suitable tag. Once a tag is assigned, it should not be implicitly reassigned as a side effect to other code execution and only happen deliberately.

nicholasjng commented 7 months ago

But isn't reproducibility given by specifying commit SHAs? And tags are a human understandable aide. E.g. lakeFS gives the example of dev:jane-before-v2.3-merge.

Technically yes, but that is "post-hoc reproducibility": I was wondering why, with our current notebook setup, new commit SHAs are even needed at all. (I get the technical reasoning)

Is there an easy setup using branches that does not create a new commit anywhere? Probably branching off of main again for the second training after 2020 data has been pushed to main, right?

Then it would be the individual data scientists responsibility to choose a unique and suitable tag. Once a tag is assigned, it should not be implicitly reassigned as a side effect to other code execution and only happen deliberately.

Fair.

maxmynter commented 7 months ago

Is there an easy setup using branches that does not create a new commit anywhere? Probably branching off of main again for the second training after 2020 data has been pushed to main, right?

Yes that would work, I think.

What I thought about when writing the initial demo storyline, was to create a scenario where the tags has its most value. That is, imo, that we can tag a commit that is not the most recent on a branch. For the most recent, just the Ref to the branch suffices, whereas you need the commit id for older commits. This is easier done with a tag.

The only way to tag something that is unchanged is to have a single commit on a branch. This way, pushing files does not create a new commit, the tag refers to the old one and we get idempotent notebook execution without a hidden cleanup cell.

Quick aside: I just saw that create_tag takes str | Commit. It could take any Ref, also to a current branch tip. Should we extend this?

nicholasjng commented 7 months ago

Quick aside: I just saw that create_tag takes str | Commit. It could take any Ref, also to a current branch tip. Should we extend this?

It's probably low priority, since I assume that the Ref would almost always be a branch API result (or would you want to tag a tag?), and then you could always use the name directly.