AcademySoftwareFoundation / rez

An integrated package configuration, build and deployment system for software
https://rez.readthedocs.io
Apache License 2.0

Add ability to cache package payloads synchronously #1679

Closed isohedronpipeline closed 2 months ago

isohedronpipeline commented 5 months ago

Fixes #1379

Carrying on the work from https://github.com/AcademySoftwareFoundation/rez/pull/1452

This adds support for synchronous/blocking package caching during the resolve process.


codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 49.09091% with 28 lines in your changes missing coverage. Please review.

Project coverage is 58.39%. Comparing base (153bd87) to head (355a891). Report is 38 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| src/rez/package_cache.py | 51.21% | 17 Missing and 3 partials :warning: |
| src/rez/cli/env.py | 25.00% | 4 Missing and 2 partials :warning: |
| src/rez/resolved_context.py | 80.00% | 0 Missing and 1 partial :warning: |
| src/rez/rezconfig.py | 0.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1679      +/-   ##
==========================================
+ Coverage   58.03%   58.39%   +0.36%
==========================================
  Files         127      126       -1
  Lines       17069    17205     +136
  Branches     3496     3519      +23
==========================================
+ Hits         9906    10047     +141
+ Misses       6496     6489       -7
- Partials      667      669       +2
```


JeanChristopheMorinPerso commented 5 months ago

I took a deeper look, and as things stand in this PR, it doesn't work as you would expect. If you were to run rez-env maya --pkg-cache-mode async and then rez-env maya --pkg-cache-mode sync, and the initial cache (from the async run) takes long enough, the second rez-env (in sync mode) will skip the maya package. See https://github.com/AcademySoftwareFoundation/rez/blob/main/src/rez/package_cache.py#L210-L219. Basically, the package cache will skip any variant that is marked as in-progress. This would result in a non-localized environment.
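To illustrate the race being described, here is a hypothetical sketch (not the actual rez source; the function and status names are made up for illustration) of how skipping an in-progress variant produces a non-localized environment:

```python
# Hypothetical sketch of the skip logic under discussion (NOT the real
# rez implementation): a variant whose cache copy is still in progress
# is skipped, so a concurrent sync resolve silently falls back to the
# original, non-local payload.
import os

def cache_status(cache_root, variant_name):
    """Return a coarse cache status based on sentinel files."""
    dest = os.path.join(cache_root, variant_name)
    if os.path.exists(dest + ".copying"):
        return "copying"      # another process is mid-copy
    if os.path.isdir(dest):
        return "found"        # payload fully cached
    return "not_cached"

def resolve_payload_root(cache_root, variant_name, package_root):
    # An in-progress copy is skipped rather than waited on, which is
    # exactly why the second (sync) rez-env ends up non-localized.
    if cache_status(cache_root, variant_name) == "found":
        return os.path.join(cache_root, variant_name)
    return package_root  # fall back to the original (non-local) payload
```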

Additionally, I still have to verify, but I think we should really avoid using --daemon when caching synchronously. If we use --daemon, cancelling rez-env --pkg-cache-mode sync would not cancel the caching process. After all, keeping the process alive is the reason to have a daemon mode in the first place. I'd even go further: we should not use a subprocess at all.

We should also consider adding a progress bar or some logs when using the sync mode. I anticipate that users will complain if rez-env maya houdini blocks for multiple minutes without clearly saying what's going on.

isohedronpipeline commented 5 months ago

> I took a deeper look, and as things stand in this PR, it doesn't work as you would expect. If you were to run rez-env maya --pkg-cache-mode async and then rez-env maya --pkg-cache-mode sync, and the initial cache (from the async run) takes long enough, the second rez-env (in sync mode) will skip the maya package. See https://github.com/AcademySoftwareFoundation/rez/blob/main/src/rez/package_cache.py#L210-L219. Basically, the package cache will skip any variant that is marked as in-progress. This would result in a non-localized environment.

I think it would be preferable to call add_variant(force=True) rather than poll until the .copying sentinel file has disappeared, but I'm not sure what to do about the file lock. We really don't want to be copying to the same path at the same time. Perhaps we should modify this logic to copy to an intermediate location and then perform an atomic move once the copy has completed, to avoid errors when multiple cache operations run concurrently. I'm not sure there is a great solution for mixing sync and async caching operations on the same files at the same time.
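The intermediate-location idea could be sketched like this, assuming a POSIX filesystem where os.rename within a single filesystem is atomic (all names here are illustrative, not rez API):

```python
# Sketch of copy-then-atomic-rename: stage the payload in a temp dir on
# the same filesystem as the destination, then rename into place. If two
# processes race, the loser's rename fails and its staging dir is
# discarded, so readers only ever see a complete payload.
import os
import shutil
import tempfile

def cache_variant_atomically(src, dest):
    """Return True if this call installed the payload, False if it was
    already cached or another process won the race."""
    if os.path.isdir(dest):
        return False  # already cached
    # Staging dir beside dest so the rename stays on one filesystem.
    tmp = tempfile.mkdtemp(prefix=".partial-", dir=os.path.dirname(dest))
    staged = os.path.join(tmp, "payload")
    try:
        shutil.copytree(src, staged)
        try:
            os.rename(staged, dest)  # atomic: all-or-nothing visibility
            return True
        except OSError:
            return False  # another process renamed first
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
```

Note that this sidesteps the sentinel-file question for readers (a half-copied payload is never visible at dest), though writers racing each other still duplicate copy work unless a lock is also held.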

> Additionally, I still have to verify, but I think we should really avoid using --daemon when caching synchronously. If we use --daemon, cancelling rez-env --pkg-cache-mode sync would not cancel the caching process. After all, keeping the process alive is the reason to have a daemon mode in the first place. I'd even go further: we should not use a subprocess at all.

Yeah, I'm fine with that. I'll move some of the logic from the cli entry point into package_cache later today or tomorrow.

> We should also consider adding a progress bar or some logs when using the sync mode. I anticipate that users will complain if rez-env maya houdini blocks for multiple minutes without clearly saying what's going on.

We can print something from the threaded _while_copying() function. Since we're using shutil.copytree, I don't think we'll have much in the way of stats about total size, amount transferred, or estimated time to completion without hammering the disk, which I'd like to avoid.
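A minimal sketch of that kind of heartbeat, assuming we only report elapsed time rather than pre-scanning the payload for sizes (the function name is illustrative, not the actual _while_copying() hook):

```python
# Print a periodic "still caching" heartbeat from a side thread while
# shutil.copytree runs. No size stats are gathered, so the copy itself
# is the only disk I/O.
import shutil
import sys
import threading
import time

def copy_with_heartbeat(src, dest, interval=10.0):
    done = threading.Event()

    def _heartbeat():
        elapsed = 0.0
        # Event.wait doubles as both the sleep and the stop signal.
        while not done.wait(interval):
            elapsed += interval
            print("still caching %s (%.0fs elapsed)..." % (src, elapsed),
                  file=sys.stderr)

    t = threading.Thread(target=_heartbeat, daemon=True)
    t.start()
    try:
        shutil.copytree(src, dest)
    finally:
        done.set()
        t.join()
```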

JeanChristopheMorinPerso commented 5 months ago

> I think it would be preferable to call add_variant(force=True) rather than poll until the .copying sentinel file has disappeared, but I'm not sure what to do about the file lock. We really don't want to be copying to the same path at the same time. Perhaps we should modify this logic to copy to an intermediate location and then perform an atomic move once the copy has completed, to avoid errors when multiple cache operations run concurrently. I'm not sure there is a great solution for mixing sync and async caching operations on the same files at the same time.

I'm not too sure why we would use force=True or why you don't want to poll. Can you explain your thinking a bit more, please? Like you say, we really don't want to copy twice or copy the same package multiple times in parallel. The code goes to great lengths to avoid this.

isohedronpipeline commented 5 months ago

> I'm not too sure why we would use force=True or why you don't want to poll. Can you explain your thinking a bit more, please? Like you say, we really don't want to copy twice or copy the same package multiple times in parallel. The code goes to great lengths to avoid this.

You could be right. I am worried that a previous sync process could have failed, leaving the .copying file behind with no further updates being made to it. I would imagine that one use case for caching a package synchronously would be to force all of the caching operations to happen right now, possibly as a way to restart a locked or frozen cache operation.

I see that the .copying file is touched every 0.2 seconds though, so we should be able to check its modification time and continue to wait only while the file is still being updated.
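That stall check could look roughly like this (the threshold and names are illustrative assumptions, not rez's actual values):

```python
# Sketch of the stall check discussed above: the .copying sentinel is
# touched periodically while a copy runs, so a sentinel whose mtime is
# stale indicates the copier died and the copy can be restarted.
import os
import time

STALL_THRESHOLD = 10.0  # seconds; well above the ~0.2s touch interval

def copy_is_stalled(copying_file, now=None):
    """Return True if the sentinel exists but hasn't been touched recently."""
    try:
        mtime = os.path.getmtime(copying_file)
    except OSError:
        return False  # no sentinel: nothing in progress, nothing stalled
    if now is None:
        now = time.time()
    return (now - mtime) > STALL_THRESHOLD
```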

My hesitation to hit the disk repeatedly is maybe misplaced, since the cache is on the local disk. In studio environments I hesitate to hit network drives with something that could be run many times in parallel. But of course you're right: starting multiple large file transfers instead would be madness.

JeanChristopheMorinPerso commented 5 months ago

The package cache class already knows how to handle stalled packages. It should be able to handle the case where a variant is left in progress due to an error or whatever else and restart it. See https://github.com/AcademySoftwareFoundation/rez/blob/dc2c77757748e87d2abf0967924ee0195aa0f11d/src/rez/package_cache.py#L830-L835

As for polling, I agree that polling a shared file system is undesirable, though caching is usually done on local disks. We could maybe use inotify to optimize and simplify the process.

chadrik commented 3 months ago

Hi @JeanChristopheMorinPerso I think all the outstanding issues have been addressed. Any other thoughts?

ameliendeshams commented 3 months ago

Hi @isohedronpipeline,

I'm delighted to see this pull request! We've been waiting for this feature, and it's great to see it implemented.

At Fortiche, we've tested this branch, and it works exactly as expected. The synchronous package caching is a game-changer for our studio environments, where we often need to cache packages quickly and reliably.

We're also happy to see that you've addressed the concerns about progress and logging in sync mode. It's essential to provide users with clear feedback when running long-running commands.

Thanks again for your work on this feature! We're excited to see it merged and start using it in our production environments.