coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License
180 stars 25 forks source link

fix(stream_io): Finalize temporary files correctly on `__exit__` #72

Closed Eta0 closed 8 months ago

Eta0 commented 8 months ago

Standalone Object Storage Uploads & with

The following code using stream_io.open_stream() as a context manager had been broken, and didn't upload anything to object storage:

from tensorizer.stream_io import open_stream

with open_stream("s3://my-bucket/file.txt", "wb") as file:
    file.write(b"Content")
    # Close implicitly

Whereas with an explicit call to .close(), in a context manager or not, it worked correctly:

with open_stream("s3://my-bucket/file.txt", "wb") as file:
    file.write(b"Content")
    file.close()

The reason for this is that the finalizer for a temporary file backing a stream upload first checked that the file was not already closed before uploading, to avoid uploading twice if .close() was called twice (e.g. in the second snippet: .close() is called explicitly, and then implicitly at the end of the context). However, CPython's implementation of NamedTemporaryFile.__exit__() prior to Python 3.12 actually first closes the underlying file, and then calls the wrapper's NamedTemporaryFile.close() method, which caused the upload to be skipped entirely if the file closing was triggered only by a context ending without an explicit call to .close().

This bug generally did not affect TensorSerializer use cases, since TensorSerializer objects explicitly .close() their output files either immediately after serialization when calling TensorSerializer.close(), or when garbage collected, within TensorSerializer.__del__().

After the change to using weakref.finalize() to trigger uploads in commit 3fd7f1e, the upload callback became inherently idempotent, so the check that the file is not already closed is no longer necessary anyway. Thus, this change removes the check that the file is not already closed.

Additionally, changes to the NamedTemporaryFile implementation in CPython with the addition of the delete_on_close parameter in Python 3.12 (python/cpython#97015) had the side effect that a NamedTemporaryFile's .__exit__() method no longer calls its .close() method at all. Previously, the upload trigger was attached only to .close(), so this change also switches the finalizer to now be embedded as a hook in wrappers around both .close() and .__exit__() separately.

With both of these changes, the first example code snippet now works correctly in Python versions 3.8 through 3.12.

Eta0 commented 8 months ago

@sangstar

LGTM. Just so I'm clear, was it actually skipping _temp_file_closer entirely?

It was still calling _temp_file_closer(), but short-circuiting due to some logic in it and thus skipping the part of the function that would actually do anything.

Also not sure why it's a good idea for Python 3.12+ to not be calling NamedTemporaryFile.close() during .__exit__(). Not sure what prompted them to give it a parameter to do this instead.

NamedTemporaryFile in Python is a wrapper around a regular file object that holds an extra _closer functor-thing that closes and deletes the file (optionally, based on the delete parameter). In Python 3.12, they separated the delete parameter's functionality into delete: bool, delete_on_close: bool. The latter controls if .close() triggers a deletion, or if only .__exit__() will trigger a deletion. Allowing the file to close safely before deletion allows for more stuff like our use case, where we close the temporary file after writing, perform an action on it, and then delete it as a separate step, but it requires the file to be used as a context manager to clean itself up properly.

So the reason it doesn't call close() anymore is because the logic was shifted from .__exit__() calling .close() calling ._closer.close() into .__exit__() calling ._closer.cleanup() (new) and .close() calling ._closer.close() which optionally calls ._closer.cleanup(), depending on the value of delete_on_close:

              Before                             After
┌──────────────────────────────┐   ┌──────────────────────────────┐
│      _TemporaryFileWrapper   │   │      _TemporaryFileWrapper   │
│                              │   │                              │
│ .__exit__()────────►.close() │   │ .__exit__()         .close() │
│                            │ │   │ │                          │ │
└────────────────────────────┼─┘   └─┼──────────────────────────┼─┘
                             │       │                          │
┌────────────────────────────┼─┐   ┌─┼──────────────────────────┼─┐
│      _TemporaryFileCloser  │ │   │ │    _TemporaryFileCloser  │ │
│                            │ │   │ │                          │ │
│                            │ │   │ ▼         (optional)       ▼ │
│            .close()◄───────┘ │   │ .cleanup()◄── ─── ──.close() │
│        (close & delete)      │   │(close & delete)     (close)  │
└──────────────────────────────┘   └──────────────────────────────┘

Glad weakref.finalize is a handy workaround for this :)

Yep, weakref.finalize() makes its callback idempotent in addition to almost-guaranteeing (outside of cases like SIGKILL) that it runs before the interpreter shuts down. Perfect for our use case.

Is close() still called twice if open_stream is used as a context manager? I get that that's fine now that we're using an idempotent closing function, but why is calling close() twice necessary?

The stream is closed twice when you do this:

with open_stream(...) as file:
    ...  # Write some stuff
    file.close()  # First close
# file.__exit__() is triggered upon leaving the block, second close

It's normally not necessary to write this—though because of idempotence guarantees, it is perfectly safe to do so—and people should be doing either:

# No context manager
file = open_stream(...)
# Write some stuff
file.close()

Or:

# No explicit .close()
with open_stream(...) as file:
    ...  # Write some stuff

But the latter option was broken until now. However, the double close case can happen indirectly with tensorizer if you write:

with open_stream(...) as file:
    serializer = TensorSerializer(file)
    serializer.write_module(...)
    serializer.close()  # Internally calls `file.close()` in addition to some other finalizations

Where the serializer calls file.close() to support streams with no external reference, like a file opened automatically by passing a path to TensorSerializer, or this pattern:

serializer = TensorSerializer(
    open_stream(
        path_uri=...,
        mode="wb",
        s3_access_key_id=...,
        s3_secret_access_key=...,
        s3_endpoint=...,
    )
)

Since those cannot otherwise be closed by the calling code.