Closed Eta0 closed 8 months ago
@sangstar
LGTM. Just so I'm clear, was it actually skipping
_temp_file_closer
entirely?
It was still calling _temp_file_closer()
, but short-circuiting due to some logic in it and thus skipping the part of the function that would actually do anything.
Also not sure why it's a good idea for Python 3.12+ to not be calling
NamedTemporaryFile.close()
during.__exit__()
. Not sure what prompted them to give it a parameter to do this instead.
NamedTemporaryFile
in Python is a wrapper around a regular file object that holds an extra _closer
functor-thing that closes and deletes the file (optionally, based on the delete
parameter). In Python 3.12, they separated the delete
parameter's functionality into delete: bool, delete_on_close: bool
. The latter controls if .close()
triggers a deletion, or if only .__exit__()
will trigger a deletion. Allowing the file to close safely before deletion allows for more stuff like our use case, where we close the temporary file after writing, perform an action on it, and then delete it as a separate step, but it requires the file to be used as a context manager to clean itself up properly.
So the reason it doesn't call close()
anymore is because the logic was shifted from .__exit__()
calling .close()
calling ._closer.close()
into .__exit__()
calling ._closer.cleanup()
(new) and .close()
calling ._closer.close()
which optionally calls ._closer.cleanup()
, depending on the value of delete_on_close
:
Before After
┌──────────────────────────────┐ ┌──────────────────────────────┐
│ _TemporaryFileWrapper │ │ _TemporaryFileWrapper │
│ │ │ │
│ .__exit__()────────►.close() │ │ .__exit__() .close() │
│ │ │ │ │ │ │
└────────────────────────────┼─┘ └─┼──────────────────────────┼─┘
│ │ │
┌────────────────────────────┼─┐ ┌─┼──────────────────────────┼─┐
│ _TemporaryFileCloser │ │ │ │ _TemporaryFileCloser │ │
│ │ │ │ │ │ │
│ │ │ │ ▼ (optional) ▼ │
│ .close()◄───────┘ │ │ .cleanup()◄── ─── ──.close() │
│ (close & delete) │ │(close & delete) (close) │
└──────────────────────────────┘ └──────────────────────────────┘
Glad
weakref.finalize
is a handy workaround for this :)
Yep, weakref.finalize()
makes its callback idempotent in addition to almost-guaranteeing (outside of cases like SIGKILL) that it runs before the interpreter shuts down. Perfect for our use case.
Is
close()
still called twice ifopen_stream
is used as a context manager? I get that that's fine now that we're using an idempotent closing function, but why is callingclose()
twice necessary?
The stream is closed twice when you do this:
with open_stream(...) as file:
... # Write some stuff
file.close() # First close
# file.__exit__() is triggered upon leaving the block, second close
It's normally not necessary to write this—though because of idempotence guarantees, it is perfectly safe to do so—and people should be doing either:
# No context manager
file = open_stream(...)
# Write some stuff
file.close()
Or:
# No explicit .close()
with open_stream(...) as file:
... # Write some stuff
But the latter option was broken until now. However, the double close case can happen indirectly with tensorizer if you write:
with open_stream(...) as file:
serializer = TensorSerializer(file)
serializer.write_module(...)
serializer.close() # Internally calls `file.close()` in addition to some other finalizations
Where the serializer calls file.close()
to support streams with no external reference, like a file opened automatically by passing a path to TensorSerializer
, or this pattern:
serializer = TensorSerializer(
open_stream(
path_uri=...,
mode="wb",
s3_access_key_id=...,
s3_secret_access_key=...,
s3_endpoint=...,
)
)
Since those cannot otherwise be closed by the calling code.
Standalone Object Storage Uploads &
with
The following code using
stream_io.open_stream()
as a context manager had been broken, and didn't upload anything to object storage:Whereas with an explicit call to
.close()
, in a context manager or not, it worked correctly:The reason for this is that the finalizer for a temporary file backing a stream upload first checked that the file was not already closed before uploading, to avoid uploading twice if
.close()
was called twice (e.g. in the second snippet:.close()
is called explicitly, and then implicitly at the end of the context). However, CPython's implementation ofNamedTemporaryFile.__exit__()
prior to Python 3.12 actually first closes the underlying file, and then calls the wrapper'sNamedTemporaryFile.close()
method, which caused the upload to be skipped entirely if the file closing was triggered only by a context ending without an explicit call to.close()
.This bug generally did not affect
TensorSerializer
use cases, sinceTensorSerializer
objects explicitly.close()
their output files either immediately after serialization when callingTensorSerializer.close()
, or when garbage collected, withinTensorSerializer.__del__()
.After the change to using
weakref.finalize()
to trigger uploads in commit 3fd7f1e, the upload callback became inherently idempotent, so the check that the file is not already closed is no longer necessary anyway. Thus, this change removes the check that the file is not already closed.Additionally, changes to the
NamedTemporaryFile
implementation in CPython with the addition of thedelete_on_close
parameter in Python 3.12 (python/cpython#97015) had the side effect that aNamedTemporaryFile
's.__exit__()
method no longer calls its.close()
method at all. Previously, the upload trigger was attached only to.close()
, so this change also switches the finalizer to now be embedded as a hook in wrappers around both.close()
and.__exit__()
separately.With both of these changes, the first example code snippet now works correctly in Python versions 3.8 through 3.12.