coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License
153 stars 24 forks source link

[3.0] Canceled serialization should still do in-place decryption #144

Open bchess opened 2 weeks ago

bchess commented 2 weeks ago

@Eta0 Eta0 last week In-place decryption during serialization is extremely important in the error path: it must be absolutely assured to not leave tensor data left encrypted in-place. Because of that, you can't cancel all futures on an exception: the decryption futures must proceed if any encryption has occurred, and they must be sure to trigger even if their other dependent futures have been cancelled.

On a similar note, it needs to be guaranteed that self._maybe_decrypt_data can be called to queue up the decryption task in the first place, even if the main thread throws an exception at some point after calling self._prepare_for_write_encryption and queueing encryption but before the self._maybe_decrypt_data call would have happened normally. It would probably be good to add a field in _WriteSpec stating whether the tensor data is currently encrypted or not to handle cases where only some tensors were successfully encrypted before an error occurred, as decrypting them without them having been encrypted would also scramble their data.

We could probably add unit tests for this, too, e.g. by mocking the function that writes tensor data such that it always throws an exception, and then ensuring that the tensor data after serialization exits is the same as it was before it started.

@bchess bchess 2 days ago This is a good point. +1 to adding unit tests, because feels like something that could easily regress