coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License

[3.0] remove barrier in _prepare_for_write_encryption #160

Open bchess opened 2 weeks ago

bchess commented 2 weeks ago

From @Eta0

Regarding

            if w.tensor_data_task is not None:
                w.tensor_data_task.result(_TIMEOUT)
                w.tensor_data_task = None

Inside _prepare_for_write_encryption()

Eta0 last week This should not need to block the main thread on tensor_data_task. Do you need me to change the _crypt.ChunkedEncryption interface to make this possible? I can probably make an option to provide the buffer after constructing it, as long as the length is provided in advance.
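One possible shape for that change is a two-phase interface: construct with only the total length (so the chunk count is known up-front) and attach the buffer later, once the tensor data task has completed. This is a hypothetical sketch, not the actual _crypt.ChunkedEncryption API; the class name, parameters, and set_buffer method are all assumptions here.

```python
from typing import Optional


class ChunkedEncryption:
    """Hypothetical two-phase sketch: the real interface may differ.

    The length is provided in advance, so chunk count can be computed at
    construction time; the buffer is supplied later via set_buffer().
    """

    def __init__(self, key: bytes, total_length: int, chunk_size: int):
        self.key = key
        self.total_length = total_length
        self.chunk_size = chunk_size
        # Chunk count depends only on the promised length, not the buffer.
        self.num_chunks = -(-total_length // chunk_size)  # ceiling division
        self._buffer: Optional[memoryview] = None

    def set_buffer(self, buffer: memoryview) -> None:
        # The buffer must match the length promised at construction time.
        if len(buffer) != self.total_length:
            raise ValueError("buffer length does not match declared length")
        self._buffer = buffer
```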

@bchess bchess 2 days ago The dependency itself is on tensor_memory.nbytes, which then informs the chunk count. I think you're right that the whole block could be pulled out into a thread, but it gets a little ugly that we'd start setting other attributes (crypt_info and encryptor) of WriteSpec on a thread without a more sophisticated mutex design. Correct me if I'm wrong, but I don't think there's anything here that's CPU-intensive and able to release the GIL. And it would all need to reconcile by the next step, _prepare_for_write_headers, anyway.
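For illustration, one way to move the wait off the main thread is to chain a continuation onto the pending task rather than calling .result() inline, with a lock guarding the attributes set off-thread. This is a hedged sketch under assumptions, not tensorizer's design: schedule_encryption_setup, the nbytes attribute access, and the lock discipline are all hypothetical stand-ins for the real WriteSpec logic.

```python
import threading
from concurrent.futures import Future

# Hypothetical: one lock guarding WriteSpec attributes mutated off-thread.
_spec_lock = threading.Lock()


def schedule_encryption_setup(tensor_data_task: Future, write_spec) -> Future:
    """Run setup once tensor data is ready, without blocking the caller.

    Returns a Future that the next pipeline step (the analogue of
    _prepare_for_write_headers) can wait on to reconcile.
    """
    done: Future = Future()

    def _continue(task: Future) -> None:
        try:
            tensor_memory = task.result()  # already complete; does not block
            with _spec_lock:
                # In the real code this is where crypt_info and encryptor
                # would be assigned; here we only record the byte count.
                write_spec.nbytes = tensor_memory.nbytes
            done.set_result(None)
        except Exception as exc:
            done.set_exception(exc)

    # Runs immediately if the task is already complete, otherwise in the
    # thread that completes it.
    tensor_data_task.add_done_callback(_continue)
    return done
```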

@Eta0 Eta0 4 minutes ago nbytes can be calculated up-front in the _WriteSpec constructor as self.serialized_length = (tensor.element_size() * tensor.nelement()) & ~-(tensor.device.type == "meta") (at least in the absence of compression, which would complicate more than just this), so this step shouldn't need to wait on anything to get that data point. Adding synchronization barriers introduces issues with parallel usage of system resources (though you read my whole spiel on that recently enough!), so it's best to avoid them where possible.
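For reference, the bit trick in that expression works because negating a Python bool gives 0 or -1, and complementing that gives an all-ones or all-zeros mask. A minimal standalone illustration, with plain ints standing in for the tensor attributes (the function name and signature here are for demonstration only):

```python
def serialized_length(element_size: int, nelement: int, is_meta: bool) -> int:
    # -True == -1, so ~-True == 0 (mask everything off for meta tensors);
    # -False == 0, so ~-False == -1 (all bits set: keep the full length).
    return (element_size * nelement) & ~-is_meta


# 4-byte elements, 10 elements, on a real device: 40 bytes.
assert serialized_length(4, 10, False) == 40
# Meta tensors have no backing storage, so the length masks to zero.
assert serialized_length(4, 10, True) == 0
```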

@bchess bchess now Good point!

bchess commented 2 weeks ago

@Eta0 maybe I do need your help with the interface on ChunkedEncryption - the next step expects the CryptInfo to be set for writing headers.