coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License

[3.0-dev] Major refactor of serialization in prep for 3.0 #127

Closed · bchess closed this 3 months ago

bchess commented 5 months ago

(note this PR is just for merging into the v3.0.0-dev branch, not main)

file schema

The file schema has changed. Immediately following the metadata block is the full block of all header structures.

After the metadata and headers comes the block of raw tensor data. The tensor data block begins on a 4 KiB page-aligned boundary (relative to the start of the file), and each individual tensor begins on an 8-byte-aligned boundary.

Basically:

```
file_header
metadata_size
metadata0
...
metadataN
header0
...
headerN
tensor0   (file_offset % 4096 == 0)
tensor1   (file_offset % 8 == 0)
...
tensorN   (file_offset % 8 == 0)
```
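To make the alignment rules concrete, a minimal sketch (the constants and helper here are illustrative, not the actual implementation):

```python
TENSOR_BLOCK_ALIGNMENT = 4096  # start of the tensor data block (page-aligned)
TENSOR_ALIGNMENT = 8           # every subsequent tensor

def pad_to(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

end_of_headers = 13_571                  # wherever headerN happens to end
tensor0_offset = pad_to(end_of_headers, TENSOR_BLOCK_ALIGNMENT)
assert tensor0_offset % 4096 == 0        # 16384 in this example

tensor0_size = 1_000_003                 # raw tensor byte length
tensor1_offset = pad_to(tensor0_offset + tensor0_size, TENSOR_ALIGNMENT)
assert tensor1_offset % 8 == 0
```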

We may want to combine the metadata and the header entries, but they are left separate for now for easier backwards compatibility.

hash computation

Previously, header hash computations included the hash fields themselves, presumed to be zeroed. This has been changed so that the hashed segments omit the hash fields entirely. ~This breaks verify_hashes=True with 2.x files for now. We should probably fix that prior to release~ (This is now fixed; we can deserialize 2.x files with verify_hashes.)
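To illustrate the difference, a minimal sketch assuming a single hash field at a known offset (the real format's segment boundaries may differ):

```python
import zlib

def crc32_v2_style(header: bytes, hash_off: int, hash_len: int) -> int:
    # Old behavior: hash the whole header with the hash field zeroed out.
    zeroed = header[:hash_off] + b"\x00" * hash_len + header[hash_off + hash_len:]
    return zlib.crc32(zeroed)

def crc32_v3_style(header: bytes, hash_off: int, hash_len: int) -> int:
    # New behavior: hash only the segments before and after the hash field.
    crc = zlib.crc32(header[:hash_off])
    return zlib.crc32(header[hash_off + hash_len:], crc)
```

The two schemes generally disagree (feeding a run of zero bytes through CRC-32 still changes the running value), which is why 2.x files need the compatibility path mentioned above.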

bulk_write refactor

_bulk_write() has been broken up into a series of smaller functions that each perform one operation on the tensors. _WriteSpec carries the state of each tensor as it is processed. Many ops are threaded and tracked as Futures (_WriteSpec.tensor_data_task), and dependent steps are sequenced by Future chaining, as sketched below.
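The pattern is roughly the following; _WriteSpec and the step functions here are simplified stand-ins for illustration, not the real code:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional

@dataclass
class _WriteSpec:
    name: str
    tensor: object
    tensor_data_task: Optional[Future] = None  # last scheduled op for this tensor

executor = ThreadPoolExecutor(max_workers=4)

def chain(spec: _WriteSpec, op) -> None:
    """Schedule op(spec) to run after the tensor's previous task finishes."""
    prev = spec.tensor_data_task

    def run():
        if prev is not None:
            prev.result()  # enforce ordering and propagate exceptions
        return op(spec)

    spec.tensor_data_task = executor.submit(run)

# A bulk write then becomes a pipeline of small per-tensor steps:
def compute_hashes(spec): ...
def maybe_encrypt(spec): ...
def write_data(spec): ...

specs = [_WriteSpec("layer0.weight", tensor=b"...")]
for step in (compute_hashes, maybe_encrypt, write_data):
    for spec in specs:
        chain(spec, step)
for spec in specs:
    spec.tensor_data_task.result()  # block until every chain completes
```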

write_tensor()

write_tensor(), i.e. "incremental mode", is still supported. (It is implemented as a call to _bulk_write([tensor]).) There are new tests to validate its functionality.

max_tensors

Because the new file schema must know ahead of time where tensor data will start, a certain amount of space must be pre-allocated for the metadata and headers. You must therefore pass max_tensors to TensorSerializer.__init__ if you plan on calling write_tensor() more than once. max_tensors drives a rough, best-guess estimate of how much header space to reserve; in some cases, such as really long tensor names, it could still be an underestimate, and calling write_tensor() without enough pre-allocated space raises a RuntimeError.

You do not need to set max_tensors if you only call write_tensor() once.
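For example (the constructor and write_tensor() shapes below follow this PR's description, but the exact 3.0 signatures are assumptions):

```python
import torch
from tensorizer import TensorSerializer

tensors = {f"layer{i}.weight": torch.rand(256, 256) for i in range(12)}

# Multiple incremental writes: header space must be reserved up front.
serializer = TensorSerializer("model.tensors", max_tensors=len(tensors))
for name, tensor in tensors.items():
    serializer.write_tensor(name, tensor)  # assumed signature
serializer.close()

# A single incremental write does not need max_tensors:
single = TensorSerializer("one.tensors")
single.write_tensor("embedding.weight", torch.rand(256, 256))
single.close()
```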

Performance

Note that while these changes are mainly motivated by performance, I did not spend much time in this PR measuring or optimizing it. There may be regressions, but we should be able to resolve them (and beyond!) before release.

bchess commented 5 months ago

~There's some bug in the shared-data-encryption logic. Fun stuff. 🙃 I'm chasing it down.~

Done.

harubaru commented 3 months ago

Feedback

The code looks good and works well. I have tested this against various serialization tasks by serializing models for inference, none of which failed. The comments I have left are mainly to get a better understanding of the intent behind some of the changes in this PR, even where a change was as small as a different error return value.

That being said, the changes to the file schema that this PR provides are nice, such as aligning tensor data on fixed boundaries, as is the decomposition of bulk_write(). I am a bit curious about the addition of max_tensors, though. Is max_tensors a necessary argument to have because we support write_tensor()?

Additionally, it would be helpful to understand max_tensors better through a code sample that demonstrates its necessity, like the code samples in the other docstrings.

Testing

A good portion of the changes in this PR, and the PR's intent, is the change to the file schema, so I spent much of my review time testing those changes. The PR works with deserializing older 2.x artifacts, and I gathered benchmarking data from serializing and deserializing GPT-J 6B for 50 trials, which did not show a significant performance difference between this PR and 2.9.0 (or at least, I don't think the difference is significant).

[benchmark chart: GPT-J 6B serialization/deserialization, 50 trials, this PR vs. 2.9.0]

bchess commented 3 months ago

In terms of dev process:

There are a lot of good suggestions on changes and improvements, but this PR is already uncomfortably big. I think it's easier to make those changes as subsequent, individual PRs against the 3.x branch. I'd also like to empower others to make contributions 😜

I can take point on writing up issues on all of the pending comments so that they don't get lost. For some of the spicier ones, we can also use the issues as separate threads of discussion on each topic. How does that sound?

bchess commented 3 months ago

Accounting for File Range Insertion

~Imo it's a fair bit of added code complexity, and it adds more constraints on what we can optimize, for little benefit. Object stores' upload APIs have minimum chunk sizes, and you still need to assign the correct index of the block ahead of time -- you can't insert a block later.~

Thinking about this more, I may have presumed something different from what you were proposing. I don't think we should add a public interface that allows for the mid-range insertion of a tensor. But I could see the benefit of using fallocate holes to optimize the writes of incremental tensors.

Presuming this was what you intended, ticketed as https://github.com/coreweave/tensorizer/issues/159
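For illustration, the hole-based approach to incremental writes could look something like this minimal sketch; the sizes and names are made up, and a real implementation might use fallocate(2) directly rather than seek-created holes:

```python
# Leave a sparse gap where metadata/headers will eventually go, stream
# tensor data after it, and backfill the headers at the end.
HEADER_REGION = 1 << 20      # assume 1 MiB reserved for metadata + headers
DATA_START = HEADER_REGION   # tensor data block; 4 KiB-aligned by construction

header_bytes = b"\x00" * 64  # placeholder for the real serialized headers

with open("model.tensors", "wb") as f:
    f.seek(DATA_START)        # bytes [0, DATA_START) remain a sparse hole
    f.write(b"\x01" * 8192)   # incremental tensor writes stream from here
    f.seek(0)
    f.write(header_bytes)     # headers backfilled once sizes are known
```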

Eta0 commented 3 months ago

> Accounting for File Range Insertion
>
> ~Imo it's a fair bit of added code complexity, and it adds more constraints on what we can optimize, for little benefit. Object stores' upload APIs have minimum chunk sizes, and you still need to assign the correct index of the block ahead of time -- you can't insert a block later.~ Thinking about this more, I may have presumed something different from what you were proposing. I don't think we should add a public interface that allows for the mid-range insertion of a tensor. But I could see the benefit of using fallocate holes to optimize the writes of incremental tensors.
>
> Presuming this was what you intended, ticketed as #159

Yes and no: we don't need an interface that controls the actual tensor insertion point in the file. Rather, it would be useful for expanding the metadata section at the top of the file, instead of jumping through hoops to determine its size or requiring people to specify max_tensors in advance. So I'm referring to setting up the entire metadata section as a chunk of indeterminate size, for object storage purposes.

As a side note, though, you can add new parts into the middle of a multipart object storage upload, because part numbers don't need to be consecutive (for anything compatible with AWS's object storage):

> Parts upload
>
> When uploading a part, in addition to the upload ID, you must specify a part number. You can choose any part number between 1 and 10,000. A part number uniquely identifies a part and its position in the object you are uploading. The part number that you choose doesn't need to be in a consecutive sequence (for example, it can be 1, 5, and 14). If you upload a new part using the same part number as a previously uploaded part, the previously uploaded part is overwritten.

If we wanted, we could make the file header start at part 1 and the data section start at part 100, leaving room to optionally insert up to 99 extension parts before completing the upload. But we wouldn't need or want to.
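For concreteness, the mechanics with an S3-compatible client would look roughly like this (the bucket, key, and part contents are made up; the boto3 calls are the standard multipart API):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "model.tensors"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu["UploadId"]

header_part = b"\x00" * (5 * 2**20)  # non-final parts must be >= 5 MiB on S3
data_part = b"\x01" * (8 * 2**20)

# Headers as part 1, data starting at part 100; parts 2-99 stay available
# for later insertion until the upload is completed.
parts = []
for number, body in ((1, header_part), (100, data_part)):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                          PartNumber=number, Body=body)
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

# The completion list must be sorted by PartNumber, but gaps are allowed.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload_id,
    MultipartUpload={"Parts": sorted(parts, key=lambda p: p["PartNumber"])},
)
```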