KhronosGroup / NNEF-Docs

NNEF public repository
Apache License 2.0

Tensor File Format #2

Closed zoeoz closed 6 years ago

zoeoz commented 6 years ago

I would like to present a few thoughts about the Tensor File Format described in Section 5.2 for consideration.

I expect the rationale for storing a quantization algorithm string in the file is motivated by the fact that syntax for quantization operations is already defined in Section 4.8.5 as part of the grammar for the computational graph. The assumption appears to be that the system component that reads the tensor file will be the same component that reads the computational graph. In other words, a syntax parser is assumed to be available to parse the quantization algorithm string out of the tensor file in order to understand the contents of the tensor data.

I think this unnecessarily creates a dependency between system components. Ideally, I’d prefer to see a separation of concerns between system components that are required to read the tensor file vs. the computational graph, reflected in the specification of the two separate formats.

If the tensor file format used binary fields to store the parameters of the quantization algorithm, then the component that reads tensor file data could be designed and implemented without a syntax parser. That dependency turns an otherwise very simple, lightweight tensor file reader into something quite complex.
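To illustrate, a fixed-layout binary header can be read with a single struct unpack and no parser. The layout below (magic, data offset, rank, eight dimension fields, a quantization-algorithm ID, and two scalar parameters, enough for, say, linear quantization) is entirely hypothetical, a sketch rather than the NNEF layout:

```python
import struct

# Hypothetical fixed binary header (illustrative only, not the NNEF spec):
# magic (4 bytes), data offset (u32), rank (u32), 8 x dim (i32),
# quantization-algorithm ID (u32), two f32 quantization parameters.
HEADER = struct.Struct("<4sII8iIff")

def read_header(buf: bytes) -> dict:
    # A reader like this needs no syntax parser, only struct unpacking.
    magic, offset, rank, *rest = HEADER.unpack_from(buf, 0)
    dims = rest[:8][:rank]          # only the first `rank` dims are meaningful
    quant = (rest[8], rest[9], rest[10])
    return {"magic": magic, "offset": offset,
            "dims": list(dims), "quant": quant}
```

The trade-off, as discussed below, is that a fixed set of binary fields constrains how varied the quantization algorithms can be.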

Related to this issue is the offset to the actual data. The second 4 bytes of the format specify this offset, even though the size of the header can be determined. I assume the reason for explicitly specifying the offset is that there may be padding between the end of the header and the beginning of the actual data. The specification doesn’t mention this, but it seems to be implied.

Since the tensor file format is specifically designed to efficiently store large amounts of data in a compact binary format, it would be nice if the format were also “friendly” to an implementer using memory-mapped files to access the data. Memory mapping, though, may be difficult or inefficient if the tensor data can begin at any byte offset in the file. For example, if the tensor data is 32-bit floating-point, at a minimum it would be best for the tensor data to be aligned to at least a 4-byte boundary.

The current specification seems to allow this, since the offset to the tensor data must be explicitly specified. However, the specification also doesn’t provide any specific alignment requirements, so in general a system component designed for memory mapping the tensor file data couldn’t rely on this.

I suggest it may be worth considering an adjustment to the specification to accommodate efficient memory mapping, for example by requiring that the actual tensor data be aligned to a byte offset that is compatible with the underlying data type. An alternative is to require that the tensor data begin at the first byte of the file, with the “header” information placed at the end of the file as a “footer.” A reader can then seek to the end of the file to extract the footer, and memory map the actual tensor data beginning at the first byte of the file, which is a convenient alignment for memory mapping.
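A footer-based reader could be sketched as follows; the footer layout here (byte count, rank, up to 8 dimensions) is my own invention for illustration, not a proposed normative layout:

```python
import os
import struct

# Hypothetical fixed-size footer (illustrative only): total tensor byte
# count (u32), rank (u32), and up to 8 dimensions (u32 each).
FOOTER = struct.Struct("<II8I")

def read_footer(f):
    # Seek to the end of the file and read the fixed-size footer. The
    # tensor data then starts at byte 0 of the file, which a memory
    # mapping places on a page-aligned address for free.
    f.seek(-FOOTER.size, os.SEEK_END)
    nbytes, rank, *dims = FOOTER.unpack(f.read(FOOTER.size))
    return nbytes, dims[:rank]
```

With this arrangement no explicit data offset is needed at all, since the data always starts at byte 0.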

gyenesvi commented 6 years ago

Indeed, the string format for the quantization algorithm creates a dependency between the parser and the component that processes the binary data, which would ideally be eliminated. However, the rationale was that quantization algorithms may be quite varied, and custom defined as well; a string description is flexible enough to accommodate such variety, while a few binary fields may not suffice. The idea was that any custom quantization algorithm could be defined as a compound operation (in the structure definition file, just like other custom compound ops), and the data files could refer to those custom operations for quantization. It may turn out, however, that such flexibility is not required, or is not worth the complexity introduced by the dependency on the parser. These are exactly the issues on which feedback is required, so let me know what you think, considering the flexibility needed for custom quantization algorithms.

As for the offset to the data: it is required because there are two fields in the header that may vary in length. One is exactly the quantization string; the other is the shape of the tensor (only relevant dimensions are included). The latter would be relatively easy to make fixed-length, by allowing, say, at most 8 dimensions and filling the trailing dimensions with 1s (but we did not do this because the header was variable-length already). Otherwise, the header is always aligned to 4 bytes (all fields are either multiples of 4 bytes or can be grouped into 4 bytes), so the actual data starts at a 4-byte offset, which as you say is beneficial.
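That 4-byte invariant can be sketched numerically. The field sizes below are illustrative guesses rather than the spec's actual layout; the point is only that if every fixed field is a multiple of 4 bytes and the quantization string is padded to a multiple of 4, the total header size, and hence the data offset, stays 4-byte aligned:

```python
def header_size(rank: int, quant_str: bytes) -> int:
    # Illustrative layout: 16 bytes of fixed fields (e.g. magic, version,
    # data offset, rank), 4 bytes per stored dimension, and the
    # quantization string padded up to a multiple of 4 bytes.
    padded_qlen = (len(quant_str) + 3) & ~3
    return 16 + 4 * rank + padded_qlen
```

Any header built this way ends on a 4-byte boundary regardless of the rank or the string length.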

It must be noted that we did not specifically want to optimize the format for fast loading, because the goal of NNEF is to be an exchange format, not a deployment format. The assumption is that the data will be converted to vendor-specific formats anyway as an offline process, and that the vendor-specific data can be loaded fast. Or do you have any other reason to use memory mapping?

Making the file format relatively easy to process while not compromising flexibility is always a desirable design goal.

zoeoz commented 6 years ago

I think you summarize the question surrounding the quantization algorithm string very well by asking whether the flexibility it provides is actually required, and whether it is worth the added complexity of making the tensor file format dependent on the syntax parser. My experience with neural networks is primarily on the training side, so it will be helpful if other users doing inference can provide real-world use cases to support leaning in one direction or the other.

Regarding the dimensions of the data, I think the current mechanism, which specifies the number of dimensions and then provides a variable number of fields for each dimension, is sufficient. Keep in mind that when memory mapping files, a header that varies in size is no problem at all. (In retrospect, I probably should have started two separate topics, because the question surrounding the quantization algorithm string is unrelated to the question about memory mapping.) When the file is memory mapped, the first byte of the file is aligned to a memory address that is an integer multiple of the operating system’s page size. So if the offset to the tensor data is known, it is trivial to increment this memory address to the beginning of the actual tensor data. The only problem arises when the offset to the tensor data is not an integer multiple of the size, in bytes, of the tensor’s data type, since that can cause unaligned loads.
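A minimal sketch of this access pattern, assuming 32-bit float tensor data at a hypothetical in-file offset; note that only the offset's alignment matters, not the header's size:

```python
import mmap
import struct

def read_floats(path: str, offset: int, count: int) -> list:
    # Map the whole file: the OS places the mapping on a page-aligned
    # address, so element i of the tensor lives at offset + 4 * i within
    # the mapping. If `offset` is a multiple of the item size, no typed
    # access straddles an alignment boundary.
    assert offset % 4 == 0, "data offset not aligned to the item size"
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [struct.unpack_from("<f", mm, offset + 4 * i)[0]
                    for i in range(count)]
```

In C or C++ the equivalent would be casting `base + offset` to a typed pointer, which is exactly where an unaligned offset would bite.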

I understand NNEF is not primarily designed as a deployment format and that it might not be possible for all implementations to use memory mapping. However, making the format “friendly” to memory mapping would only require defining, at a minimum, that the offset to the tensor data be aligned to a byte offset in the file that is some integer multiple of the size, in bytes, of the underlying data type of the tensor. In my view, this would be a small concession that would not negatively impact the standard as an exchange format.

gyenesvi commented 6 years ago

Okay, I understand now what your requirement would be for memory mapping. At least I understand that starting the actual data at an offset that is a multiple of the item size is required for pointer arithmetic. But what would you require for data that is less than a byte per item? The current format allows any bit width per item, such as 5 bits, packed into a contiguous byte array. Is memory mapping useful in that case? Anyway, even if it is not, it can be utilized for data that is at least one byte per item.
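For what it's worth, sub-byte items can still be decoded in software from a mapped byte array. A sketch, assuming MSB-first packing (the packing order is my assumption here; the spec would have to pin it down):

```python
def unpack_bits(data: bytes, bits: int, count: int) -> list:
    # Decode `count` items of `bits` bits each from a contiguous,
    # MSB-first packed byte array, e.g. 5-bit items as mentioned above.
    out, acc, nacc = [], 0, 0
    it = iter(data)
    for _ in range(count):
        while nacc < bits:          # refill the accumulator byte by byte
            acc = (acc << 8) | next(it)
            nacc += 8
        nacc -= bits
        out.append((acc >> nacc) & ((1 << bits) - 1))
    return out
```

Alignment of the packed array does not help the per-item decoding itself, but it does let the byte stream be loaded with wide aligned reads.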

zoeoz commented 6 years ago

The pointer arithmetic facilitates loading data from the memory map into processor registers, but the processor is where the alignment requirements come from. For example, both Intel SSE and 64-bit ARM have special instructions for loading 128-bit (16-byte) registers from memory that is aligned to a 16-byte address. The data may represent four 32-bit floating-point numbers, two 64-bit floating-point numbers, four 32-bit integers, eight 16-bit integers, or sixteen 8-bit integers. Whatever the case, the first element of the data must align to a 16-byte offset. If the data is not aligned in memory to a 16-byte address, the processor throws an exception.

Intel AVX/AVX2 and AVX-512 have similar alignment requirements: 32-byte alignment for 256-bit (32-byte) registers and 64-byte alignment for 512-bit (64-byte) registers, respectively.

The first byte of the tensor file will always be mapped by the operating system to a memory address that is an integer multiple of the operating system’s page size (usually 4 KB). So the beginning of the file always falls on an aligned address, regardless of the processor make or model. This means that if the offset, in bytes, to the tensor data falls on one of the alignment boundaries required by a particular processor, then the processor can load the data into registers very efficiently, without any exceptions, regardless of the underlying tensor data type.

So it is sufficient to say that the second 4 bytes of the tensor file format, which store the offset to the tensor data, must be an integer multiple of P, where P is the required alignment boundary, in bytes, that is compatible with the processor. This means the offset to the tensor data may be larger than the size of the header, i.e., there may be some unused bytes between the end of the header and the beginning of the tensor data in order to ensure the tensor data begins on the required byte offset.
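The rounding itself is the standard align-up computation: a writer pads the header size up to the next multiple of P and records that as the data offset.

```python
def align_up(n: int, p: int) -> int:
    # Round n up to the next multiple of p, e.g. a header size up to the
    # alignment boundary P. Works for any positive p; for a power of two
    # it is equivalent to (n + p - 1) & ~(p - 1).
    return (n + p - 1) // p * p
```

A reader never needs this formula; it simply trusts the stored offset, which is what keeps the constraint unobtrusive.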

The main question is, what value of P should the standard specify? At a minimum, I would recommend P is equal to 16 bytes, since this at least guarantees the 128-bit Intel and ARM instructions for loading registers can always be used, regardless of the tensor data type. Considering a larger value of P, such as 32 or 64, would guarantee that a broader class of processor instructions (Intel AVX and AVX2, in this case) can also always be used. Considering an even larger value of P, such as 128, 256, or even 1024 would accommodate all available processors and provide compatibility with future advancements in processor technology.

Perhaps the only argument against a larger value of P, such as P=1024, is that it may lead to extra padding between the end of the header and the beginning of the tensor data. However, even with P=1024, I expect the size of the padding would be very small compared to the size of a typical tensor file.

As for data types smaller than 8 bits, I think only the new generations of custom AI accelerators may have direct support for them. The above scheme would at least guarantee that various existing processor instructions (SSE, ARM, AVX, AVX2, etc.) could load the data very efficiently, and the alignment might benefit new custom AI accelerators too; I can’t be sure. My main observation is that enforcing an alignment constraint on the tensor data offset is a relatively simple and unobtrusive way to make NNEF “friendly” to memory mapping for a wide range of machines and implementers, keeping in mind it might not be suitable for every particular deployment implementation.

gyenesvi commented 6 years ago

Thanks for the details on this. The requirement seems clear now, and I agree that enforcing the alignment constraint does not affect the file format in any negative way, so it can easily be included.

gyenesvi commented 6 years ago

The tensor file format has been greatly reworked in the final version of the spec (recently released), let me know what you think.

zoeoz commented 6 years ago

Yes, I really like the new tensor file format. It is very friendly to memory-mapped implementations. Also, specifying the quantization algorithm information without the NNEF grammar/syntax is a helpful separation of concerns from a systems architecture and implementation standpoint, since the system component that manages tensor file I/O won’t require a syntax parser.

The only new thought I had is that the 32-bit unsigned integer fields in the header limit the size of the tensor data to less than 4 GB. Probably not a practical concern anytime soon, but 64-bit fields would future-proof it.
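The limit is easy to check at write time; a small sketch (the helper name and the check are mine, not part of the format):

```python
def fits_u32(dims: list, item_bytes: int) -> bool:
    # A u32 length field caps the payload at 2**32 - 1 bytes (< 4 GB),
    # so a writer should verify the tensor byte count before emitting
    # the header.
    n = item_bytes
    for d in dims:
        n *= d
    return n < 2 ** 32
```

For example, a 1024 x 1024 x 1024 tensor of 32-bit floats is exactly 4 GiB and would already overflow such a field.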

gyenesvi commented 6 years ago

We did think about whether we need 64-bit fields, but figured that a single tensor is unlikely to get bigger than 4 GB (even whole networks are much smaller than that, and the trend is to shrink them further). However, if we ever hit that barrier, we still have the option to increase the version of the binary format and redefine it with 64-bit fields.

zoeoz commented 6 years ago

Ah, yes. Bumping the version number is a fine solution if it ever becomes an issue. I agree it will be unlikely.