`bool` is encoded as the `UInt(bits=1, lanes=1)` type. One thing that is indeed true is that we have not clearly specified how to store sub-byte types; in this case they are stored as a full byte.
As I understand it, the size of a `bool` element is usually one byte, so representing it with `bits=1` seems inappropriate. On the other hand, using `UInt(bits=8, lanes=1)` to represent `bool` doesn't allow distinguishing between `bool` and `uint8` types. Therefore, I suggest adding `bool` to `DLDataTypeCode`, both for the flexibility of choosing the desired bool representation (1 or 8 bits) and to avoid overlapping with existing types.
Thanks @alonre24. On compiler projects like LLVM, `uint(1)` is used to represent the bool type (think of it as a register type). That does not preclude us from also defining a storage type for the element type. In this particular case, I believe clearly specifying the storage type for sub-byte types would be sufficient.
Thank you @tqchen, there seem to be different ways in which frameworks decide to implement bool. It is not guaranteed to be `uint(1)`; for instance, in PyTorch it is `uint(8)`, which makes it indistinguishable from a tensor with a `uint(8)` dtype.
There is talk of creating a bit tensor in PyTorch to make bool sub-byte, but at the end of the day one has no guarantees there.
I'm of the opinion that having a separate `DLDataTypeCode` for `bool`, and then using the `bits` field to describe whether that bool is held in one bit or one byte in the buffer, would make a lot of sense for interoperability.
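As a concrete sketch of that idea (entirely hypothetical at this point — `kDLBool` is not in the header), the same type code with different `bits` values would distinguish the two storage widths:

```c
#include <stdint.h>

/* Hypothetical sketch: one bool type code, with the existing `bits`
 * field selecting the storage width. kDLBool and its numeric value
 * are not part of dlpack.h here. */
enum { kDLBool = 6 /* hypothetical */ };

typedef struct {
  uint8_t code;
  uint8_t bits;
  uint16_t lanes;
} DLDataType;  /* local mirror of the dlpack.h struct */

static const DLDataType bool_in_byte = {kDLBool, 8, 1};  /* one byte per bool */
static const DLDataType bool_in_bit  = {kDLBool, 1, 1};  /* one bit per bool  */

int main(void) { return bool_in_byte.bits > bool_in_bit.bits ? 0 : 1; }
```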
> It is not guaranteed to be `uint(1)`; for instance, in PyTorch it is `uint(8)`, which makes it indistinguishable from a tensor with a `uint(8)` dtype.
Indeed, this issue would happen in MPI programs too (`MPI_C_BOOL` vs `MPI_UINT8_T`).
What I originally meant was to specify the sub-byte storage layout, which means `uint(1)` maps to bool and is stored as uint8 in DLPack. When the bool is indeed stored sub-byte in a bitmask setting, it then becomes `uint1x8` (8 lanes of 1 bit), so we can still distinguish the two using the current convention.
I just realized that in CuPy we do not support exchanging bool arrays because of the lack of `kDLBool`.
@tqchen I think this is a nontrivial interpretation that not everybody expects (nor was it documented). For example, the ongoing NumPy PR did not support bool either (https://github.com/numpy/numpy/pull/19083). AFAIK only PyTorch supports it, but it's unclear to me where they learned this. I am +1 for adding `kDLBool` for better clarity.
cc: @rgommers @emcastillo @hameerabbasi @seberg
My view is that adding a new type for bool seems best (and clearest). Does PyTorch even really support it, or does it just export bool as a uint8 (losing the `bool` type information on round-tripping)?
Thanks everyone for the great discussion so far. Trying to summarize the thread.

A0: use `i1` to represent bool

We can use `uint(1)` to represent the intent of the data type. Additionally, the specification says that any sub-byte data type needs to be stored in a full byte. This means uint1 would be stored as uint8 (a full byte), consistent with the current behavior.

For cases where the intention is to pack a bit mask (say, pack 8 bools into a byte), the data type can be represented as the vector type `uint1x8`, which can be stored as a full byte.

A1: introduce bool with bits=8

A natural extension to think about is to introduce a bool type code with bits=8. Note that the default printing behavior might not be too attractive here: while we can say int8, bool8 does not make too much sense as a dtype name.

Both A0 and A1 should achieve the goal of supporting bool in DLPack.
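To make the two options concrete, here is a minimal C sketch (my own illustration, not part of the spec; the struct is mirrored locally, and the `kDLBool` member used under A1 is hypothetical at this point):

```c
#include <stdint.h>
#include <stdio.h>

/* Local mirror of DLPack's DLDataType, for illustration only; the real
 * definition lives in dlpack/dlpack.h. kDLBool is hypothetical here --
 * it is exactly what option A1 would add. */
enum { kDLInt = 0, kDLUInt = 1, kDLFloat = 2, kDLBool = 6 /* hypothetical */ };

typedef struct {
  uint8_t code;   /* a DLDataTypeCode value */
  uint8_t bits;   /* bits per lane          */
  uint16_t lanes; /* number of vector lanes */
} DLDataType;

int main(void) {
  /* A0: bool expressed as uint1; the sub-byte storage rule pads it to a
   * full byte, so it remains distinguishable from uint8. */
  DLDataType a0_scalar = {kDLUInt, 1, 1};  /* "uint1"   */

  /* A0, bit-packed variant: 8 bools packed into one byte. */
  DLDataType a0_packed = {kDLUInt, 1, 8};  /* "uint1x8" */

  /* A1: a dedicated bool type code stored as a full byte. */
  DLDataType a1_bool = {kDLBool, 8, 1};    /* "bool8"   */

  printf("codes: %u %u %u\n", a0_scalar.code, a0_packed.code, a1_bool.code);
  return 0;
}
```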
> For cases where the intention is to pack a bit mask (say, pack 8 bools into a byte), the data type can be represented as the vector type uint1x8, which can be stored as a full byte.

I read this as storing exactly 8-tuples of bool under the specification? So `shape=1` [itemsize] would translate into `shape=8` [bits] booleans/bits for `uint1x8`?
This means:

- `SHAPE` [bits] has to be described as `uint1xSHAPE`; with `SHAPE=15` you get 15 [bits] being `uint1x15`?
- Multi-dimensional bit arrays cannot be described at all (since there is only a single `lanes`).

Also, someone might get the idea to store `uint2` in a packed/unpacked format, at which point the type distinction between booleans and bit-sized integers becomes fuzzy. That might be an unnecessary "what if", but creating `bool` explicitly avoids it.

It seems to me that if there is a possibility of bit-sized stride support (e.g. in the form of bit arrays), creating the explicit bool dtype is a lot cleaner. If that never happens, I still think it is nicer, but I doubt there are serious practical issues with using `uint1` as bool.
I had the impression that the standard already explicitly supports bit-strides, since the header includes this line (in a comment):

```c
size *= (t->dtype.bits * t->dtype.lanes + 7) / 8;
```

although I guess there is a fair chance that nobody would do that, and you can consider it undefined right now.
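As a quick sanity check, here is a small self-contained program (my own illustration, not from the header) applying that rounding rule to the dtypes discussed above:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the two DLDataType fields the formula reads. */
typedef struct { uint8_t bits; uint16_t lanes; } Dt;

/* Per-element byte size under DLPack's rounding rule quoted above. */
static size_t elem_bytes(Dt t) { return ((size_t)t.bits * t.lanes + 7) / 8; }

int main(void) {
  Dt uint1    = {1, 1};   /* single bool, padded to a byte */
  Dt uint1x8  = {1, 8};   /* packed bit mask               */
  Dt uint1x15 = {1, 15};  /* the SHAPE=15 example above    */
  printf("uint1:    %zu byte(s)\n", elem_bytes(uint1));    /* -> 1 */
  printf("uint1x8:  %zu byte(s)\n", elem_bytes(uint1x8));  /* -> 1 */
  printf("uint1x15: %zu byte(s)\n", elem_bytes(uint1x15)); /* -> 2 */
  return 0;
}
```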
> A0 is likely more consistent with the low-level compiler implementations, for example LLVM implements bool using `i1`

Yes, but I don't see how that is a reason for or against A0? The storage needs to be clearly defined as 1 or 0 (or possibly as `value != 0` and `value == 0`), of course. But the type-safety aspect is unrelated to what the compiler uses behind the scenes?
Thanks @seberg.

To follow up on the conversation about bit masking: the comment about `uint1x8` was specifically about the bit-level case. In the most common cases we are not doing that, and `uint1` plus the storage spec for sub-byte types would indeed resolve the issue (because all types are padded to a full byte when stored). The same principle likely applies to uint2 as well. `size *= (t->dtype.bits * t->dtype.lanes + 7) / 8;` is a realization of this sub-byte storage convention.

To follow up on the multi-dimensional bit array: we would indeed require some form of logical-to-physical mapping in this case. Explicitly specifying a vector type `uint1x8` or `uint1x32` corresponds to the physical storage plan of the data. Such a plan is also usually easier to manipulate directly in the presence of SIMD. The logical case is a separate question.

In the case of A1, `bool1` creates some inconsistency with the current specification: `bool1` is intended as a bit array, while the sub-byte storage convention would indicate otherwise (that it should be stored as one byte). Of course, as we discussed, bit arrays may be out of scope, so this is not too relevant here; I just want to point out the possible inconsistency.
Yeah, you are right, I managed to mis-parse the code as rounding up the size rather than the itemsize :(. Although I guess many here likely didn't realize that sub-byte datatypes are standardized to be stored padded to full bytes :).
> In the case of A1, `bool1` creates some inconsistency with the current specification, since `bool1` was intended as a bit array

You are right. Do not use `bool1` for bit-strided storage, then (by itself)! What you get is:
- `bool8` indicates a convention where True := `value != 0` and False := `value == 0`
- `bool1` indicates a convention where True := `value == 1` and False := `value == 0`, stored in 1 byte as per the current DLPack standard
- `uint1` can be used for the set of values `{0, 1}` instead of meaning `{False, True}` (which could be surprising!)

(For what it's worth, NumPy can't always guarantee the `bool1` style, as it, annoyingly, supports both in practice. But DLPack could be opinionated about `bool1` vs. `bool8`. If you were to switch to always using sub-byte strides, `bool8` would have to be specified as one of the two meanings.)
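To illustrate the difference, here is a small sketch with hypothetical helper names (mine, not from the thread) of how a consumer might treat an incoming byte under each convention:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers contrasting the two conventions above. */

/* bool8: any nonzero byte reads as True. */
static bool bool8_is_true(uint8_t v) { return v != 0; }

/* bool1, stored in a full byte: only 0 and 1 are valid, and only 1 is
 * True; a byte like 2 is malformed rather than "truthy". */
static bool bool1_is_valid(uint8_t v) { return v <= 1; }
static bool bool1_is_true(uint8_t v)  { return v == 1; }

int main(void) {
  uint8_t v = 2;
  /* Under bool8 this byte is True; under bool1 it is not even valid. */
  return (bool8_is_true(v) && !bool1_is_valid(v) && !bool1_is_true(v)) ? 0 : 1;
}
```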
What about sub-byte strides?
I do agree that there is an inconsistency if you were to use `bool1` to implicitly indicate bit-strided (non-byte-sized) storage but did not define `uint2` to use this. But I do not understand that as an argument against introducing a new `bool` dtype. To me this is primarily a type-safety (and clarity) issue, i.e. the last point.
I have heard bit-mask requests for NumPy often enough to think that it may be good to anticipate them (not necessarily act on them right now, though). The future might, for example, see a flag (e.g. on the dtype field) to indicate true bit-sized storage (non-byte-padded). That flag would allow `bool1`, `uint2`, and even `uint1` to be stored in a packed, sub-byte-strided manner.
@tqchen The discussion seems to have reached an impasse at the `uint` vs `bool` argument. https://github.com/dmlc/dlpack/issues/75#issuecomment-856241824 is a nice summary of the issue. Here is my opinion on the thread:

- `uint(1)` or `uint(8)` as bool: This seems obscure to me, as suggested in https://github.com/dmlc/dlpack/issues/75#issuecomment-856220450. Another argument against it is that there is no way to distinguish between `uint8` and `bool`. But @tqchen noted that it is more consistent with the lower-level implementation in compilers, e.g. LLVM.
- `kDLBool`: More explicit and easier to use, with a clear distinction between `uint8` and `bool`. Worst case, it introduces some redundancy for low-level compilers like LLVM (but that doesn't seem to be a real problem, such as introducing ambiguity, right?).

I would vote for an explicit `kDLBool`. I admit I am not aware of the problems this poses for low-level compiler projects like LLVM, but I don't think adding a `kDLBool` should cause a problem when exporting at such low levels. @tqchen Do you agree? If yes, we can proceed with merging #76. It would definitely be nice to have the `bool` data type in DLPack!
@tqchen @rgommers @seberg @kmaehashi @tirthasheshpatel @jakirkham @kkraus14
I'd like us to revisit this topic, and specifically I'd like to request adding `kDLBool`. Looking at the NumPy/CuPy/PyTorch adoption as of today, it is clear that none of these libraries support exchanging bool tensors, and it has increasingly become a source of bug reports (see the issues linked above). Given that these are already the 3 major Array API adopting libraries, we should add the bool type to clear up the ambiguity and unblock them.
Thanks for rekindling this topic. After thinking a bit more about it, I now also agree that having `kDLBool` is going to be helpful.
I would assume the current state would call for something like `Bool(bits=8, lanes=1)` to represent the current bool type in NumPy/PyTorch?
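In `DLDataType` terms, that proposal would look roughly like the sketch below (assuming a `kDLBool` enum member, which the header does not yet define at this point in the discussion):

```c
#include <dlpack/dlpack.h>  /* assumes a header revision that adds kDLBool */

int main(void) {
  /* Bool(bits=8, lanes=1): one bool per byte, matching the current
   * NumPy/PyTorch bool storage. */
  DLDataType numpy_bool = {
      .code = kDLBool,  /* the proposed new DLDataTypeCode member */
      .bits = 8,
      .lanes = 1,
  };
  return numpy_bool.bits == 8 ? 0 : 1;
}
```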
Sounds good. It might be nice to review whether anyone expects True to be only the bit set, rather than anything `!= 0`.
Seems there is general consensus on this. We can consider a PR to add such a spec. Perhaps @leofang can send a PR?
I think it makes sense to require that True be the only bit set.
Shoooot, my apologies @tqchen, I completely missed the traffic here for some reason. Will do later today!
To represent tensors of type bool: is it possible to add another type (kDLBool) to the `DLDataTypeCode` enum? That would be very useful, as some well-known machine learning platforms (TF, torch, Onnxruntime) support models whose inputs/outputs are tensors of type bool. Thanks!
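For reference, a minimal sketch of the requested extension; the existing members mirror `dlpack.h`, while the exact numeric value assigned to `kDLBool` here is illustrative:

```c
/* Sketch: DLDataTypeCode extended with a dedicated bool code. */
typedef enum {
  kDLInt = 0U,
  kDLUInt = 1U,
  kDLFloat = 2U,
  kDLOpaqueHandle = 3U,
  kDLBfloat = 4U,
  kDLComplex = 5U,
  kDLBool = 6U  /* proposed: tensors of type bool */
} DLDataTypeCode;
```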