dmlc / dlpack

common in-memory tensor structure
https://dmlc.github.io/dlpack/latest
Apache License 2.0

Add `kDLBool` type #114

Closed leofang closed 1 year ago

leofang commented 1 year ago

Close #75. Close #76. Supersedes #76.

It turns out we all forgot that @alonre24 had already pushed a PR (#76), so all I did was make minor edits to the docstrings and sync with the latest master, with his commit preserved.

cc: @tqchen

alonre24 commented 1 year ago

LGTM. Glad that this suggestion has finally been approved :)

leofang commented 1 year ago

@tqchen shall we merge and make a new release (before #113 is merged)?

seberg commented 1 year ago

@tqchen is it correct that if we use the new Bool dtype with a size of 1 bit, then the storage would be assumed to actually be packed into a single bit? Or was the intention that the bit size is padded to full bytes?!

There is some need for passing bit-masks around in the dataframe community, so it would be good to clarify this use-case, if valid.

tqchen commented 1 year ago

In this case I think it should ideally be represented as Bool(bit=1, lanes=32), which represents a 32-bit bitmask. The total size of the datatype can still be aligned to bytes.

I am not too sure if we want to specify Bool(bit=1, lanes=1), since that can be machine dependent. On a machine with 8-bit bytes I guess we might still want to pad to the minimum byte. But if a machine has bit-level addressing, then it would be one bit.

seberg commented 1 year ago

@tqchen to be honest, I have always had trouble understanding how lanes are to be used. If I have, say, 1111 elements in my array, stored as bits, does it work to say that size=1111 but dtype=Bool(bit=1, lanes=8) (or something larger, since we mainly want to signal clearly that it's byte-stored)?

tqchen commented 1 year ago

The lanes field gives the number of lanes of the unit data type.

Say we want to store a bit mask that is represented by int32. To store 65 bits, we will need 3 integers. In this case, it is

array(dtype=Bool(bits=1, lanes=32), size=3)

at the low level, so we have 3 * 32 bits in total.

If we store each bool as a normal byte, then to store 65 bools we need

array(dtype=Bool(bits=8, lanes=1), size=65)
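The two layouts above can be compared with a minimal C sketch. It uses a local mirror of `DLDataType` so that it compiles without `dlpack.h`; the `Sketch*` and `sketch_*` names are hypothetical, the value 6 for the bool type code is an assumption about the merged header, and the round-up-to-whole-bytes rule is the padding reading described in this thread rather than a guaranteed part of the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of DLPack's DLDataType (code, bits, lanes). Hypothetical
 * name, used only so this sketch builds without dlpack.h. */
typedef struct {
  uint8_t code;   /* type code; kDLBool is the code added by this PR */
  uint8_t bits;   /* bits per lane */
  uint16_t lanes; /* lanes per element */
} SketchDataType;

enum { SKETCH_KDL_BOOL = 6 }; /* assumed to match kDLBool in dlpack.h */

/* Bytes occupied by one element, padded up to a whole byte. */
static size_t sketch_element_bytes(SketchDataType t) {
  assert(t.bits > 0 && t.lanes > 0);
  return ((size_t)t.bits * t.lanes + 7) / 8;
}

/* Bytes occupied by a compact 1-D array of `size` elements. */
static size_t sketch_array_bytes(SketchDataType t, size_t size) {
  return size * sketch_element_bytes(t);
}

/* Elements needed to hold `nbools` booleans when each lane is one bit
 * (ceiling division by the lane count). */
static size_t sketch_packed_elements(SketchDataType t, size_t nbools) {
  assert(t.bits == 1 && t.lanes > 0);
  return (nbools + t.lanes - 1) / t.lanes;
}
```

With `Bool(bits=1, lanes=32)`, 65 booleans need 3 elements and 12 bytes (96 bits of storage); with `Bool(bits=8, lanes=1)`, the same 65 booleans need 65 elements and 65 bytes. Under the padding assumption, `Bool(bits=1, lanes=1)` still occupies one byte per element.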

seberg commented 1 year ago

@tqchen but that is a problem: how do I pass the information that there are 65 bits, since 3 * 32 > 65 and the 65 is vital information!?

tqchen commented 1 year ago

@seberg I get what you mean. I feel that could be addressed by enhancing the array information to include sub-byte boundary information. I am mainly describing how the spec is interpreted right now, in a way that is also mostly consistent with compilers like LLVM.

seberg commented 1 year ago

I guess that the use-case I was asking for could use/abuse "lanes", because there is a side-channel to pass the actual shape. But it doesn't feel ideal to me, so I am wondering if we can think of a pragmatic way to make this possible. (To be honest, I have never seen "lanes" used, or dtypes relying on being padded to byte storage.)
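A minimal C sketch of the side-channel idea: the packed words hold 3 * 32 bits, and the logical length of 65 travels in a separate field that the dtype and element count alone cannot express. The `SketchBitmask` type and the `sketch_*` helpers are hypothetical, and the least-significant-bit-first ordering within each word is an assumption; nothing in this thread or in DLPack fixes a bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container: packed storage plus the logical bit count,
 * which is carried out of band. */
typedef struct {
  uint32_t words[3]; /* 3 * 32 = 96 bits of storage */
  size_t nbits;      /* logical length, e.g. 65 */
} SketchBitmask;

/* Set bit i, assuming LSB-first order inside each 32-bit word. */
static void sketch_set_bit(SketchBitmask *m, size_t i) {
  assert(i < m->nbits);
  m->words[i / 32] |= (uint32_t)1 << (i % 32);
}

/* Read bit i under the same ordering assumption. */
static int sketch_get_bit(const SketchBitmask *m, size_t i) {
  assert(i < m->nbits);
  return (int)((m->words[i / 32] >> (i % 32)) & 1u);
}
```

Without `nbits`, a consumer sees only 96 bits of storage and cannot tell that the last 31 bits are padding.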

seberg commented 1 year ago

Maybe it would make sense to either introduce a new dtype or some sort of flag for bitmasks?