apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.5k stars 3.53k forks source link

[Format][C++] Recommended/required value for ArrowDeviceArray.device_id int in case of CPU data #40801

Closed jorisvandenbossche closed 6 months ago

jorisvandenbossche commented 7 months ago

In https://github.com/apache/arrow/pull/40717#pullrequestreview-1958505514, @paleolimbot noted that the current implementation of the C Device interface in Arrow C++ is using a device_id of -1 when exporting CPU data (while nanoarrow is using 0).

The spec itself doesn't say anything about this (https://arrow.apache.org/docs/format/CDeviceDataInterface.html#c.ArrowDeviceArray.device_id):

Mandatory. The device id to identify a specific device if multiple devices of this type are on the system. The semantics of the id will be hardware dependent, but we use an int64_t to future-proof the id as devices change over time.

Should we recommend or require a specific value for this case?

I noticed that the DLPack specification, from which the device type/id structure was taken, does specify to use 0 (https://dmlc.github.io/dlpack/latest/c_api.html#c.DLDevice.device_id):

The device index. For vanilla CPU memory, pinned memory, or managed memory, this is set to 0.

Should we adopt the same guideline for our spec?

cc @pitrou @zeroshade @kkraus14

zeroshade commented 7 months ago

Since the device_id should never be utilized or have any meaning in the case of CPU memory I'm fine with it either being unspecified or being explicitly specified to be 0. In my opinion it's an arbitrary decision / value

pitrou commented 7 months ago

Perhaps we could add something like: "If the device type does not have a notion of device id, it is recommended to return 0 as a convention".

kkraus14 commented 7 months ago

Keeping aligned with dlpack and using 0 probably makes the most sense but there is some downsides.

If we wanted to optionally denote which NUMA node CPU memory is on for example, being able to use numa node 0 would be good. Using -1 as the default would be more ideal in this situation.

For managed memory, if that memory has been touched on a specific GPU then it would be local to that GPU and while it could be accessed by any device, it would be expensive to do so where it would be ideal to use an actual device id for that. If it hasn't been touched yet then again something like -1 would be ideal to indicate that it's not known if the memory is local anywhere.

zeroshade commented 7 months ago

Given @kkraus14's comments, I think we'd be better served with recommending -1 then

raulcd commented 6 months ago

My understanding is that this is a doc clarification, should we add it for 16.0.0?

paleolimbot commented 6 months ago

I'll give it a go in the next few minutes!

raulcd commented 6 months ago

Issue resolved by pull request 41101 https://github.com/apache/arrow/pull/41101