Open tqchen opened 2 years ago
cc @rgommers @wookey @tirthasheshpatel @junrushao1994 @jermainewang @leofang @kkraus14 @szha @hawkinsp
what should downstream frameworks expect in the way of migration functions? should folks proposing an ABI-breaking change here also be on the hook to provide an implementation that produces a new DLTensor struct given an old one? it could be helpful to provide such implementation to clarify how to interpret such older structs.
Thanks @areusch this is indeed something that worth covering. The compatibility implication section covers some of that
__dlpack__info__
at which point the new fields needs to be filled properly.It is a bit ambiguous to call it ABI-breaking(due to S0) but would be good to list out all the implications
one concern with placing the version information at the end of the struct is that, in a V1 DLTensor, that field is invalid. therefore, the memory in that location could be set to any value. that begs the question whether we should require some magic, and how that magic should perhaps be computed.
i think adding the migration functions would help folks to consider these corner cases, which are not always obvious when writing these specs.
I believe what you said is covered by S1 part, which I agree that some of the migration function(likely filling in the value) would be useful
i'm not sure i follow--how can a C impl trust this memory, since it's outside the boundaries of a DLTensor
in v1 of this spec? even if __dlpack__info__
exists on the Python wrapper class, how could the impl know whether or not the DLTensor instance was produced by a V2+ framework (and therefore, trust the memory which holds the version info)?
The protocol is covered in the PR to implement this, as documented here (summarized below). As I understand the proposal, if the producer has no __dlpack_info__
, then by definition it is using V1: this is S0 above. If the consumer calls __dlpack__()
with no version, then the producer should return a V1 C structure for backward compatibility, at least for the next few years: this is S1 above. Otherwise, as described in the protocol, the consumer requests a specific dlpack version, which then dictates the C structure returned by the producer.
@mattip thanks for the pointer. those protocol docs make sense when Python is in the loop.
I'm thinking of the following scenario--it could be something that doesn't apply to dlpack, so please correct me if I'm wrong:
DLTensor*
as void*
)Now suppose the consumer is trying to work with the DLTensor created by the producer. Presumably this involves a flow in which a DLTensor*
is cast into DLTensorVersioned*
(e.g. via DLTensor*
(from an old library not compiled with an up-to-date dlpack) -> void*
(suppose this is done in order to pass via a PackedFunc in TVM-land) -> DLTensorVersioned*
).
When the consumer (or the consumer's wrapping Python DLTensor class) tries to access the received DLTensorVersioned*
, it would try to make a version check against version.dlpack
, but the producer allocated a smaller struct (sizeof(DLTensor)
), that memory is invalid.version.dlpack
could be random.
A way around this could be to require that all DLTensor
passed via C++ be allocated by the consumer, and the consumer could then zero-initialize the DLTensorVersioned
and fill in the requested version. However, then that drags in some complexity around requiring that sizeof(DLTensorVersioned)
only grows as version increases, so there are some drawbacks there too.
Independent of the discussion here, my suggestion would be to let this RFC gather feedback for 1-2 months before making changes of the code. An ABI break for such an important data exchange protocol is serious. Having enough feedback will ensure that there are no regrets.
@wjakob thanks for the suggestion! That is indeed the goal of this RFC(to ensure a long enough period of time before we commit as we value stability), I made it clear in the post
They are communicating by pointer which erases the type (e.g. PackedFunc in TVM-land passes DLTensor as void)
Can you point to this code? At some point the producer must decide to allocate and fill in the C struct. That is the point where there needs to be a protocol for specifying a dlpack version. If the architecture of the producer disallows any such protocol, how can dlpack ever change?
Edit: I think the term used to describe such situations is "dependency hell".
Some update after taking a closer look. Actually right now the proposed ABI did break S0 as well. The main reason was due to the fact that DLManagedTensor's deleter field seats right after a DLTensor. So any append of new fields would effectively change the location of deleter and break S0.
To further clarify the situation, let us think about the following possible changes to DLTensor fields:
Due to our favoring of stability ideally C1 should be extremely rare. And C0 ideally should maintain backward compatibility(the new struct can be used safely in old settings -- ignoring the fields).
There are a few choices if we consider the need that "C0 should not break" ABI compatibility
struct DLManagedTensor {
Version version;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
DLTensor dl_tensor;
};
Note that this is a ABI breaking change.
However, this allows future append to DLTensor fields without problem, so future C0 changes can be done in a backward compatible way.
struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
Version version;
new_flags here
};
This is a ABI backward compatible change. It does bring restrictions
struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
more flags
};
struct DLManagedTensorVersioned {
Version version;
DLManagedTensor managed_tensor;
}
Due to our favoring of stability. Ideally we would strongly favor S0 property (so frameworks do not need to bring extra support to keep up with the protocol), and the fact that DLPack rarely changes is an important property of the exchange protocol, of course, we do want to think about mechanisms to allow us to deal with breaking in case there is really need for it.
I am +1 for the A1 option. It is a minimalist option without severe backward-compatibility concerns. I have proposed this alternative in #105. @rgommers @mattip @seberg @leofang I'd appreciate your opinion on this!
I would tend towards creating the new name (i.e. A2 or variations of it). Otherwise, appending the fields requires out-of-bound exchange of the information whether the field exists. We can do that in Python but it seems stranger in C? EDIT: OK, I guess it is not strange in C. I think my main feeling is that once the old version is not used anymore A2 seems simpler.
If a new name is created, new C code can indicate that by using DLManagedTensorVersioned
in headers to give some type/ABI safety.
How exactly the DLManagedTensorVersioned
that should look like, I am not sure. Just a stub with the version or any eternally stable ABI fields (which is than cast to the specific versioned one). In the above DLManagedTensor
is included, which also works. The downside is that at some point the "included" version may change, so I am wondering if it isn't clearer to force the user to cast. (The layout is still the same in both cases of course, the question is how to expose it in the headers.)
If a new name is created, new C code can indicate that by using
DLManagedTensorVersioned
in headers to give some type/ABI safety.
How would one do that? I think the idea of A2 is to still export the managed_tensor
field of DLManagedTensorVersioned
struct and use offset of the version field to retrieve the version info.
I don't know how current API looks like. I imagine exposing something like from_dlpack(DLPackManagedTensor *tensor)
and DLPackManagedTensor *managed_tensor = to_dlpack(object)
.
And the only way to transition that, is probably to create a new pair of functions that use DLManagedTensorVersioned *
? But if that is how the transition happens anyway, then I don't see much point in placing the version last.
So yes, a current from_dlpack
could remain compatible with newer versions for now. But for how long? If there is any ABI change in the future it will remain being compatible. so I am not sure that would actually give a very clear transition for when an ABI change happens. (It would assume that eventually no old users exist and then ABI break is possible?)
So, I feel that advantage of "ABI compatibility" with current from_dlpack
is unclear. So I would preference doing what seems cleanest in the long term. And for me, that is placing the version first and renaming the struct.
And the only way to transition that, is probably to create a new pair of functions that use
DLManagedTensorVersioned *
?
As far as C/C++ is concerned, I too think this is the only way to transition. Both A1 and A2 sound like good options then. I also see now how A2 can provide type safety. @tqchen The A2 option says "In new version, allocate DLManagedTensorVersioned but still return the pointer to the managed_tensor." I don't think we should do that since it takes the type safety advantage away (making A1 a simpler solution). Otherwise, I am happy to go with A2 too.
What do you think @mattip?
Let's wait for some others to chime in and then I can update the PR.
I like A2 as well.
One of the mentioned disadvantages of A1 is "No changes to the current fields of DLTensor itself can be made". How would this look for A2? (In general, it seems to me that adding any fields to DLTensor will be an ABI break.)
How would this look for A2?
We can add new fields onto DLTensor since the version is the first field in DLManagedTensorVersioned (as opposed to A1 where the version field comes after DLTensor).
This would work for software that knows how to interpret a DLManagedTensorVersioned
with that specific version, but it would still break older software that depends on a compatible layout, no? How is that not an ABI incompatibility?
The idea is to introduce new functions (to_dlpack(DLManagedTensorVersioned *)
and from_dlpack(DLManagedTensorVersioned *)
) that convert and export the tensor to a DLManagedTensorVersioned *
. It's good to have for type safety but otherwise, A1 is simpler.
The killer feature of DLPack is its simplicity: it's just a header file with a few data structures that are trivially ported to various languages. Frameworks that can exchange DLPack tensors don't need to link to a shared or static library to do so. They don't need even deal with C or C++. Introducing required versioning-related functions as part of this repository sounds like it has the potential to significantly complicate this success story. My 2cts...
(Edit: I am against versioning per se. I am just wondering what this means for ABI interoperability between software that may and may not implement this versioned extension.)
Without versioning, DLPack can't evolve. There are already requests for read-only data support, non-native endianness, etc. Adding those without a version to separate the new structure will lead to spurious breaks everywhere (since not everyone will update at the same time and newer functions won't work with older ones). I am not sure if there is a simpler way; after a lot of discussions, these options seem to be the simplest of all. In long term, we can remove the old structs and also update the spec to not care about the old ABI.
Introducing required versioning-related functions as part of this repository sounds like it has the potential to significantly complicate this success story.
The proposal is to add version-info in the struct to avoid this exactly. All the complications happen in the spec and not the DLPack header file...
Hi everyone,
The issue has been up for almost a month. So far, I think @mattip, @seberg, and @leofang are in favor of versioning and @wjakob is against it. Let's wait until 16'th May for more feedback and, if there is none, it would be good to merge #105 and wait until 23'rd May before closing this to allow some time to revert the changes if needed. Please feel free to add your thoughts on how DLPack should proceed with versioning and ABI break. Thanks!
this is a tricky situation which I suspect comes down to how everyone actually uses DLPack in practice. it'd be helpful to know more about everyone's usage of DLPack so we could evaluate whether we're all debating the same use case or if there are others.
my initial impressions of the proposals:
A0.
DLManagedTensorVersioned
is found.might I suggest A3, a variant of A2:
static constexpr const uint32_t kVersionedDLTensorMagic = 0xD1D1D1D1;
struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
};
// note: supersedes DLManagedTensor
struct DLManagedTensorVersioned {
Version version;
void * manager_ctx;
void (*deleter)(struct DLManagedTensorVersioned* self);
uint32_t magic; // set to kVersionedDLTensorMagic ^ (((uintptr_t) &self->dl_tensor) - ((uintptr_t) self))
DLTensor dl_tensor;
}
here is what i think example usage might look like:
manager_ctx
and deleter
void* caller_tensor
.ptrdiff_t version_ptrdiff = *((int32_t*) caller_tensor )- 1) ^ kVersionedDLTensorMagic
, which may be the difference between the address of DLManagedTensorVersioned.dl_tensor and DLManagedTensorVersioned. We could also enforce here that version_ptrdiff is e.g. < 0x100
.DLManagedTensorVersioned* candidate_versioned = ((uintptr_t) caller_tensor) - version_ptrdiff;
candidate_versioned->version
is supported and validate version_ptrdiff based on knowledge of that version (e.g. v2 expects version_ptrdiff == 16
).DLManagedTensor dl_managed_tensor = caller_tensor
and treat as v1.deleter(self)
, where those fields are looked-up as is apropos given version
.the advantage to this is that the managed fields are moved next to the version, so that they could be used to terminate the lifecycle of a tensor whose version is incompatible with the client library. this scheme is of course vulnerable to various memory tricks as well--you could consider patching Version to include some xor'd magic too. it is more complex, though. thoughts?
it'd be helpful to know more about everyone's usage of DLPack
That is the hard part here, and why @tirthasheshpatel submitted #106. Other users of DLPack can add additional use cases. Without concrete use cases, it is difficult to relate to "what if" scenarios.
To me the A3 proposal seems too "magical". The use of __dlpack_info__
in A2 allows the provider to explicitly state what version of the API they will support, answering "would the API tell you this?" with "yes".
@wjakob when you say
Introducing required versioning-related functions as part of this repository sounds like it has the potential to significantly complicate this success story
do you agree that the structure is incomplete? If not, what do you respond to NumPy who, in accordance with the Array API consortium recommendations, added support for DLPack but requested support for readonly and byteorder flags?
This need for versioning was pointed out in #34 which was opened in March 2019, and has been extensively discussed in other issues over the passing years.
Just to make that clear (it seems there was some misunderstanding). I am definitely not against versioning. However, it would be useful if versioning-related information is appended to data structures so that they are still interpretable by older versions at an ABI level.
Second -- there is a lot of value to having DLPack data structures described as POD/Plain Old Data in a way that is accessible from many different languages without, e.g., having to include this repository or linking to a library. There was some of including C-level functions, for example by @tirthasheshpatel
The idea is to introduce new functions (
to_dlpack(DLManagedTensorVersioned *) and from_dlpack(DLManagedTensorVersioned *)
) that convert and export the tensor to aDLManagedTensorVersioned *
. It's good to have for type safety but otherwise, A1 is simpler.
Maybe this is fine. However, this is a deviation from the current purely POD-style interface where the only kind of function is a function pointer.
Finally, one suggestion that I would also have is that the PR#101 adds some kinds of flags field.
Adding uint8_t readonly;
is kind of wasteful (1/8 bits used, and another 3 unused bytes will be added for padding).
Why not add a uint32_t flags
value? This could represent readonly status, endianness, and whatever other people will want to add in the future without having to discuss an ABI break every time.
@mattip agreed A3 is kind of magical and that's less desirable than an explicit indication. It seemed like there was an attempt being made to avoid breaking backwards-compatibility by considering the DLTensor
fields as sacred and locating new fields elsewhere.
I think there are a few possibilities:
__dlpack_info__
helps here because it would be the consumer's __dlpack_info__
wrapping the produced DLManagedTensor, right?__dlpack_info__
is a reasonable out-of-band signaling mechanism.I should also state that my interest here is purely from POV of having TVM interoperate with other libraries. These possibilities capture the range of situations I could imagine a standard such as this needing to confront.
@wjakob
it would be useful if versioning-related information is appended to data structures so that they are still interpretable by older versions at an ABI level.
Yes, this is exactly what #105 will do (I will update the PR to follow A2).
Second -- there is a lot of value to having DLPack data structures described as POD/Plain Old Data in a way that is accessible from many different languages without, e.g., having to include this repository or linking to a library. There was some of including C-level functions, for example by @tirthasheshpatel
I meant the C/C++ libraries that want to add support for the new ABI will have to create those new functions. DLPack itself won't have any functions and will still remain a POD. So, no linking will be necessary.
Why not add a
uint32_t flags
value? This could represent readonly status, endianness, and whatever other people will want to add in the future without having to discuss an ABI break every time.
The plan is to batch all the requests like readonly, endianness, alignment, etc in one single ABI break. Althouh, it would be useful to also add some padding for future use.
@areusch
- Two C++ libraries integrating without any Python support: If one library is to return a DLManagedTensor, we either have to signal a version by a) modifying memory outside the bounds of the v1 structure; b) modifying an unused field in the v1 structure; or c) introducing an explicit API which carries the drawbacks @wjakob mentioned (i.e. this is no longer just an in-memory format).
For C++ libraries: new functions like DLManagedTensorVersioned *to_dlpack_versioned(...)
and ... from_dlpack_versioned(DLManagedTensorVersioned *)
can be added. Since the new struct is different, we get type safety and there would be no straightforward way for the users to mix DLManagedTensor and DLManagedTensorVersioned.
- Consumer has Python support but is integrating with a C++-only producer library
I think this one's a tricky case. Yes, if a C++ tensor is imported in Python, there would be no __dlpack_info__
for the consumer to call. But I also don't know how would one import C++ tensors in Python. There has to be a C/C++ API using which the two communicate, in which case, it would be simply an exchange at C level. They can use the suggestion above i.e. introduce new to_dlpack_versioned and from_dlpack_versioned.
The problem I have is, that I still don't know how the C++ usage really looks like. In Python, I think this all works out great, because the usage is only an exchange Protocol. That means we have basically 1-2 functions to worry about and we can break API on those (even using a try/except
).
For all I know, the C++ usage may be looking more like:
import some_library
tensor = DLPackManagedTensor() # my data
result = some_library.do_calculation(tensor)
# or even, more complicated:
result = some_library.random_data() # no input.
Now, if your API looks like that, you cannot replace every single function that takes a DLPackManagedTensor
. And while it may be helpful, user of the library probably will not wan to deal with the logistics like:
result = some_old_library(to_old_dlpack(tensor))
result = to_new_dlpack(result)
(Those functions could be static inline functions included in the header, just for "emergency" use to support an outdated library though!).
Now, if we are dealing with C++ – as far as I understand – C++ should be doing name-mangling, so you can write some_library
to support two ABIs by simply renaming the structs and exposing overloads.
In that case, the user has to wait for adopting the new ABI until some_library
supports it (or use conversion helpers in rare cases). But if you wait, things should work great and you get a linking error if some_library
is outdated.
For a C only API, name mangling doesn't exist, so the "magic" may well be a useful way for a library to support both.
OTOH, I am not sure if it wouldn't be just as well to add an #ifdef USE_NEW_DLPACK_ABI
to the header and than mangel the symbol names manually (without the user really noticing).
From a Python API perspective: I don't think we as "Python exchange protocol" folks should care whether or not you insist on doing "magic". It simply doesn't affect us because we have better ways to deal with it. We can just trivially break the ABI by just adding a single try/except
or similar.
But, at least me as a Python exchange protocol person do feel that trying to make dlpack a generic exchange protocol in Python requires an ABI break (or preparing for ABI break). There are already clear limitations, and a IMO a modern exchange protocol can be expected to have ABI versioning.
Hope you don't mind me jumping in with some ideas even though I just today learned of this wonderful project.
I'll talk about the C ABI only. (I think the Python parts can work on top of what I propose below, though I could be missing something.)
It seems to me that readers (of the structs, not necessarily of the tensor data; same henceforth) usually need to evolve ahead of writers in order not to disrupt end users, given that the flags being considered for addition (read-only, byte-swapped) are things that cannot be safely ignored by a reader (it is better for a reader to give up rather than (try to) mutate a read-only buffer or use the wrong byte order). So I think what is important is for new readers to be able to safely read structs written by old writers, not the other way around. For this to be the case, there needs to be a way to check the version without danger of undefined behavior (segfault or worse).
I'm assuming here that binary compatibility (affects end users) is more important than source compatibility (affects mostly library devs).
If it is super-easy for the same reader to read either ABI v1 (current) or v2 (first versioned version), there could be a migration period during which libraries start supporting reading either v1 and v2 but continue to write v1 (except in testing, or in cases where the new features are essential). Once enough of the ecosystem knows how to read v2, libraries can start writing v2 with no disruption.
One way to ensure that there is no danger of undefined behavior with v1 and v2 structs coexisting is to use the high bit (bit 31) of device_type
, which is currently always 0. We can set the bit to indicate a versioned (managed) tensor. Hopefully existing readers check this field, and would reject the new format safely (albeit with the wrong error message, although that's easy to fix).
Also, looking at the suggestions (as in #105) to add readonly
(and future fields) to DLManagedTensor
rather than DLTensor
(for understandable reasons), it seems like there would no longer be much point in having separate structs for DLTensor
and DLManagedTensor
, because the former would no longer be usable by itself. DLTensor
could just contain manager_ctx
and deleter
at the same offset, and zero them if not managed.
Taking these points into account, here is what I came up with:
typedef struct {
int32_t device_type; // High bit set if versioned tensor
int32_t device_id;
} DLDevice;
typedef struct DLTensor {
void *data;
DLDevice device; // See above
int32_t ndim;
DLDataType dtype;
int64_t *shape;
int64_t *strides;
uint64_t byte_offset; // Unchanged up to here
void *manager_ctx; // Write zero if not managed
void (*deleter)(struct DLTensor *self); // Ditto
uint16_t version; // Readers must give up if higher than known
uint16_t flags; // Bits TBD; unused bits must be written zero
uint32_t reserved_in_v2; // Must be written zero
} DLTensor;
typedef DLTensor DLManagedTensor; // For compat
#define DLPACK_TENSOR_GET_VERSION(t) \
((t).device.device_type & 0x80000000 ? (t).version : 1)
#define DLPACK_TENSOR_GET_DEVICE_TYPE(t) \
((t).device.device_type & 0x7FFFFFFF) // Or ... | 0x80000000 with new enum constants
enum { // These are TBD, I'm just showing as examples
kDLFLAGS_READONLY_MASK = 1 << 0,
kDLFLAGS_BYTESWAPPED_MASK = 1 << 1,
};
#define DLPACK_TENSOR_IS_READONLY(t) \
(DLPACK_TENSOR_GET_VERSION(t) >= 2 && (t).flags & kDLFLAGS_READONLY_MASK)
// And so on
Note that manager_ctx
and deleter
are at the same offset as they always were, and that these new definitions can be used to write the v1 format as well, by simply clearing the high bit of device_type
. So libraries can switch to the new header but keep writing v1 structs until others are ready to read v2.
The enum constants for device type could be updated to include the set high bit, making writing v2+ the default. Function macros could be made static inline
functions instead. And there are probably other ways to improve this to decrease the chances of incorrect usage.
When future additions are made, the version number should be incremented if the added feature is something that readers cannot safely ignore, or if it requires enlarging the struct. Readers can indefinitely support structs of previous versions, as far as I can see, unless they require specific guarantees by the writer indicated by new features. Writers can choose to keep writing an older version (but maybe recommend >= 2) if they do not need any newer features and wish to support the broadest range of readers.
The reserved_in_v2
field may not be necessary (what are the chances of adding a feature that is not a boolean flag and does not require mandatory behavior of readers?), but I put it in to pad the struct to its natural alignment. If the field exists, writers must always zero it, or else future readers will misbehave. It is named with the version number so that future versions are less likely to cause bugs in existing source.
There could be exceptions to my view that readers generally need to support features before writers can adopt them. For example, a hypothetical "immutable" flag (not to be confused with read-only; immutable is an additional promise by the writer that data won't change) could be added without incrementing the version number, because readers that don't care can safely ignore it. (I'm not sure whether an "immutable" flag would be practically beneficial. Just an example to illustrate version evolution.)
Pinging here, since it has been idle for a long time and came up in NumPy yet again. To me (and others I have talked to), this is still a major blocker for any serious adoption of DLPack in Python. It would be nice to get some movement here. Otherwise we are sacrificing Python development to C++ compatibility which is a tradeoff that is not at all appealing to me from the Python side.
Thanks @seberg for the ping and I agree with the point being said. Sorry for dropping the thread.
After reading all the suggestions in the post, here is one suggestions resolution(that mostly evolves from A2)
struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
};
struct DLManagedTensorVersioned {
Version version;
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
// more flags
}
DLManagedTensorVersioned
. DLManagedTensor
remains the legacy version.The immediate need for more flags is the read_only one. Which I agree is useful. We can discuss whether make it a bitmask or a uint_8 flag. Likely a uint32 boolean mask might be helpful to allow future flags being added.
In order to manage the breaking change, we would need to update the python APIs in two steps:
from_dlpack
function check the ABI version in __dlpack_info__
, and raises exception when the version mismatches. Future version of ABI updates expects the capsule to be DLManagedTensorVersioned.The C++ side's code can automatically detects type errors as the type of DLManagedTensorVersioned
and DLManagedTensor
differs from each other.
Would love to see how folks think. @tirthasheshpatel @seberg @wjakob @mattip @areusch @rgommers
Maybe we could try to set up a half-hour zoom call around this since it seems we are not communicating the change properly. As I understand things, it is incorrect to check ABI version in from_dlpack(obj: PyCapsule) -> Array
, the capsule has already been created using either a DLManagedTensor
or a DLManagedTensorVersioned
. If the latter, and the consumer has not been updated to handled DLManagedTensorVersioned
, the consumer will raise an error since it will not understand the PyCapsule
contents. Once we have transitioned to most producers using DLManagedTensorVersioned
in Array.__dlpack__()
, then indeed the consumer can check the DLPack version and error if the version number is not understood.
As proposed in #101, the place to call __dlpack_info__() -> Version
is before calling the producer's Array.__dlpack__(self, *, stream: int = None, version: int = 70) -> PyCapsule
. The call would either return a DLManagedTensor
in the PyCapsule
(if version <=70
) or a DLManagedTensorVersioned
(if version > 70
).
Sorry I was not getting it clearly,
As per the new data-api, from_dlpack
takes in a class (nnot the capsule) that contains obj.__dlpack__
method, so the python implementation can call obj.
def from_dlpack(other: Object):
try:
ver = other.__dlpack__info()
except:
# older version that do not have info.
# checks here
return internal(other.__dlpack__())
so i guess are people passing around non-wrapped void*
that refer to either DLTensor or DLManagedTensor? I think this would be a way that the C++ side would fail to detect the error at compile time, since casts may be present.
@tqchen curious if you're taking a position on the exact ABI break to make here (since @marktsuchida suggested one above), or whether you're first wanting to tackle the semantics of how the program behaves in the presence of a version mismatch?
In the case of C++:
void*
, it is usually a result of FFI, given the primary use case is python, S0 step is designed so that our most common use-case, python array exchange have a clear path for updating without breaking things.Indeed the proposal won't handle the breaks where a C++ api takes a void* and do cast. This is a very rare case, and the benefit of having a stable clean ABI for future versions outweights the one-step transition patches to support such cases, which I am not aware of any in production.
Cool, I guess in C++ there is no serious ABI issue (unless dirty casts things are done). In C, there is in principle and libraries may need to introduce a new function name when adopting (if they wish to be careful and distinguish), but it seems not a major concern?
If we can agree on such an ABI break by introducing a new name, then I think we can proceed with making a concrete proposal that is Python friendly (or a concrete proposal with the Python API in mind).
I am unsure whether we can get away with not modifying DLTensor
(creating a new name), but I suspect we can at least for the time being.
If that is the step where we are, I would suggest a meeting with those interested to discuss how a very concrete Python API/friendly proposal will look like. That might just be Matti, me, and others from the Python side in a first instance but it would also be great if more will join.
On the C++ side, as long as we make it clear we discourage C-style cast and reinterpret_cast
it should be safe I think.
+1 for a meeting on this.
Happy to host a half-hour zoom call on this. Please fill out your availability here https://www.when2meet.com/?16752946-wS8RV for the week of Sep 19.
Note that the time is in EDT, so if you are in different time-zone, maybe some converstion is needed.
Sorry, @tqchen, is it possible to open up time slots in the week of Sep 26? NVIDIA folks (=me 😅) would be busy in the Sep 19 week due to GTC.
OK, updated the link to change ro week of Sep 26
I don't have enough of a stake in this to warrant joining the call, but just a quick note that I wrote my above comment under the assumption that (1) existing consumers blindly assume that received capsules contain (as type-erased void *
) the existing DLManagedTensor
layout, and that (2) it is desirable to prevent such consumers from crashing or corrupting memory when they receive the new (versioned) data layout. (As far as I can tell, a magic checksum does not help with (2), because existing consumers don't know about the magic. Hence the use of the high bit of device_type
.)
I think I may have been incorrect about (1) -- if the capsule names (dltensor
and used_dltensor
) are changed (once and for all, say, to dlvtensor
and used_dlvtensor
) when the new, versioned data layout is introduced, there will be no chance of (correct) old consumers trying to read new data incorrectly. Updated consumers can support the old and new names for a period, if desired. (Changing the capsule names was mentioned by @tirthasheshpatel in #34.)
The scheme I proposed is complicated and ugly, and would remain so semi-permanently. Its only advantage is that it can offer memory safety (in the (2) sense) without the type information having to travel separately. I wouldn't recommended it if there are other solutions like changing the capsule names.
thanks @marktsuchida for careful thoughts, you are more than welcome to join the call btw
Let us lock down to Friday, 230 EDT. Sep 30 given the responses. Please mark on your calendar and I will send a zoom link here
The open discussion meeting of DLPack ABI update is scheduled to be on Friday 230EDT Sep 30. Please use the zoom link to join the meeting
Thanks @leofang @seberg @kkraus14 for joining the meeting. We will keep this thread open for one week feedback then post a notice for one month before we do the update
struct DLManagedTensor {
DLTensor dl_tensor;
void * manager_ctx;
void (*deleter)(struct DLManagedTensor * self);
};
/*!
* \brief The DLPack and DLPack ABI versions of the tensor.
*/
typedef struct {
/*! \brief DLPack version. */
uint32_t dlpack;
/*! \brief DLPack ABI version. */
uint32_t abi;
} DLPackVersion;
struct DLManagedTensorVersioned {
DLPackVersion version;
void * manager_ctx;
void (*deleter)(struct DLManagedTensorVersioned * self);
uint64_t flags;
DLTensor dl_tensor;
}
typedef DLPACK_BIT_MASK_READ_ONLY 1
@tqchen do you have any timelime to release the managed version ? we are maintaining rust bindings here https://github.com/kornia/dlpack-rs that we use for image/video loading in kornia for which we want to release soon new versions.
This is a thread to bring awareness and discuss the recent set of proposed change of adding versioning information to DLPack.
Summary
Up until now the DLPack C struct itself does not carry ABI version information, and we always maintain backward compatibility. Up until now there is no ABI breaking changes.
The overall stability is one reason why frameworks adopt DLPack in the first place and we would like to continue to do moving forward. In the meantime, it is indeed helpful to clarify the DLPack ABI version in the data structure itself. contains a draft proposal for the change. We can still
In short, we are looking into the possibility of attaching version information to the DLTensor struct, to minimize the ABI change, the version information is appended in the end of the current DLTensor
Summary of key changes
__dlpack_info__
function that returns the current supported API/ABI version.Compatibility implications
One thing that is worth clarifying is that whether this constitute a ABI breaking change. It depends on how people use it. Let us clarify the following scenarios.
__dlpack_info__
then do the conversion according__dlpack_info__
to indicate that it is already at a new versionNormally S0 means that the data structure ABI is still backward compatible (new data structure being used in old scenarios). S1 somewhat seats in the future-compatible regime(old data structure being used in new scenarios).
Related threads
What to Expect
This is a notice and discussion thread to let the community that this is happening. It would be great to get folks to also tag related framework communities, so the ecosystem is fully aware of(and ideally buy-in) the proposed change before we formally proceed.
Given this change does have compatibility implications (although not breaking depending on how we see it), we will have a longer period of NOTICE(expect 1-2 months) before we proceed. Please tag as many folks you think are related as possible.
In the future, we also suggest to open thread of this kind to discuss the compatibility implication of the changes, if any.