jperez999 closed this PR 1 year ago
This PR adds the ability to create tensor columns from 3d arrays
It seems that we're already able to construct tensor columns representing 3d arrays, both dense and ragged (in the 2nd dimension):
```python
TensorColumn(np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]), offsets=np.array([0, 1, 3]))
# two rows; 2nd dimension ragged, 3rd dimension of size 2
# [[0.1, 0.2]]
# [[0.3, 0.4], [0.5, 0.6]]
# shape: [2, None, 2]
```
The change to the shape computation in `_construct_shape` is, I suppose, what counts as the support for 3d arrays; it currently returns an incorrect value when passed values/offsets where the values tensor has a rank of 2 or more.
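For illustration, here's a minimal sketch of the behavior I'd expect (not the actual `_construct_shape` implementation): derive the shape from values/offsets in a way that preserves the fixed trailing dimensions of a higher-rank values tensor.

```python
import numpy as np

def construct_shape(values, offsets=None):
    # Hypothetical sketch: with offsets, the first dimension is the row
    # count, the ragged dimension is unknown (None), and any remaining
    # dimensions of the values tensor are fixed trailing dimensions.
    if offsets is None:
        return list(values.shape)
    return [len(offsets) - 1, None, *values.shape[1:]]

# Matches the example above: values of shape (3, 2), offsets [0, 1, 3]
construct_shape(np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]), np.array([0, 1, 3]))
# [2, None, 2]
```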
The rest of the change could more broadly be described as extending the column constructor to support Python lists containing rows of values as an alternative method of column construction, and could perhaps be part of a different PR from the correction to the shape method.
> support python lists containing rows of values as an alternative method of column construction
If we frame this change as enabling column construction from Python lists containing row values, then it might make sense to support constructing a column from a list of scalars too, e.g.

```python
NumpyColumn([[1], [2]])  # ok
NumpyColumn([1, 2])      # raises TypeError: 'int' object is not iterable
```
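One hypothetical way to support that (the `_normalize_rows` helper is illustrative, not something in this PR): wrap bare scalars into single-element rows before the usual list handling.

```python
def _normalize_rows(values):
    # Hypothetical helper: treat a list of scalars as a list of
    # single-element rows; leave iterable rows untouched.
    return [row if hasattr(row, "__iter__") else [row] for row in values]

_normalize_rows([1, 2])      # [[1], [2]]
_normalize_rows([[1], [2]])  # [[1], [2]] (unchanged)
```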
It might also make sense to extend this style of column construction to the TensorFlow and Torch columns, which currently raise errors about missing attributes (they assume that the values passed in are a framework tensor, not a Python list corresponding to rows).
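A sketch of what that extension might look like for the Torch column, under the assumption that lists of rows get converted into a flat values tensor plus offsets up front (the helper here is hypothetical):

```python
import torch

def _to_torch_values(values):
    # Hypothetical: accept either a framework tensor or a Python list of
    # rows; flatten the rows into one values tensor and compute offsets.
    if isinstance(values, list):
        rows = [torch.as_tensor(row) for row in values]
        offsets = torch.tensor([0] + [len(r) for r in rows]).cumsum(0)
        return torch.cat(rows), offsets
    return values, None

_to_torch_values([[1, 2], [3]])
# (tensor([1, 2, 3]), tensor([0, 2, 3]))
```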
I think those would be fine changes to make. The intent of this PR wasn't to provide full support for Python lists, though, so they seem like something we could defer to a future PR. I think we're all generally aligned on the notion that additional changes will be required to flesh out support for Python lists.
I think my mental criteria for when a PR makes sense to merge are satisfied in this case, so it seems okay to move forward with. 🤷🏻
This needs work; we cannot merge it in its current state. All tensors must be completely flat in order to successfully do a dlpack transfer, meaning we need to keep the stride. Luckily, reshape is supposed to be almost a no-op: https://stackoverflow.com/questions/36062574/why-is-reshape-so-fast-spoiler-copy-on-write
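To illustrate the "almost a no-op" point in NumPy terms: a reshape of a contiguous array returns a view (no data copy), while reshaping a strided view forces a copy.

```python
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)        # contiguous input: just a view, no copy
assert b.base is a         # shares memory with the original

sliced = b[:, :2]          # non-contiguous slice: strides changed
flat = sliced.reshape(-1)  # this one has to copy the data
assert not np.shares_memory(flat, a)
```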
> Luckily reshape is supposed to be almost a no-op

Based on what we found about dataloader performance with reshape, TensorFlow may be an exception to this general rule? https://github.com/NVIDIA-Merlin/dataloader/pull/116
> All tensors must be completely flat in order to successfully dlpack transfer

Is this stride issue specific to dlpack transfers between TensorFlow tensors and other frameworks?
> > Luckily reshape is supposed to be almost a no-op
>
> Based on what we found about dataloader performance with reshape, TensorFlow may be an exception to this general rule? NVIDIA-Merlin/dataloader#116

Yes, it seems like the -1 reshape is not a great thing to run against TensorFlow. The idea is to move that into the array side in the dataloader to circumvent this problem in TF.
> > All tensors must be completely flat in order to successfully dlpack transfer
>
> Is this stride issue specific to dlpack transfers between TensorFlow tensors and other frameworks?

Yes, this is mainly an issue with TensorFlow, particularly when going from a CuPy dlpack capsule to a TF tensor. And it only happens when you need to split the original tensor (i.e. when you have a chunk and you want to get a batch out of it). We didn't run into this before because we would dlpack the entire chunk and then split it in the tensor's target framework.
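A sketch of the workaround being described, assuming a GPU and the standard dlpack entry points (`cupy.ndarray.toDlpack` and `tf.experimental.dlpack.from_dlpack`); the chunk/batch shapes here are made up:

```python
import cupy as cp
import tensorflow as tf

chunk = cp.arange(12, dtype=cp.float32).reshape(4, 3)
batch = chunk[::2]  # splitting the chunk yields a strided view

# Transferring the strided view directly can fail on the TF side, so make
# it contiguous/flat on the CuPy side first, then restore the shape after
# the transfer (this is the "-1 reshape moved to the array side").
flat = cp.ascontiguousarray(batch).reshape(-1)
tf_batch = tf.reshape(tf.experimental.dlpack.from_dlpack(flat.toDlpack()), batch.shape)
```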
It looks like this still has some uses of CuPy that get triggered in environments where there's no GPU available. Otherwise it seems fine, but that's probably worth fixing.
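A common pattern for guarding that (a sketch, not necessarily how merlin-core structures it): fall back to NumPy when CuPy is missing or no GPU is present.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() == 0:  # no visible GPU
        cp = None
except Exception:  # ImportError, or CUDA runtime errors on GPU-less hosts
    cp = None

def array_module(device="gpu"):
    # Use CuPy only when it imported cleanly and a GPU device exists.
    return cp if (device == "gpu" and cp is not None) else np
```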
This PR adds the ability to create tensor columns from 3d arrays, more specifically embedding arrays. It can support more than three dimensions, so long as there is only one ragged dimension and all dimensions after it are fixed, representing an embedding.