NVIDIA-Merlin / core

Core Utilities for NVIDIA Merlin
Apache License 2.0

add 3d tensor support to creating tensor columns #246

jperez999 closed this 1 year ago

jperez999 commented 1 year ago

This PR adds the ability to create tensor columns from 3d arrays, more specifically embedding arrays. It can support more than three dimensions, so long as there is only one ragged dimension and all dimensions after it are fixed, representing an embedding.
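For illustration, here is a minimal sketch of the kind of construction this enables, where each row is a variable-length sequence of fixed-size embeddings (the import path and NumpyColumn usage are assumptions based on merlin-core's merlin.table module, not code from this PR):

import numpy as np
from merlin.table import NumpyColumn  # assumed import path

# Two rows of 4-dimensional embeddings: row 0 holds one embedding and
# row 1 holds two, so only the second dimension is ragged.
rows = [
    np.array([[0.1, 0.2, 0.3, 0.4]]),
    np.array([[0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]),
]
column = NumpyColumn(rows)  # expected shape: [2, None, 4]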

oliverholworthy commented 1 year ago

This PR adds the ability to create tensor columns from 3d arrays

It seems that we're already able to construct tensor columns representing 3d arrays, both dense and ragged (in the 2nd dimension).

import numpy as np
from merlin.table import TensorColumn  # assumed import path in merlin-core

TensorColumn(np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]), offsets=np.array([0, 1, 3]))

# two rows: 2nd dimension ragged, 3rd dimension of size 2
# [[0.1, 0.2]]
# [[0.3, 0.4], [0.5, 0.6]]
# shape: [2, None, 2]

The change to the shape in _construct_shape I suppose counts as support for 3d arrays, since it currently returns an incorrect value when passed values/offsets where the values tensor has a rank of 2 or more.
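For reference, a sketch of the shape computation one would expect for ragged values of arbitrary rank (construct_shape here is a hypothetical standalone helper, not the PR's actual _construct_shape):

import numpy as np

def construct_shape(values, offsets):
    # Hypothetical logic: the row count comes from the offsets, the ragged
    # row length becomes None, and any remaining dimensions of the values
    # tensor (e.g. an embedding width) are fixed.
    if offsets is None:
        return tuple(values.shape)
    num_rows = len(offsets) - 1
    return (num_rows, None, *values.shape[1:])

# values of shape (3, 2) with offsets [0, 1, 3] -> (2, None, 2)
print(construct_shape(np.zeros((3, 2)), np.array([0, 1, 3])))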

The rest of the change could more broadly be described as extending the column constructor to support Python lists containing rows of values as an alternative method of column construction, and could perhaps be part of a different PR from the correction to the shape method.

oliverholworthy commented 1 year ago

support Python lists containing rows of values as an alternative method of column construction

If we frame this change as enabling construction of columns from Python lists containing row values, then it would arguably make sense to support constructing a column from a list of scalars too.

e.g.

NumpyColumn([[1], [2]])   # ok 

NumpyColumn([1, 2])  # Raises TypeError: 'int' object is not iterable

It might also make sense to extend this style of column construction to the TensorFlow and Torch columns, which currently raise errors about missing attributes (they currently assume that the values passed in are a framework tensor, not a Python list corresponding to rows).
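One way this style of construction could work under the hood is to normalize the incoming rows into a flat values array plus offsets before handing off to the existing values/offsets path (rows_to_values_offsets is a hypothetical helper, not code from this PR):

import numpy as np

def rows_to_values_offsets(rows):
    # Hypothetical helper: promote scalars to length-1 rows, then flatten
    # all rows into one values array, with offsets marking row boundaries.
    arrays = [np.atleast_1d(np.asarray(row)) for row in rows]
    values = np.concatenate(arrays)
    lengths = [arr.shape[0] for arr in arrays]
    offsets = np.concatenate([[0], np.cumsum(lengths)])
    return values, offsets

values, offsets = rows_to_values_offsets([1, 2])  # scalars become length-1 rows
# values: [1 2], offsets: [0 1 2]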

karlhigley commented 1 year ago

I think those would be fine changes to make. The intent of this PR wasn't to provide full support for Python lists, though, so they seem like something we could defer to a future PR. I think we're all generally aligned that additional changes will be required to flesh out support for Python lists.

My mental criteria for when a PR makes sense to merge are:

I think those are true in this case, so it seems okay to move forward with. 🤷🏻

jperez999 commented 1 year ago

This needs work; we cannot merge it in its current state. All tensors must be completely flat in order to successfully dlpack transfer, meaning we need to keep the stride. Luckily, reshape is supposed to be almost a no-op: https://stackoverflow.com/questions/36062574/why-is-reshape-so-fast-spoiler-copy-on-write#:~:text=The%20runtime%20of%20reshape%20is,%3E%3E
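A minimal sketch of the flatten-before-transfer idea (assuming cupy and tensorflow are available; toDlpack is CuPy's legacy capsule API, and the specific shapes are illustrative):

import cupy as cp
import tensorflow as tf

chunk = cp.arange(12, dtype=cp.float32).reshape(3, 4)
batch = chunk[:2]  # a batch sliced out of a larger chunk

# Flatten on the array side: reshape of contiguous data is nearly a no-op,
# and a rank-1 contiguous buffer transfers cleanly via DLPack.
flat = cp.ascontiguousarray(batch).reshape(-1)
tensor = tf.experimental.dlpack.from_dlpack(flat.toDlpack())
tensor = tf.reshape(tensor, batch.shape)  # restore the shape on the TF side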

oliverholworthy commented 1 year ago

Luckily reshape is supposed to be almost a no-op

Based on what we found about dataloader performance with reshape, TensorFlow may be an exception to this general rule? https://github.com/NVIDIA-Merlin/dataloader/pull/116

oliverholworthy commented 1 year ago

All tensors must be completely flat in order to successfully dlpack transfer

Is this stride issue specific to DLPack transfers between TensorFlow tensors and other frameworks?

jperez999 commented 1 year ago

Luckily reshape is supposed to be almost a no-op

Based on what we found about dataloader performance with reshape, TensorFlow may be an exception to this general rule? NVIDIA-Merlin/dataloader#116

Yes, it seems like the -1 reshape is not a great thing to run against TensorFlow. The idea is to move that into the array side of the dataloader to circumvent this problem in TF.

jperez999 commented 1 year ago

All tensors must be completely flat in order to successfully dlpack transfer

Is this stride issue specific to DLPack transfers between TensorFlow tensors and other frameworks?

Yes, this is mainly an issue with TensorFlow, particularly going from a CuPy DLPack capsule to a TF tensor. It only happens when you need to split the original tensor (i.e. when you have a chunk and you want to get a batch out of it). We didn't run into this before because we would DLPack the entire chunk and then split it in the tensor's target framework.
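For contrast, a sketch of that earlier ordering, which avoided the stride problem (same assumptions as the sketch above):

import cupy as cp
import tensorflow as tf

chunk = cp.arange(12, dtype=cp.float32).reshape(3, 4)

# Transfer the whole contiguous chunk first...
whole = tf.experimental.dlpack.from_dlpack(chunk.toDlpack())

# ...then split into batches inside TensorFlow, where slicing is safe.
batch = whole[:2]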

karlhigley commented 1 year ago

It looks like this still has some uses of CuPy that trigger in environments where there's no GPU available. Otherwise this seems fine, but that's probably worth fixing.