NVIDIA-Merlin / dataloader

The merlin dataloader lets you rapidly load tabular data for training deep learning models with TensorFlow, PyTorch, or JAX
Apache License 2.0
406 stars 25 forks

[FEA] Data loader: support to padding sparse sequential features on the left side #128

Open gabrielspmoreira opened 3 years ago

gabrielspmoreira commented 3 years ago

Is your feature request related to a problem? Please describe.
The PyTorch and TF dataloaders support padding list (sparse) features on the right, which means that shorter list sequences are completed with 0s on the right.

For sequential recommendation, a common use case is to keep the last N user interactions, which can be done either in preprocessing or on the model side. The NVTabular Slice op supports truncating to the last N elements (by providing negative limits). But it is also useful to be able to do additional truncation on the model side (e.g. truncating with a larger max sequence threshold in the Slice op, then tuning the best max sequence length according to model accuracy and training speed).

To do such truncation on the model side, the padding needs to be applied by the dataloader on the left side of the sequence features, so that when they are converted to dense tensors the padding 0s are placed on the left. Features could then be sliced in the model like feature[:, -keep_last_n:] without losing the sequence features of users with fewer than N interactions.

Describe the solution you'd like
Create an argument for the dataloader, sparse_padding_side, which defaults to right but can be set to left

lesnikow commented 3 years ago

I am assigned this, but I wanted to comment in addition, per CONTRIBUTING.md, and say that I have begun working on this.

lesnikow commented 3 years ago

Hello everyone, and @benfred @karlhigley @jperez999 in particular, since I see you have significant contributions to these test files:

Would you know of an existing unit test or tests that you could point me to that would be a good template for this feature request? I have looked in test_torch_dataset.py, test_tf_dataset.py, test_dataloader_backend.py, and the list_slice test in test_ops.py. But I am not sure yet which would be a good unit test template for implementing and testing left padding in the two framework dataloaders.

In particular, is there a unit test currently available for testing padding on the right, as mentioned in the issue by @gabrielspmoreira?

I did find def test_nested_list(), authored by @zhenwendai in test_tf_dataloader.py, which has, for instance, pad_data_col = ..., but I am not sure whether it tests padding in addition to testing the loading of nested lists. Thank you.

lesnikow commented 3 years ago

@gabrielspmoreira, when you have a chance, would you be able to point me to where in the codebase the existing right padding can be specified? I have looked in the codebase and dataloader modules, but so far I have not been able to find this. I would like to add the sparse_padding_side argument at this point in the codebase, or someplace logically related.

lesnikow commented 3 years ago

@benfred @jperez999 @karlhigley I wanted to check in today: would you have any help, advice, or guidance on my comment above?

lesnikow commented 3 years ago

@gabrielspmoreira I wanted to check in with you today: would you have any guidance or advice on my comment above?

gabrielspmoreira commented 3 years ago

Hey @lesnikow. Some time ago I created a preliminary version of the code that converts the internal NVTabular representation of sparse features (values, offsets) to sparse tensors, and @jperez999 later ported it and integrated it into the NVTabular dataloader.

To give an example, let's say a column in a parquet file has sequence/list values, with 3 rows like this

[10,20]
[30]
[40,50,60]

The internal NVTabular representation (values, offsets) would be something like the following, where the offsets inform how many values we have for each row

values = [10,20,30,40,50,60]
offsets = [2,1,3]

Then the equivalent sparse matrix can be built with values and indices like this

values = [10,20,30,40,50,60]
indices = [[0,0],
           [0,1],
           [1,0],
           [2,0],
           [2,1],
           [2,2]]

Finally, the sparse tensor is converted to a dense tensor in this line, which is padded on the right. In this example I assume seq_limit=5 for this feature

[10, 20,  0,  0, 0]
[30,  0,  0,  0, 0]
[40, 50, 60,  0, 0]
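The whole (values, offsets) → indices → dense conversion above can be sketched in a few lines of PyTorch (a standalone illustration of the idea, not the actual NVTabular implementation):

```python
import torch

# Standalone sketch of the conversion described above (not the NVTabular code).
values = torch.tensor([10, 20, 30, 40, 50, 60])
row_lengths = torch.tensor([2, 1, 3])  # the per-row counts called "offsets" above
seq_limit = 5

# Row index of every value: [0, 0, 1, 2, 2, 2]
rows = torch.repeat_interleave(torch.arange(len(row_lengths)), row_lengths)
# Column index of every value within its row: [0, 1, 0, 0, 1, 2]
cols = torch.cat([torch.arange(int(n)) for n in row_lengths])

dense = torch.sparse_coo_tensor(
    torch.stack([rows, cols]), values, size=(len(row_lengths), seq_limit)
).to_dense()
# dense is the right-padded matrix shown above:
# [[10, 20,  0,  0, 0],
#  [30,  0,  0,  0, 0],
#  [40, 50, 60,  0, 0]]
```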

To pad the items on the left, I believe we just need to subtract the 2nd column of the sparse matrix indices from the seq_limit, so that it becomes

indices = [[0,4],
           [0,3],
           [1,4],
           [2,4],
           [2,3],
           [2,2]]

From the current implementation in NVTabular, I understand that the _get_indices() method is responsible for returning the indices for each value. So maybe including this code after this line (if padding_direction==True) can do the trick ;)

indices[:,1] = seq_limit - 1 - indices[:,1]

If we currently don't have tests for the dataloader methods that convert the offsets representation to sparse and dense features, it would be good to create such tests, using something similar to what I have described here as a test case.
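Applying that suggested flip to the toy example can be sketched as follows (standalone code, not the actual _get_indices() implementation). Note what it produces: the values land on the left-padded side, but, as lesnikow observes further down the thread, the element order within each row comes out reversed:

```python
import torch

values = torch.tensor([10, 20, 30, 40, 50, 60])
row_lengths = torch.tensor([2, 1, 3])
seq_limit = 5

rows = torch.repeat_interleave(torch.arange(len(row_lengths)), row_lengths)
cols = torch.cat([torch.arange(int(n)) for n in row_lengths])
indices = torch.stack([rows, cols], dim=1)

# The suggested left-padding flip:
indices[:, 1] = seq_limit - 1 - indices[:, 1]

dense = torch.sparse_coo_tensor(
    indices.T, values, size=(len(row_lengths), seq_limit)
).to_dense()
# Padding is now on the left, but each row's elements are reversed:
# [[0, 0,  0, 20, 10],
#  [0, 0,  0,  0, 30],
#  [0, 0, 60, 50, 40]]
```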

lesnikow commented 3 years ago

@gabrielspmoreira This is all very good to know, thank you. @gabrielspmoreira @jperez999 do you have a commit hash that you could point me to so that I can see and review this code change that @gabrielspmoreira mentioned?

lesnikow commented 3 years ago

Following up on my earlier question about where in the codebase the existing right padding can be specified, since I would like to add the sparse_padding_side argument at that point or someplace logically related:

@gabrielspmoreira Would you have any insight or guidance on this? I can see where this can be implemented in the torch dataloader, based on your line reference above, but it sounded to me like there is some user-facing method or API for the dataloaders that you are referencing, which I am having trouble finding. In addition to modifying the torch implementation, that user-facing method signature would need to be updated, and I would need to make sure the setting is handled correctly by both the existing torch and tensorflow dataloaders.

gabrielspmoreira commented 3 years ago

Hey Adam. The user-facing class is the DataLoader. For example, in PyTorch it is the TorchAsyncItr class and in TF it is KerasSequenceLoader. We should create an argument sparse_padding_side, which should accept 'right' (default) or 'left'.

lesnikow commented 3 years ago

@gabrielspmoreira I am wondering: is the current torch implementation at dd9927e sufficiently fast for your intended use case? I have used the approach you suggested for sparse tensors in torch, but I had to modify it with a helper function. Just using the method you outlined would give

[0, 0,  0, 20, 10]
[0, 0,  0,  0, 30]
[0, 0, 60, 50, 40]

instead of the desired

[0, 0,  0, 10, 20]
[0, 0,  0,  0, 30]
[0, 0, 40, 50, 60]

In particular, the elements within each row are in reversed order. I added the _build_sparse_tensor_helper_process_column() method to modify the indices to the correct order. This method does use a Python while loop. The current tensorflow implementation at that referenced commit uses tf operations, resulting in more speed-optimized code.

Hence, is the current torch implementation, with its Python while loop, fast enough for your desired use case?

If not, do you have any ideas or approaches for using built-in torch.sparse class methods, or other methods, to make this more optimized or vectorized?

I have been working on this, but I have not yet found a way. One of the current difficulties is that methods like torch.nn.functional.pad only operate on dense tensors, and only pad constant amounts, whereas we want to pad row-dependent amounts. An approach similar to the current tensorflow implementation would need something analogous to tensorflow's RaggedTensor class and its methods. There is a proposal to add an analogous NestedTensor class to torch, but this is currently only implemented in a separate package, not in torch proper.
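One possible fully vectorized route, sketched here on the same toy data as above (an illustration of the idea, not a drop-in patch for the dataloader): shift each value's column by its own row's pad amount, seq_limit - row_length, which preserves element order and avoids the Python while loop entirely:

```python
import torch

values = torch.tensor([10, 20, 30, 40, 50, 60])
row_lengths = torch.tensor([2, 1, 3])
seq_limit = 5

rows = torch.repeat_interleave(torch.arange(len(row_lengths)), row_lengths)
# Position of each value within its row, computed without a Python loop:
# starts = [0, 2, 3], so cols = arange(6) - starts[rows] = [0, 1, 0, 0, 1, 2]
starts = torch.cat([torch.zeros(1, dtype=torch.long), row_lengths.cumsum(0)[:-1]])
cols = torch.arange(len(values)) - starts[rows]

# Left padding that preserves order: shift each row by its own pad amount.
cols = cols + (seq_limit - row_lengths)[rows]

dense = torch.sparse_coo_tensor(
    torch.stack([rows, cols]), values, size=(len(row_lengths), seq_limit)
).to_dense()
# [[0, 0,  0, 10, 20],
#  [0, 0,  0,  0, 30],
#  [0, 0, 40, 50, 60]]
```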

lesnikow commented 3 years ago

@gabrielspmoreira Could you also clarify whether your intended padding would produce

[0, 0, 0, 10, 20]
[0, 0, 0, 0, 30]
[0, 0, 40, 50, 60]

or

[0, 0, 10, 20, 0]
[0, 0, 30, 0, 0]
[0, 0, 40, 50, 60]

In other words, is this feature for sliding each row all the way to the right side of the matrix, or for padding a constant amount on the left? The latter is significantly easier to implement, from what I have seen so far, than the former.
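To make the two interpretations concrete, both can be derived from the right-padded dense batch of the earlier example (a sketch only; it assumes 0 does not occur as a real feature value, and the constant pad amount k is a hypothetical choice):

```python
import torch
import torch.nn.functional as F

# Right-padded dense batch from the earlier example (seq_limit = 5).
dense = torch.tensor([[10, 20,  0, 0, 0],
                      [30,  0,  0, 0, 0],
                      [40, 50, 60, 0, 0]])
k = 2  # hypothetical constant left-pad amount for the second variant

# Variant 1: slide each row to the right edge (row-dependent shift).
# Counting nonzeros as the row length only works because 0 is the pad value.
lengths = (dense != 0).sum(dim=1)
slid = torch.stack(
    [row.roll(int(dense.size(1) - n)) for row, n in zip(dense, lengths)]
)
# [[0, 0,  0, 10, 20],
#  [0, 0,  0,  0, 30],
#  [0, 0, 40, 50, 60]]

# Variant 2: constant left pad of k zeros, truncating on the right to keep width 5.
padded = F.pad(dense[:, : dense.size(1) - k], (k, 0))
# [[0, 0, 10, 20,  0],
#  [0, 0, 30,  0,  0],
#  [0, 0, 40, 50, 60]]
```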