NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[FEA] Data loader that supports merging an external source at runtime #882

Open vinhngx opened 3 years ago

vinhngx commented 3 years ago

Is your feature request related to a problem? Please describe.

This feature arises in the context of multi-modal data, where, in addition to tabular data, we can extract high-dimensional numeric features from text, images, etc.

As an example, for the MovieLens data, we can extract 3072 dense features from each movie's synopsis and poster.

If we merge these dense features with the rating data (25M rows), the data inflates by roughly three orders of magnitude.
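(Back-of-the-envelope, assuming float32 storage: 25M rows x 3072 features x 4 bytes ≈ 307 GB, versus a few hundred MB for the raw ratings table.)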

Describe the solution you'd like

Ideally, NVTabular should provide functionality to join the dense features at run time, prior to feeding the data to the DL framework.

Describe alternatives you've considered

Additional context

See presentation (internal): https://docs.google.com/presentation/d/1I_VejP9P2aAGLHnwEEV11Ta3AuC4JiUdn2gVIxGo1U4/edit#slide=id.gdbd10dd64a_0_107

karlhigley commented 3 years ago

Could this be accomplished by applying an NVT workflow/transform at data-loading time and using a JoinExternal op to add the dense features to each example?
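For concreteness, a minimal sketch of that idea (file paths and column names are hypothetical; assumes the precomputed dense features sit in a parquet file keyed by movieId):

```python
import cudf
import nvtabular as nvt
from nvtabular.ops import JoinExternal

# Hypothetical: precomputed dense text/image features, one row per movieId.
features_df = cudf.read_parquet("movie_dense_features.parquet")

# Join the dense features onto each example by movieId.
joined = ["userId", "movieId", "rating"] >> JoinExternal(
    features_df,
    on="movieId",
    how="left",
)

workflow = nvt.Workflow(joined)
```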

vinhngx commented 3 years ago

That's what I imagined I could do, but apparently it didn't work. JoinExternal only seems to work offline.

Let me know if you can make it work :)

karlhigley commented 3 years ago

Were you encountering a particular issue or error with JoinExternal at data-loading time? AFAIK, that should work with the current state of NVTabular, so if it doesn't, there may be a bug to resolve.

karlhigley commented 3 years ago

It's not very clearly documented (yet), but just in case, I think the way to do online preprocessing is `KerasSequenceLoader(workflow.transform(dataset), ...)`. As long as the workflow has been fit to the data ahead of time, JoinExternal should work there. If it doesn't, it would be great to know what breaks.
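In other words, something like this (continuing the hypothetical sketch above; the path, batch size, and column names are made up):

```python
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

train_dataset = nvt.Dataset("ratings.parquet")  # hypothetical path

# fit() only computes statistics; no transformed data is written to disk.
workflow.fit(train_dataset)

# transform() is lazy: JoinExternal runs per partition as batches are
# loaded, so the full joined table is never materialized at once.
train_loader = KerasSequenceLoader(
    workflow.transform(train_dataset),
    batch_size=65536,
    label_names=["rating"],
    cat_names=["userId", "movieId"],
    cont_names=[f"emb_{i}" for i in range(3072)],  # hypothetical dense columns
    shuffle=True,
)
```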

vinhngx commented 3 years ago

Yeah, this sounds promising. My problem is that there are features that should be preprocessed offline and dumped to disk, just like a regular workflow, and features that should ideally be merged online at run time.

However, when I put JoinExternal with the feature data in the workflow and call workflow.fit(), it already runs into an OOM error (the joined data would be 25M rows x 3k columns).

Maybe we can get around this with two separate workflows: one offline to preprocess the normal features, and one online to merge the multi-modal features at run time with `KerasSequenceLoader(workflow.transform(dataset), ...)`. A rough sketch of the split is below.
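A hypothetical sketch of that two-workflow split (all paths and names are made up, and it assumes JoinExternal computes no statistics, so the online fit is cheap):

```python
import cudf
import nvtabular as nvt
from nvtabular.ops import Categorify, JoinExternal
from nvtabular.loader.tensorflow import KerasSequenceLoader

# 1) Offline workflow: preprocess the regular tabular features and persist.
#    movieId is passed through un-encoded so it still matches the keys in
#    the external feature table at load time.
offline_graph = ["userId"] >> Categorify()
offline_wf = nvt.Workflow(offline_graph + ["movieId", "rating"])
offline_wf.fit(nvt.Dataset("ratings.parquet"))
offline_wf.transform(nvt.Dataset("ratings.parquet")).to_parquet("preprocessed/")

# 2) Online workflow: merge the dense multi-modal features at load time,
#    one partition at a time, instead of materializing the full join.
features_df = cudf.read_parquet("movie_dense_features.parquet")
online_graph = ["userId", "movieId", "rating"] >> JoinExternal(
    features_df, on="movieId", how="left"
)
online_wf = nvt.Workflow(online_graph)
online_wf.fit(nvt.Dataset("preprocessed/"))  # cheap: JoinExternal needs no stats

train_loader = KerasSequenceLoader(
    online_wf.transform(nvt.Dataset("preprocessed/")),
    batch_size=65536,
    label_names=["rating"],
    cat_names=["userId", "movieId"],
    cont_names=[f"emb_{i}" for i in range(3072)],  # hypothetical dense columns
)
```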

karlhigley commented 3 years ago

Maybe I'm misunderstanding the shape of the data, but it seems like joining the dense features would result in something like 25M rows with 3k columns? If the text/image embeddings were represented as list columns, it seems like the number of columns could be significantly decreased, but I'm not sure if that helps or hurts with respect to the OOM error.
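For example, a hypothetical sketch with pandas (cudf has analogous list-column support):

```python
import numpy as np
import pandas as pd

# Hypothetical: 3072 scalar embedding columns emb_0 ... emb_3071.
emb_cols = [f"emb_{i}" for i in range(3072)]
df = pd.DataFrame(
    np.random.rand(4, len(emb_cols)).astype("float32"), columns=emb_cols
)

# Collapse them into a single list column: one 3072-dim vector per row.
df["dense_embedding"] = df[emb_cols].to_numpy().tolist()
df = df.drop(columns=emb_cols)
```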

karlhigley commented 3 years ago

(This seems conceptually related to #854 and #871. Noting just to link the issues.)

karlhigley commented 3 years ago

Ah, I see from the presentation that it is 25M rows x 3k columns and nearly 400GB, so the OOM makes sense. 👍🏻