Open vinhngx opened 3 years ago
Could this be accomplished by applying an NVT workflow/transform at data-loading time and using a `JoinExternal` op to add the dense features to each example?
That's what I imagined I could do, but apparently it didn't work. `JoinExternal` only works offline, it seems.
Let me know if you can make it work :)
Were you encountering a particular issue or error with `JoinExternal` at data-loading time? AFAIK, that should work with the current state of NVTabular, so if it doesn't, it seems like there may be a bug to resolve.
It's not very clearly documented (yet), but just in case, I think the way to do online pre-processing is `KerasSequenceLoader(workflow.transform(dataset), ...)`. As long as the workflow has been fit to the data ahead of time, `JoinExternal` should work there. If it doesn't, it would be great to know what breaks.
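For concreteness, here's a minimal sketch of that pattern, assuming the DAG-style API; the file names, column names, and the external `movie_features` table are all hypothetical:

```python
import cudf
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Hypothetical external table of precomputed dense features, keyed by movieId.
movie_features = cudf.read_parquet("movie_dense_features.parquet")

# A workflow whose only job is to join the dense features onto each example.
joined = ["userId", "movieId", "rating"] >> nvt.ops.JoinExternal(
    movie_features, on="movieId"
)
workflow = nvt.Workflow(joined)

dataset = nvt.Dataset("ratings.parquet")
workflow.fit(dataset)  # fit ahead of time, as noted above

# The join is then applied lazily, batch by batch, at data-loading time.
loader = KerasSequenceLoader(
    workflow.transform(dataset),
    batch_size=65536,
    label_names=["rating"],
    cat_names=["userId", "movieId"],
    cont_names=[c for c in movie_features.columns if c != "movieId"],
)
```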
Yeah, this sounds promising. My problem is that there are features that should be preprocessed offline and dumped to disk, just like a regular workflow, and features that should, ideally, be merged online at run time.
However, when I do `JoinExternal` with the feature data in the workflow and call `workflow.fit()`, it already runs into an OOM error (the data size will be 25M x 3k rows).
Maybe we can get around this with 2 separate workflows: 1 offline to preprocess the normal features, and 1 online to merge the multi-modal features at run time with `KerasSequenceLoader(workflow.transform(dataset), ...)`.
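Something like this rough sketch, maybe (all file and column names hypothetical). One caveat worth flagging: the join key (`movieId` here) has to pass through the offline workflow unchanged, otherwise the online join won't line up:

```python
import cudf
import nvtabular as nvt
from nvtabular.loader.tensorflow import KerasSequenceLoader

# Workflow 1: offline preprocessing of the regular tabular features,
# written to disk like a normal NVT workflow. movieId is passed through
# untouched so it can still serve as the online join key.
conts = ["timestamp"] >> nvt.ops.Normalize()
offline_wf = nvt.Workflow(conts + ["userId", "movieId", "rating"])
offline_wf.fit(nvt.Dataset("ratings.parquet"))
offline_wf.transform(nvt.Dataset("ratings.parquet")).to_parquet("preprocessed/")

# Workflow 2: online, containing only the multi-modal join.
movie_features = cudf.read_parquet("movie_dense_features.parquet")
online = ["userId", "movieId", "rating", "timestamp"] >> nvt.ops.JoinExternal(
    movie_features, on="movieId"
)
online_wf = nvt.Workflow(online)

dataset = nvt.Dataset("preprocessed/")
online_wf.fit(dataset)  # JoinExternal computes no statistics, so this should be cheap
loader = KerasSequenceLoader(
    online_wf.transform(dataset),
    batch_size=65536,
    label_names=["rating"],
    cat_names=["userId", "movieId"],
    cont_names=["timestamp"]
    + [c for c in movie_features.columns if c != "movieId"],
)
```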
Maybe I'm misunderstanding the shape of the data, but it seems like joining the dense features would result in something like 25M rows with 3k columns? If the text/image embeddings were represented as list columns, it seems like the number of columns could be significantly decreased, but I'm not sure if that helps or hurts with respect to the OOM error.
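A toy `cudf` illustration of the two layouts (names and random data made up; note the total number of values is the same either way, so it may not change the overall footprint):

```python
import cudf
import numpy as np

n_movies, dim = 4, 3072  # toy row count, realistic embedding size

# Wide layout: one scalar column per embedding dimension (3k+ columns).
wide = cudf.DataFrame(
    {"movieId": np.arange(n_movies),
     **{f"emb_{i}": np.random.rand(n_movies).astype("float32") for i in range(dim)}}
)

# List-column layout: two columns, with each row's whole embedding in one cell.
narrow = cudf.DataFrame({"movieId": np.arange(n_movies)})
narrow["embedding"] = cudf.Series(
    [np.random.rand(dim).astype("float32").tolist() for _ in range(n_movies)]
)

print(len(wide.columns), "columns vs", len(narrow.columns))  # 3073 vs 2
```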
(This seems conceptually related to #854 and #871. Noting just to link the issues.)
Ah, I see from the presentation that it is 25M rows x 3k columns and nearly 400GB, so the OOM makes sense. 👍🏻
**Is your feature request related to a problem? Please describe.**
This feature request arises in the context of multi-modal data, where, in addition to tabular data, we can extract high-dimensional numeric features from text, images, etc.
As an example, for the MovieLens data, we can extract 3072 dense features from each movie's synopsis and poster.
If we merge these dense features with the rating data (25M rows), that inflates the data by 3 orders of magnitude.
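Back-of-the-envelope arithmetic (assuming ~4 raw rating columns and float32 values throughout; exact figures depend on dtypes and storage overhead):

```python
# ~25M ratings, 3072 dense features per example, 4 bytes per float32 value.
rows, dense_dim, b = 25_000_000, 3072, 4
raw_gb = rows * 4 * b / 1e9                   # ratings alone: ~0.4 GB
joined_gb = rows * (4 + dense_dim) * b / 1e9  # after the join: ~307.6 GB
print(f"{raw_gb:.1f} GB -> {joined_gb:.1f} GB (~{joined_gb / raw_gb:.0f}x)")
# 0.4 GB -> 307.6 GB, i.e. roughly 3 orders of magnitude
```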
**Describe the solution you'd like**
Ideally, NVTabular should provide functionality to join the dense features at run time, prior to feeding the data to the DL framework.
**Describe alternatives you've considered**
**Additional context**
See presentation (internal): https://docs.google.com/presentation/d/1I_VejP9P2aAGLHnwEEV11Ta3AuC4JiUdn2gVIxGo1U4/edit#slide=id.gdbd10dd64a_0_107