NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.

https://nvidia-merlin.github.io/Transformers4Rec/main

Apache License 2.0


[QST] Tags.TEXT preprocessing and model input #718

Closed MatthiasEg closed 1 year ago

MatthiasEg commented 1 year ago

❓ Questions & Help

Details

Hello everybody,

I'm trying to model sequential data with various properties, one of which is a text field (1-10 words). What is the intended process for including such text fields with Transformers4Rec? I have seen that there is a schema tag (Tags.TEXT), but on its own it is of no use, as cuDF apparently does not support string-based data at this time. Should the text therefore be tokenized in advance, with Tags.TOKENIZED added as well?

Any help is greatly appreciated!

rnyak commented 1 year ago

@MatthiasEg we don't use Tags.TEXT. If you want to feed text features to a TF4Rec model, you first need to convert them to a numerical representation (embeddings would be best) using BERT, GPT2, or whatever model you want to use, and then feed them to the model as pre-trained embeddings. You can check out this unit test example.

Please note that cuDF DOES support string-based data, but how well that works depends on how long your strings are and how large your dataset is.
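To make the "convert text to embeddings first" step concrete, here is a minimal sketch of the preprocessing side only. It is not the Transformers4Rec API and not the unit test mentioned above. The `embed_text` function is a hypothetical stand-in (a hashed bag-of-words vector) for a real encoder such as BERT, and `item_texts`, `embedding_table`, `id_to_row`, and `lookup_session` are illustrative names. The point is the shape of the result: one fixed-size vector per item id, stored in a table that a session of item ids can index into.

```python
import hashlib
import math

EMBEDDING_DIM = 8  # assumption for illustration; a real encoder would give e.g. 768


def embed_text(text, dim=EMBEDDING_DIM):
    """Hypothetical stand-in for a pretrained text encoder.

    Builds a hashed bag-of-words vector and L2-normalises it.
    Replace this with BERT/GPT2 sentence embeddings in practice.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # md5 gives a hash that is stable across Python processes
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Hypothetical catalogue: item id -> short text field (1-10 words)
item_texts = {
    1: "red running shoes",
    2: "blue running shoes",
    3: "wireless headphones",
}

# Row 0 is reserved for padding / unknown ids, so it stays all zeros.
embedding_table = [[0.0] * EMBEDDING_DIM]
id_to_row = {}
for item_id, text in sorted(item_texts.items()):
    id_to_row[item_id] = len(embedding_table)
    embedding_table.append(embed_text(text))


def lookup_session(item_ids):
    """Map a session (sequence of item ids) to its sequence of text embeddings."""
    return [embedding_table[id_to_row.get(i, 0)] for i in item_ids]


session_vectors = lookup_session([1, 3, 2])
```

The resulting table (here a plain list of lists) would typically be saved as an array and handed to the model as pre-trained embeddings keyed by item id, so that no string column ever reaches the dataloader.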

rnyak commented 1 year ago

@MatthiasEg I am closing this ticket due to low activity. Please reopen if needed.