UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.09k stars 2.46k forks

Is there a simple way to input additional features #990

Open Punchwes opened 3 years ago

Punchwes commented 3 years ago

Hi, thanks for sharing this library. In my scenario, besides the usual BERT inputs (input_ids, attention_mask, token_type_ids), I have an additional feature dict that looks like: {'new_feature': [1,1,1,1,2,3,4,5]}. So the input in my case would be: {'input_ids': [x,x,x,x,x,x], 'attention_mask': [x,x,x,x,x,x], 'token_type_ids': [x,x,x,x,x,x], 'new_feature': [x,x,x,x,x,x]}

The most straightforward way I can think of is to modify the dataloader, but the whole model pipeline seems to accept only texts, so I would also need to modify the input to the pipeline, which seems quite complex. The other option is to extract these features from the input text inside the tokenize() function and add them to its output, but that might make the whole process very slow. Is there a simple, straightforward way to achieve this, from your perspective?

Best

nreimers commented 3 years ago

Hi @Punchwes, yes, you would need to modify the whole pipeline so that your input feature can be used.

If your features are discrete and take only a small number of values, you can add them as text to your input.

E.g., you have the features:
[guest] vs [user]
[male] vs [female] vs [unknown]
[america] vs [europe] vs [asia]

Then your input can look like:
[guest] [male] [america] I love this song!
[user] [unknown] [asia] Me too, this song is great
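
A minimal sketch of the tagging idea above, in plain Python. The feature names (`account`, `gender`, `region`), the `ALLOWED_VALUES` table, and the helper `add_feature_tags` are hypothetical and not part of sentence-transformers; they only illustrate how discrete features could be turned into a text prefix before encoding.

```python
from typing import Dict, List

# Hypothetical feature vocabulary, mirroring the example values above.
# The feature names are assumptions made for illustration only.
ALLOWED_VALUES: Dict[str, List[str]] = {
    "account": ["guest", "user"],
    "gender": ["male", "female", "unknown"],
    "region": ["america", "europe", "asia"],
}


def add_feature_tags(text: str, features: Dict[str, str]) -> str:
    """Prefix `text` with one bracketed tag per feature, in a fixed order."""
    tags = []
    for name, allowed in ALLOWED_VALUES.items():
        value = features[name]
        if value not in allowed:
            raise ValueError(f"unexpected value {value!r} for feature {name!r}")
        tags.append(f"[{value}]")
    return " ".join(tags + [text])


sentences = [
    add_feature_tags(
        "I love this song!",
        {"account": "guest", "gender": "male", "region": "america"},
    ),
    add_feature_tags(
        "Me too, this song is great",
        {"account": "user", "gender": "unknown", "region": "asia"},
    ),
]

for sentence in sentences:
    print(sentence)
```

The resulting strings can then be passed to `SentenceTransformer.encode` like any other text. Unless the tags are added to the tokenizer's vocabulary, they will probably be split into several word pieces, which may or may not matter in practice.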