AI21Labs / sense-bert

This is the code for loading the SenseBERT model, described in our paper from ACL 2020.
Apache License 2.0

Aligning tokens with supersenses? #4

Open victoryhb opened 4 years ago

victoryhb commented 4 years ago

Thank you very much for sharing the code for your excellent paper. Pardon me for asking a newbie question: how does one align the tokens in the input sentence with the supersenses output by the model? For example, the words in the sentence "I went to the store to buy some groceries." do not appear to be aligned with the following senses:

['noun.person']
['verb.communication']
['verb.social']
['verb.communication']
['noun.artifact']
['noun.artifact']
['verb.communication']
['verb.cognition']
['noun.artifact']
['noun.artifact']
['adv.all']
['adv.all']

as printed using the following code:

# prints one predicted supersense per input token (including sub-tokens and special tokens)
for i, id_ in enumerate(input_ids[0]):
    print(sensebert_model.tokenizer.convert_ids_to_senses([np.argmax(supersense_logits[0][i])]))

Could you please provide some example code for how to do this properly? Thanks a lot in advance!

MeMartijn commented 3 years ago

@victoryhb This might be a long shot, but I was wondering whether you figured this out in the end. I also can't seem to work out how to align the tokens.

MeMartijn commented 3 years ago

@oriram Do you have any hints on how to align the predicted senses to words in sentences?

oriram commented 3 years ago

Hi @MeMartijn,
There is no clear "alignment", as out-of-vocabulary words are split into multiple tokens (and can therefore have multiple supersenses). However, you can do one of the following:

Hope this helps, Ori
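For readers landing here later: one common way to get per-word senses from per-sub-token predictions (my own assumption about a reasonable approach, not necessarily what the authors intended) is to merge WordPiece "##"-continuation sub-tokens back into words and keep the prediction of the first sub-token of each word. A minimal self-contained sketch in plain Python, with illustrative token and sense values rather than actual SenseBERT output:

```python
# Sketch: align per-sub-token supersense predictions to whole words by
# grouping WordPiece sub-tokens ("##" continuations) and keeping the
# prediction of each word's first sub-token. The tokens/senses below are
# illustrative examples, not actual SenseBERT output.

def align_senses(tokens, senses):
    """Merge '##'-continuation sub-tokens into words; keep the first sub-token's sense."""
    words, word_senses = [], []
    for tok, sense in zip(tokens, senses):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]          # continuation: extend the current word
        else:
            words.append(tok)             # new word: record its first sub-token's sense
            word_senses.append(sense)
    return list(zip(words, word_senses))

tokens = ["gro", "##cer", "##ies", "store"]
senses = ["noun.food", "noun.food", "noun.artifact", "noun.artifact"]
print(align_senses(tokens, senses))
# → [('groceries', 'noun.food'), ('store', 'noun.artifact')]
```

In practice you would also drop special tokens such as [CLS] and [SEP] before grouping, which is likely why the list of senses printed above is longer than the sentence's word count. Taking the first sub-token's sense is just one heuristic; a majority vote over a word's sub-token predictions is another option.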