UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

[question] How to tag document for clustering ? #1439

Open valibus opened 2 years ago

valibus commented 2 years ago

Hi, thank you for all the work you're doing!

I was wondering whether there is a way to tag the corpus elements that we add. Instead of:

corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]

I would like to have, for example, an ID for each element:

corpus = [['ID1','A man is eating food.'],
          ['ID2','A man is eating a piece of bread.'],
          ['ID3','A man is eating pasta.'],
          ['ID4','The girl is carrying a baby.'],
          ['ID5','The baby is carried by the woman'],
          ['ID6','A man is riding a horse.'],
          ['ID7','A man is riding a white horse on an enclosed ground.'],
          ['ID8','A monkey is playing drums.'],
          ['ID9','Someone in a gorilla costume is playing a set of drums.'],
          ['ID10','A cheetah is running behind its prey.'],
          ['ID11','A cheetah chases prey on across a field.']
          ]

The main purpose is for the model to return all the IDs in a cluster rather than the whole documents. (I tried it with a hundred documents of 400 words each, and the result is not easy to read.)

Can you suggest a simple way to do this, please?

The best I have obtained so far is this:

Cluster  1
[['ID1', 'A man is eating food.'], ['ID2', 'A man is eating a piece of bread.'], ['ID3', 'A man is eating pasta.']]

Cluster  4
[['ID4', 'The girl is carrying a baby.'], ['ID5', 'The baby is carried by the woman']]

Cluster  5
[['ID6', 'A man is riding a horse.'], ['ID7', 'A man is riding a white horse on an enclosed ground.']]

Cluster  2
[['ID8', 'A monkey is playing drums.'], ['ID9', 'Someone in a gorilla costume is playing a set of drums.']]

Cluster  3
[['ID10', 'A cheetah is running behind its prey.'], ['ID11', 'A cheetah chases prey on across a field.']]

But I'm not sure whether the IDs and the other special array characters are left out of the text that gets processed.

Hope you can help :)

nreimers commented 2 years ago

If you add the ID like this, it will be taken into the computation of the embedding.
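One way to keep the IDs out of the embedding is to store them in a list parallel to the texts and pass only the texts to the model. The following is a minimal sketch of that split, not something taken from the library itself; the names `ids` and `texts` are illustrative, and the `model.encode(texts)` call in the comment assumes a `SentenceTransformer` model has been loaded elsewhere.

```python
# Corpus of (id, text) pairs, shortened from the question above.
corpus = [
    ("ID1", "A man is eating food."),
    ("ID2", "A man is eating a piece of bread."),
    ("ID3", "A man is eating pasta."),
    ("ID4", "The girl is carrying a baby."),
    ("ID5", "The baby is carried by the woman"),
]

# Split the pairs into two parallel lists: position i in `ids`
# always refers to the same document as position i in `texts`.
ids = [doc_id for doc_id, _ in corpus]
texts = [text for _, text in corpus]

# Only `texts` would be embedded, e.g. embeddings = model.encode(texts),
# so the ID strings never influence the vectors.
print(ids)
print(texts[0])
```

Because the two lists share the same order, any index the clustering step reports for `texts` can be looked up directly in `ids`.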

The respective methods for clustering return the indices for the passed list. This tells you which sentences are in which clusters.
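As a hedged illustration of mapping those indices back to IDs, the sketch below covers two output shapes a clustering step might produce: one label per input position (the shape of the `labels_` attribute in scikit-learn's clustering estimators), and a list of index groups (the shape returned by `util.community_detection` in sentence-transformers). The thread does not say which method was meant, so both helper functions are hypothetical, and the labels are hard-coded to match the cluster numbers shown in the question rather than computed from a model.

```python
from collections import defaultdict

def ids_by_cluster(ids, labels):
    """Group IDs by cluster when labels[i] is the cluster of ids[i]."""
    clusters = defaultdict(list)
    for doc_id, label in zip(ids, labels):
        clusters[int(label)].append(doc_id)
    return dict(clusters)

def ids_from_index_groups(ids, groups):
    """Translate groups of list indices into groups of IDs."""
    return [[ids[i] for i in group] for group in groups]

ids = [f"ID{n}" for n in range(1, 12)]  # "ID1" ... "ID11"

# Illustrative labels copied from the cluster numbers in the question;
# in practice they would come from a clustering run on the embeddings.
labels = [1, 1, 1, 4, 4, 5, 5, 2, 2, 3, 3]

clusters = ids_by_cluster(ids, labels)
for cluster_id in sorted(clusters):
    print("Cluster", cluster_id, clusters[cluster_id])

# Same idea when the method returns lists of indices instead of labels.
print(ids_from_index_groups(ids, [[0, 1, 2], [9, 10]]))
```

Either way, the texts are embedded without their IDs, and the IDs are recovered afterwards purely by position.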

valibus commented 2 years ago

> The respective methods for clustering return the indices for the passed list. This tells you which sentences are in which clusters.

OK, which method do you suggest I use, please? I don't understand which ones you are referring to by "respective methods".

Best regards