MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Semantic Sentence Tokenization #1936

Open TheAIMagics opened 3 months ago

TheAIMagics commented 3 months ago

I'm working with a corpus that primarily consists of longer documents. I'm seeking recommendations for the most effective approach to semantically tokenize them.

Examples:

Original Text: "I like the ambiance but the food was terrible."
Desired Output: ["I like the ambiance"] ["but the food was terrible."]

Original Text: "I don't know. I like the restaurant but not the food."
Desired Output: ["I don't know."] ["I like the restaurant"] ["but not the food."]

Any suggestions or advice on how to achieve this would be greatly appreciated!

MaartenGr commented 2 months ago

Hi! I might be mistaken, but I don't believe there is a commonly used technique for this kind of semantic sentence tokenization, since how the original text should be separated depends heavily on the abstraction level of the semantic separation. There are small tricks, though, such as using conjunctions and sentence splitters to create candidate splits and then using embeddings to model their potential differences.

For instance, you could split the input with a sentence splitter and then further split each sentence wherever it contains a conjunction. The resulting candidate phrases/sentences are embedded using any embedding technique. Finally, sequential candidate phrases are merged if they are similar enough (above a user-specified threshold).

It's not perfect but the general principle (at least in my head) seems like it might actually work.
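A rough sketch of these steps, assuming a regex-based sentence splitter, an arbitrary conjunction list and threshold, and a toy bag-of-words embedder as a stand-in for a real sentence embedder (e.g., one from sentence-transformers):

```python
import math
import re
from collections import Counter

# Assumption: a small, hand-picked conjunction list; extend as needed.
CONJUNCTIONS = {"but", "and", "or", "yet", "so"}

def candidate_splits(text):
    """Split text into sentences, then split each sentence before a conjunction."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = []
    for sent in sentences:
        current = []
        for word in sent.split():
            # Start a new candidate when we hit a conjunction mid-sentence.
            if word.lower().strip(",;") in CONJUNCTIONS and current:
                candidates.append(" ".join(current))
                current = []
            current.append(word)
        if current:
            candidates.append(" ".join(current))
    return candidates

def bow_embed(text):
    """Toy bag-of-words 'embedding'; swap in a real sentence embedder here."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def merge_similar(candidates, threshold=0.5, embed=bow_embed):
    """Merge sequential candidates whose embeddings exceed the similarity threshold."""
    if not candidates:
        return []
    merged = [candidates[0]]
    for cand in candidates[1:]:
        if cosine(embed(merged[-1]), embed(cand)) >= threshold:
            merged[-1] = merged[-1] + " " + cand
        else:
            merged.append(cand)
    return merged

print(merge_similar(candidate_splits("I like the ambiance but the food was terrible.")))
```

With a real embedding model the cosine would be computed on dense vectors instead, and the threshold would need tuning per corpus.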