MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Semantic Sentence Tokenization #1936

Open TheAIMagics opened 3 months ago

TheAIMagics commented 3 months ago

I'm working with a corpus that primarily consists of longer documents. I'm seeking recommendations for the most effective approach to semantically tokenize them.

Examples:

Original Text: "I like the ambiance but the food was terrible."
Desired Output: ["I like the ambiance"] ["but the food was terrible."]

Original Text: "I don't know. I like the restaurant but not the food."
Desired Output: ["I don't know."] ["I like the restaurant"] ["but not the food."]

Any suggestions or advice on how to achieve this would be greatly appreciated!

MaartenGr commented 2 months ago

Hi! I might be mistaken, but I don't believe there is a commonly used technique for this kind of semantic sentence tokenization, since how the original text should be separated depends heavily on the abstraction level of the semantic separation. There are small tricks, though, such as using conjunctions and sentence splitters to create candidate splits and then using embeddings to model their potential differences.

For instance, you could split the input with a sentence splitter and then further split each sentence wherever it contains a conjunction. The resulting candidate phrases/sentences are embedded using any embedding technique. Finally, sequential candidate phrases are merged if they are similar enough (above a user-specified threshold).

It's not perfect but the general principle (at least in my head) seems like it might actually work.
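A rough sketch of these steps, assuming a regex-based sentence splitter, an arbitrary conjunction list and threshold, and a toy bag-of-words embedder as a stand-in for a real sentence embedder (e.g., one from sentence-transformers):

```python
import math
import re
from collections import Counter

# Assumption: a small, hand-picked conjunction list; extend as needed.
CONJUNCTIONS = {"but", "and", "or", "yet", "so"}

def candidate_splits(text):
    """Split text into sentences, then split each sentence before a conjunction."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = []
    for sent in sentences:
        current = []
        for word in sent.split():
            # Start a new candidate when we hit a conjunction mid-sentence.
            if word.lower().strip(",;") in CONJUNCTIONS and current:
                candidates.append(" ".join(current))
                current = []
            current.append(word)
        if current:
            candidates.append(" ".join(current))
    return candidates

def bow_embed(text):
    """Toy bag-of-words 'embedding'; swap in a real sentence embedder here."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def merge_similar(candidates, threshold=0.5, embed=bow_embed):
    """Merge sequential candidates whose embeddings exceed the similarity threshold."""
    if not candidates:
        return []
    merged = [candidates[0]]
    for cand in candidates[1:]:
        if cosine(embed(merged[-1]), embed(cand)) >= threshold:
            merged[-1] = merged[-1] + " " + cand
        else:
            merged.append(cand)
    return merged

print(merge_similar(candidate_splits("I like the ambiance but the food was terrible.")))
```

With a real embedding model the cosine would be computed on dense vectors instead, and the threshold would need tuning per corpus.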