Open TheAIMagics opened 3 months ago
Hi! I might be mistaken, but I don't believe there is a commonly used technique for this kind of semantic sentence tokenization, since how the original text should be split depends heavily on the abstraction level of the semantic separation you want. There are small tricks, though, like using conjunctions and sentence splitters to create candidate splits and then using embeddings to model the differences between them.
For instance, you could split the input with a sentence splitter and then further split each sentence wherever it contains a conjunction. The resulting candidate phrases/sentences are then embedded using any embedding technique. Finally, consecutive candidates are merged if they are similar enough (above a user-specified threshold).
It's not perfect, but the general principle (at least in my head) seems like it could actually work.
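To make the idea concrete, here's a minimal sketch of the split-then-merge pipeline described above. Everything here is an assumption on my part: the conjunction list, the naive regex sentence splitter, the 0.5 threshold, and especially the `embed` function, which is just a hashed bag-of-words placeholder — you'd swap in a real sentence-embedding model in practice.

```python
import re
import zlib
import numpy as np

# Hypothetical conjunction list; extend as needed for your corpus.
CONJUNCTION_RE = r"\b(?:and|but|or|so|because|although|while)\b"

def split_candidates(text):
    """Split text into candidate phrases: sentences first, then at conjunctions."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = []
    for sentence in sentences:
        parts = re.split(CONJUNCTION_RE, sentence)
        candidates.extend(p.strip(" ,;") for p in parts if p.strip(" ,;"))
    return candidates

def embed(text, dim=64):
    """Stand-in embedding: hashed bag-of-words, L2-normalised.
    Replace with any real sentence-embedding model."""
    vec = np.zeros(dim)
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def merge_similar(candidates, threshold=0.5):
    """Greedily merge consecutive candidates whose cosine similarity
    meets the user-specified threshold."""
    if not candidates:
        return []
    merged = [candidates[0]]
    prev = embed(candidates[0])
    for cand in candidates[1:]:
        cur = embed(cand)
        # Vectors are unit-length, so the dot product is cosine similarity.
        if float(prev @ cur) >= threshold:
            merged[-1] = merged[-1] + " " + cand
            prev = embed(merged[-1])  # re-embed the grown chunk
        else:
            merged.append(cand)
            prev = cur
    return merged
```

One design note: merging greedily left-to-right and re-embedding the grown chunk means a chunk's "meaning" drifts as it absorbs phrases, which is usually what you want here, but a smarter variant could compare each candidate against a windowed average instead.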
I'm working with a corpus that primarily consists of longer documents, and I'm looking for recommendations on the most effective approach to tokenizing them semantically.
Examples:
Any suggestions or advice on how to achieve this would be greatly appreciated!