diasks2 / pragmatic_segmenter

Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
MIT License
549 stars, 55 forks

Abbreviations at end of sentences + unknown abbreviations #77

Closed Lightgreen40 closed 9 months ago

Lightgreen40 commented 1 year ago

Hi,

I really appreciate all the work you have put into this segmenter. But I wonder whether a purely "programmatic" approach can ever achieve satisfactory results. As you can see below, the score is quite low when an abbreviation occurs at the end of a sentence, or when an abbreviation is missing from the "abbreviation list" (see the GERMAN test below). I have the feeling that one would need a well-trained neural network to figure this out, since this is more of a "linguistic" problem. What I mean is that a human has no trouble recognizing sentence boundaries, but only if that person understands the language! So I guess one has to train a machine to actually "understand" the language as well.
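To make the two failure modes concrete, here is a minimal toy splitter in Python. It is not how Pragmatic Segmenter works internally; `naive_split` and its one-entry abbreviation set are hypothetical, chosen only to illustrate why a list-based rule struggles in both directions. The two inputs are taken from the tests in this issue.

```python
import re


def naive_split(text: str, abbreviations: set) -> list:
    """Toy rule: split after '.', '!' or '?' followed by whitespace,
    unless the token ending there is a listed abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1]
        if last_token.lower() in abbreviations:
            # Listed abbreviation: assume the period does not end the sentence.
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


known = {"jr."}  # hypothetical abbreviation list

# Failure 1: a listed abbreviation that really does end the sentence.
english = naive_split(
    "It was, of course, Henry Fredrik Jr. Then we also have numericals, "
    "such as 12.543, among others!",
    known,
)

# Failure 2: an abbreviation ("Vj.") that is not in the list.
german = naive_split(
    "Im Gegensatz zum Vj. kann sich dieser Hj.-Abschluss sehen lassen!",
    known,
)

print(len(english))  # 1 (two sentences merged into one)
print(len(german))   # 2 (one sentence split in two)
```

With this rule, adding an abbreviation to the list fixes the second case but makes the first one more likely, which is the trade-off described above.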

ENGLISH:

RAW INPUT IS:
This is a sentence. And another sentence, or is it? I haven't read D. H. Lawrence's Lady Chatterley, but intend to do so. Who recommended this novel to me? It was, of course, Henry Fredrik Jr. Then we also have numericals, such as 12.543, among others! Finally, some nice and common abbreviations, such as e.g. and also i.e. But see especially the systematic Golden Rules collection provided by PragmaticSegmenter!

OUTPUT SEGMENTED BY PRAGMATIC SEGMENTER.NET BE LIKE:
[ This is a sentence. ]
[ And another sentence, or is it? ]
[ I haven't read D. H. Lawrence's Lady Chatterley, but intend to do so. ]
[ Who recommended this novel to me? ]
[ It was, of course, Henry Fredrik Jr. ]
[ Then we also have numericals, such as 12.543, among others! ]
[ Finally, some nice and common abbreviations, such as e.g. and also i.e. But see especially the systematic Golden Rules collection provided by PragmaticSegmenter! ]

VERDICT: Correct sentence count is 8 sentences. Pragmatic segmenter segmented into 7 sentences. Recognition rate is therefore (the closer to 1, the better): 0.875

GERMAN:

RAW INPUT IS:
Im Gegensatz zum Vj. kann sich dieser Hj.-Abschluss sehen lassen! Nun denn, ich treffe heute John Fredrik Jr. Mal sehen, ob er Abkürzungen wie z.B., z. B. und i.S.v. kennt. Ach ja, hab ich schon John Fredrik Jr. erwähnt gehabt? Haha! Häh?

OUTPUT SEGMENTED BY PRAGMATIC SEGMENTER.NET BE LIKE:
[ Im Gegensatz zum Vj. ]
[ kann sich dieser Hj. ]
[ -Abschluss sehen lassen! ]
[ Nun denn, ich treffe heute John Fredrik Jr. ]
[ Mal sehen, ob er Abkürzungen wie z.B., z. B. und i.S.v. kennt. ]
[ Ach ja, hab ich schon John Fredrik Jr. ]
[ erwähnt gehabt? ]
[ Haha! ]
[ Häh? ]

VERDICT: Correct sentence count is 6 sentences. Pragmatic segmenter segmented into 9 sentences. Recognition rate is therefore (the closer to 1, the better): 0.6666667
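
For reference, both recognition rates above are consistent with dividing the smaller sentence count by the larger one. The exact formula is not stated in this issue, so the sketch below is an assumption inferred from the two reported figures, and `recognition_rate` is a hypothetical helper name.

```python
def recognition_rate(expected: int, produced: int) -> float:
    """Ratio of the smaller sentence count to the larger one (1.0 means equal counts).

    Assumed formula: it reproduces the two figures reported in this issue.
    """
    if expected <= 0 or produced <= 0:
        raise ValueError("sentence counts must be positive")
    return min(expected, produced) / max(expected, produced)


english_rate = recognition_rate(expected=8, produced=7)
german_rate = recognition_rate(expected=6, produced=9)

print(round(english_rate, 3))  # 0.875
print(round(german_rate, 7))   # 0.6666667
```

Note that a count-based rate like this only compares how many segments were produced; it does not check whether each boundary falls in the right place.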