aigents / aigents-java

Aigents Java Core Platform
MIT License

Text fragmentation/segmentation based on formal grammar #34

Open akolonin opened 4 years ago

akolonin commented 4 years ago

Based on the progress with issue #22, we want to use the formal grammar to identify sentence boundaries in token (word) streams in two cases:

  1. When the token (word) stream is provided by the speech recognition engine.
  2. When the token (word) stream is provided by the HTML stripper applied to HTML texts where the natural language sentences are split not by conventional periods, exclamation marks, and question marks, but by some weird HTML tags with custom styles applied to them.

The solution would have at least two applications:

  A) Split the stream of tokens/words into sentences for further linguistic processing such as parsing and entity extraction.
  B) Split the stream of tokens/words into sentences for selecting the "featured" sentences containing some "hot" keywords for summarization purposes.
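To make application B concrete, here is a minimal sketch of how detected boundaries could feed "featured" sentence selection. It assumes the segmenter returns boundaries as ascending token indices marking where each sentence ends; the class `FeaturedSentences` and its methods are hypothetical names for illustration, not part of the Aigents code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of aigents-java: splits a token stream at
// given boundaries and keeps the sentences that contain a "hot" keyword.
public class FeaturedSentences {

    /** Splits tokens into sentences; each boundary is the exclusive end index of a sentence. */
    public static List<List<String>> split(List<String> tokens, List<Integer> boundaries) {
        List<List<String>> sentences = new ArrayList<>();
        int start = 0;
        for (int end : boundaries) { // assumption: boundaries are sorted ascending
            if (end > start && end <= tokens.size()) {
                sentences.add(new ArrayList<>(tokens.subList(start, end)));
                start = end;
            }
        }
        if (start < tokens.size()) { // trailing tokens form the last sentence
            sentences.add(new ArrayList<>(tokens.subList(start, tokens.size())));
        }
        return sentences;
    }

    /** Counts tokens of the sentence that appear in the (lowercase) keyword set. */
    public static int keywordHits(List<String> sentence, Set<String> hotKeywords) {
        int hits = 0;
        for (String token : sentence) {
            if (hotKeywords.contains(token.toLowerCase(Locale.ROOT))) hits++;
        }
        return hits;
    }

    /** Returns, in original order, the sentences with at least one hot keyword. */
    public static List<List<String>> featured(List<List<String>> sentences, Set<String> hotKeywords) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> sentence : sentences) {
            if (keywordHits(sentence, hotKeywords) > 0) result.add(sentence);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "the", "sun", "rose", "birds", "sang", "loudly", "the", "market", "crashed");
        List<Integer> boundaries = Arrays.asList(3, 6); // as if produced by a segmenter
        Set<String> hot = new HashSet<>(Arrays.asList("sun", "market"));
        System.out.println(featured(split(tokens, boundaries), hot));
    }
}
```

Keeping the split and the keyword filter separate means the same `split` output could also be handed to a parser or entity extractor for application A.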

Initial progress has been made with https://github.com/aigents/aigents-java-nlp/blob/master/src/main/java/org/aigents/nlp/gen/Segment.java in https://github.com/aigents/aigents-java-nlp/pull/11

Still, there is more work to do to improve the accuracy.

For testing purposes, we can use (for example) the SingularityNET extract from the Gutenberg Children corpus used in the Unsupervised Language Learning project, starting from the "cleaned" corpus: http://langlearn.singularitynet.io/data/cleaned/English/Gutenberg-Children-Books/capital/ The procedure would be:

  1. Create an "extra-cleaned" corpus by removing all sentences with quotes and brackets like [ ] ( ) { } ' " and all sentences with inner periods like "CHAPTER I. A NEW DEPARTURE".
  2. Glue the remaining sentences together on a per-file or per-chapter basis.
  3. Evaluate the accuracy based on the number of correctly identified sentence boundaries.
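The test procedure above can be sketched roughly as follows. This is only an illustration under stated assumptions: the input is taken to be one sentence per string, terminal punctuation is stripped and tokens are lowercased when gluing (so the stream carries no boundary hints), and "accuracy" is reported as precision, recall, and F1 over boundary positions, which is one possible reading of "number of correctly identified sentence boundaries". The class `SegmentationEval` and its methods are hypothetical names, not part of `Segment.java`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical evaluation helper, not part of aigents-java-nlp.
public class SegmentationEval {

    // Quotes and brackets that disqualify a sentence from the "extra-cleaned" corpus.
    private static final Pattern QUOTE_OR_BRACKET = Pattern.compile("[\\[\\](){}'\"]");

    /** Glued token stream plus the gold boundary positions (exclusive sentence end indices). */
    public static final class Glued {
        public final List<String> tokens = new ArrayList<>();
        public final Set<Integer> boundaries = new LinkedHashSet<>();
    }

    /** Keeps only sentences with no quotes, no brackets, and no inner periods. */
    public static List<String> extraClean(List<String> sentences) {
        List<String> kept = new ArrayList<>();
        for (String s : sentences) {
            String t = s.trim();
            if (t.isEmpty() || QUOTE_OR_BRACKET.matcher(t).find()) continue;
            int dot = t.indexOf('.');
            if (dot >= 0 && dot < t.length() - 1) continue; // period before the last character
            kept.add(t);
        }
        return kept;
    }

    /** Glues sentences into one token stream and records where each sentence ended. */
    public static Glued glue(List<String> sentences) {
        Glued g = new Glued();
        for (String s : sentences) {
            // Assumption: drop terminal punctuation and case so no boundary hints remain.
            String body = s.replaceAll("[.!?]+$", "").trim();
            if (body.isEmpty()) continue;
            for (String token : body.split("\\s+")) {
                g.tokens.add(token.toLowerCase(Locale.ROOT));
            }
            g.boundaries.add(g.tokens.size());
        }
        g.boundaries.remove(g.tokens.size()); // the end of the stream is a trivial boundary
        return g;
    }

    /** Returns {precision, recall, F1} of predicted boundaries against gold boundaries. */
    public static double[] score(Set<Integer> gold, Set<Integer> predicted) {
        int hits = 0;
        for (Integer p : predicted) {
            if (gold.contains(p)) hits++;
        }
        double precision = predicted.isEmpty() ? 0.0 : (double) hits / predicted.size();
        double recall = gold.isEmpty() ? 0.0 : (double) hits / gold.size();
        double f1 = (precision + recall == 0.0) ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "The sun rose.",
            "CHAPTER I. A NEW DEPARTURE", // dropped: inner period
            "He said \"go\".",             // dropped: quotes
            "Birds sang loudly!");
        Glued g = glue(extraClean(raw));
        // Stand-in for the segmenter output: boundaries after tokens 3 and 5.
        Set<Integer> predicted = new LinkedHashSet<>(Arrays.asList(3, 5));
        double[] prf = score(g.boundaries, predicted);
        System.out.printf(Locale.ROOT, "tokens=%s gold=%s P=%.2f R=%.2f F1=%.2f%n",
            g.tokens, g.boundaries, prf[0], prf[1], prf[2]);
    }
}
```

Reporting precision and recall separately, rather than a single accuracy figure, would show whether a segmenter tends to over-split or under-split, which may matter when comparing against baselines from other authors.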

Any alternative corpora for testing against any baseline results achieved by any other authors may be considered as well.

References:

  1. https://www.researchgate.net/publication/321227216_Text_Segmentation_Techniques_A_Critical_Review
  2. https://www.google.com/search?q=natural+language+segmentation%20papers