bigwhoop / sentence-breaker

Sentence boundary disambiguation (SBD) - or sentence breaking - library written in PHP.
MIT License

Is it possible to extend custom token? #11

Open event15 opened 3 years ago

event15 commented 3 years ago

I need to treat some HTML tags (in my case br and hr) as period equivalents. Is it possible to extend the Lexer from outside this library's code?

event15 commented 3 years ago

I was able to add a new T_PERIOD_BR token in my project, but the library is very limited in terms of extensibility.

I created a BrPeriodToken class in which BrPeriodToken::getPrintableValue returns <br>.

final class BrPeriodToken extends ValueToken
{
    public function getName(): string
    {
        return 'T_PERIOD_BR';
    }
}
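For illustration, here is a self-contained sketch of the token with the getPrintableValue override mentioned above. The ValueToken base class shown here is a hypothetical stand-in (its constructor and fields are assumptions), included only so the snippet runs without the library; in a real project the class would extend the library's own ValueToken instead.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the library's ValueToken (assumed shape, not the real class).
abstract class ValueToken
{
    private string $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    abstract public function getName(): string;

    public function getPrintableValue(): string
    {
        return $this->value;
    }
}

final class BrPeriodToken extends ValueToken
{
    public function getName(): string
    {
        return 'T_PERIOD_BR';
    }

    // Always print the canonical tag, however it was written in the input.
    public function getPrintableValue(): string
    {
        return '<br>';
    }
}

$token = new BrPeriodToken('<br />');
echo $token->getName(), ' => ', $token->getPrintableValue(), PHP_EOL;
```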

Then, unfortunately, I had to create my own CustomLexer, and that was not the end of it. The $input, $state and other fields are private, so I had to copy them 1:1 into my CustomLexer. There is no interface or abstract class, and there are no protected members. I also did not see an abstract builder for the Lexer. The only thing I actually had to change in that file was the reset() method:

private function reset(): void
{
    $this->pos = 0;
    $this->tokenPos = 0;
    $this->tokens = [];
    $this->state = new MyNamespace\TextState;
}

I had to create my own rules.ini file with additional rules:

<T_PERIOD_BR> T_WHITESPACE T_CAPITALIZED_WORD = 75                                               ; <br> Word
<T_PERIOD_BR> T_CAPITALIZED_WORD = 75                                                            ; <br>Word
T_CAPITALIZED_WORD <T_PERIOD_BR> T_CAPITALIZED_WORD = 75                                         ; Word<br>Word
T_WORD <T_PERIOD_BR> T_CAPITALIZED_WORD = 75                                                     ; word<br>Word
T_CAPITALIZED_WORD <T_PERIOD_BR> T_WHITESPACE T_CAPITALIZED_WORD = 75                            ; Word<br> Word
T_WORD <T_PERIOD_BR> T_WHITESPACE T_CAPITALIZED_WORD = 75                                        ; word<br> Word
T_CAPITALIZED_WORD T_WHITESPACE <T_PERIOD_BR> T_WHITESPACE T_CAPITALIZED_WORD = 75               ; Word <br> Word
T_WORD T_WHITESPACE <T_PERIOD_BR> T_WHITESPACE T_CAPITALIZED_WORD = 75                           ; word <br> Word
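To make the shape of these lines explicit, the sketch below splits one rule into its parts. It assumes a reading of the format that is not confirmed in this thread: the token in angle brackets is the candidate boundary, the number after = is a probability in percent, and everything after ; is a comment. parseRuleLine is a hypothetical helper written for this example, not part of the library.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: splits one rule line into token names, the index of
// the bracketed token, and the numeric value after "=".
function parseRuleLine(string $line): array
{
    // Drop the trailing "; comment" part, if any.
    $line = trim(explode(';', $line, 2)[0]);

    [$pattern, $probability] = array_map('trim', explode('=', $line, 2));

    $tokens = preg_split('/\s+/', $pattern);
    $marked = null;
    foreach ($tokens as $i => $name) {
        // A name wrapped in <...> is the token the rule is about.
        if (preg_match('/^<(\w+)>$/', $name, $m) === 1) {
            $tokens[$i] = $m[1];
            $marked = $i;
        }
    }

    return [
        'tokens' => $tokens,
        'marked' => $marked,
        'probability' => (int) $probability,
    ];
}

$rule = parseRuleLine('T_WORD <T_PERIOD_BR> T_WHITESPACE T_CAPITALIZED_WORD = 75 ; word<br> Word');
print_r($rule);
```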

Then I was forced to copy all the States from the vendor directory and only then edit the WordState to emit BrPeriodToken. This was because every State returns a TextState object, which I had overridden. Replacing only some of these classes with my own caused problems.
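The coupling problem can be reproduced in a few lines. The classes below are hypothetical stand-ins, not the library's real states; they only mimic states that instantiate their successor directly. After a single round trip through WordState, the custom text state is gone, which is why replacing one class is not enough.

```php
<?php
declare(strict_types=1);

interface State
{
    public function next(string $char): State;
    public function label(): string;
}

// Stand-in for a vendor state (assumed behaviour, for illustration only).
class TextState implements State
{
    public function next(string $char): State
    {
        return $char === ' ' ? $this : new WordState();
    }

    public function label(): string
    {
        return 'vendor TextState';
    }
}

// Stand-in for a vendor state that hard-codes which class it returns to.
class WordState implements State
{
    public function next(string $char): State
    {
        return $char === ' ' ? new TextState() : $this;
    }

    public function label(): string
    {
        return 'vendor WordState';
    }
}

// The only class replaced in the project.
class CustomTextState extends TextState
{
    public function label(): string
    {
        return 'custom TextState';
    }
}

$state = new CustomTextState();
foreach (str_split('a ') as $char) {
    $state = $state->next($char);
}

// The lexer is back in a text state, but not the custom one.
echo $state->label(), PHP_EOL;
```

Injecting a factory or the successor state through the constructor would avoid the hard-coded new TextState(), but that requires changes in the vendor classes themselves.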

I may have overcomplicated it. If anyone knows an easier way to add a token that supports certain HTML tags, I'd love to know.