curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the box support for training word and document embeddings, and flexible entity recognition models.
MIT License
699 stars 71 forks source link

Whitespace added to captured tokens for PatternSpotter #76

Open cgreathouse opened 1 year ago

cgreathouse commented 1 year ago

Describe the bug When a PatternSpotter uses symbols, the value for the captured token will have additional whitespace

To Reproduce Example for finding possessives

`var possesive = new PatternSpotter(Language.English, 0, "possessive", "Possessive"); possesive.NewPattern("Possessive", p => p.Add( new PatternUnit(PatternUnitPrototype.Single().WithPOS(PartOfSpeech.PROPN, PartOfSpeech.NOUN)), new PatternUnit(PatternUnitPrototype.Single().WithToken("'s")) ));

pipline.Add(possesive); var doc = new Document("The dog's bone", Language.English); pipeline.ProcessSingle(doc);

var tokens= doc.SelectMany(span => span.GetCapturedTokens()).Select(e => new { e.Begin, e.End, e.Value });`

There will be a whitespace between ' and s (i.e. dog' s)

Something similar happens with capturing words wrapped in quotes

Example pattern

var doubleQuoted = new PatternSpotter(Language.English, 0, "double-quoted", "DoubleQuoted"); doubleQuoted.NewPattern("DoubleQuoted", p => p.Add( new PatternUnit(PatternUnitPrototype.Single().WithToken("\"")), new PatternUnit(PatternUnitPrototype.ShouldNotMatch().WithToken("\"")), new PatternUnit(PatternUnitPrototype.Single().WithToken("\"")) ));

Test string : A sentence that "has double quotes" in it

The captured token will have 2 additional whitespaces added (i.e. " has double quotes ")

Expected behavior No additional whitespace (i.e. dog's)