🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
Whitespace added to captured tokens for PatternSpotter #76
Describe the bug
When a PatternSpotter pattern matches symbol tokens (such as apostrophes or quotation marks), the value of the captured token contains additional whitespace.
To Reproduce
Example for finding possessives
`// Match a noun or proper noun followed by the possessive marker 's.
var possessive = new PatternSpotter(Language.English, 0, "possessive", "Possessive");
possessive.NewPattern("Possessive", p => p.Add(
    new PatternUnit(PatternUnitPrototype.Single().WithPOS(PartOfSpeech.PROPN, PartOfSpeech.NOUN)),
    new PatternUnit(PatternUnitPrototype.Single().WithToken("'s"))
));
// pipeline is an existing English Pipeline; its setup is not shown in this report.
pipeline.Add(possessive);
var doc = new Document("The dog's bone", Language.English);
pipeline.ProcessSingle(doc);
// Project each captured token to its offsets and value.
var tokens = doc.SelectMany(span => span.GetCapturedTokens()).Select(e => new
{
    e.Begin,
    e.End,
    e.Value
});`
The captured value contains a space between ' and s (i.e. dog' s).
Something similar happens when capturing words wrapped in quotes.
Example pattern
`var doubleQuoted = new PatternSpotter(Language.English, 0, "double-quoted", "DoubleQuoted");
doubleQuoted.NewPattern("DoubleQuoted", p => p.Add(
    new PatternUnit(PatternUnitPrototype.Single().WithToken("\"")),
    new PatternUnit(PatternUnitPrototype.ShouldNotMatch().WithToken("\"")),
    new PatternUnit(PatternUnitPrototype.Single().WithToken("\""))
));`
Test string: A sentence that "has double quotes" in it
The captured value contains two additional spaces, one after the opening quote and one before the closing quote (i.e. " has double quotes ").
Expected behavior
No additional whitespace (i.e. dog's)
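Until this is fixed, one possible workaround is to rebuild the value from the original text using the captured token's offsets rather than reading its value. The sketch below is plain C# with no Catalyst dependency. `SliceInclusive` is a hypothetical helper, not part of the library, and it assumes that `Begin` and `End` are inclusive character offsets into the original document text, which this report does not confirm. The offsets in the usage line were counted by hand for the test string above.

```csharp
using System;

// Hypothetical helper (not part of Catalyst): returns the substring of `text`
// between two character offsets, treating BOTH offsets as inclusive.
// Assumption: a captured token's Begin/End follow this convention.
static string SliceInclusive(string text, int begin, int end)
{
    if (begin < 0 || end >= text.Length || begin > end)
    {
        throw new ArgumentOutOfRangeException(nameof(begin), "Offsets must lie within the text.");
    }
    return text.Substring(begin, end - begin + 1);
}

// "The dog's bone": 'd' is at index 4 and the final 's' of "dog's" is at index 8.
var text = "The dog's bone";
Console.WriteLine(SliceInclusive(text, 4, 8)); // prints "dog's"
```

If the offsets turn out to be end-exclusive, the length calculation would need to drop the `+ 1`.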