**Closed** — reuth-goldstein closed this issue 10 months ago
Hi @reuth-goldstein,

The parameter `setContextChars` is ignored when `setInfixPatterns` is used, as stated in the ScalaDoc: Spark NLP 4.4.3 ScalaDoc - com.johnsnowlabs.nlp.annotators.Tokenizer.
This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
I have this tokenizer as part of my pipeline:
```python
tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token") \
    .setContextChars([".", ",", ";", ":", "!", "?", "(", ")", "\"", "'", "+", "%", "-", "="]) \
    .setSplitChars(["[", "]"]) \
    .setInfixPatterns([
        r"\b(?<!(?:-|=|\+|\/|\*|[.]|\d))(\d[.]\d[|x|\/]?\d[.]\d+)(mGy*cm|liters|mg\/m2|mg\/m2|mg\/kg|ml\/hr|mmHg|g\/dL|mGy|mls|neg|pos|lbs|lb|ml|mL|Ht|cm|mm|mg|kg|mg|CM|G|g|F|L|U|C|c|m)(?!(?:-|=|\+|\/|\*))\b",
        r"\b(?<!(?:-|=|\+|\/|\*|[.]|\d))(\d[.]\d[|x|\/]?\d[.]\d+)(%|\+|-)"
    ])
```
I've encountered a weird bug: once `setInfixPatterns` is defined, `setContextChars` is ignored. Is this the expected behavior?
For example, on this text: "in HEENT: Sclerae anicteric", I get these tokens: 'in', 'HEENT:', 'Sclerae', 'anicteric'. 'HEENT' keeps its trailing colon even though the colon is defined in `setContextChars`.
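To illustrate why 'HEENT:' stays glued together: once custom infix patterns are supplied, tokens are split only by those regex capture groups, and the context-char handling is bypassed. A workaround consistent with that behavior is to add an extra infix pattern whose groups separate a word from a trailing context character. The sketch below is plain Python `re`, not the Spark NLP API — `split_with_infix` and `context_split` are hypothetical names used only to mimic the group-per-token splitting; in a real pipeline the extra pattern would be appended to the `setInfixPatterns` list.

```python
import re

# Hypothetical extra infix pattern: capture a word and a trailing
# context character (e.g. ':') as two separate groups.
context_split = r"(\w+)([:;,.!?])"

def split_with_infix(token, pattern=context_split):
    """Mimic infix-pattern tokenization: on a full match, each regex
    capture group becomes its own token; otherwise the token is kept."""
    m = re.fullmatch(pattern, token)
    if m:
        return [g for g in m.groups() if g]
    return [token]

print(split_with_infix("HEENT:"))   # -> ['HEENT', ':']
print(split_with_infix("Sclerae"))  # no match, kept whole: ['Sclerae']
```

Appending a pattern like this to the existing `setInfixPatterns` list should restore the colon-splitting behavior that `setContextChars` would otherwise provide.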