JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

tokenizer setInfixPatterns overrides setContextChars #13819

Closed · reuth-goldstein closed this 10 months ago

reuth-goldstein commented 1 year ago

I have this tokenizer as part of my pipeline:

```python
tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token") \
    .setContextChars([".", ",", ";", ":", "!", "?", "(", ")", "\"", "'", "+", "%", "-", "="]) \
    .setSplitChars(["[", "]"]) \
    .setInfixPatterns([
        # note: "+" and "*" are regex metacharacters and must be escaped,
        # otherwise these patterns fail to compile
        r"\b(?<!(?:-|=|\+|\/|\*|[.]|\d))(\d[.]\d[|x|\/]?\d[.]\d+)"
        r"(mGy\*cm|liters|mg\/m2|mg\/kg|ml\/hr|mmHg|g\/dL|mGy|mls|neg|pos|lbs|lb"
        r"|ml|mL|Ht|cm|mm|mg|kg|CM|G|g|F|L|U|C|c|m)"
        r"(?!(?:-|=|\+|\/|\*))\b",
        r"\b(?<!(?:-|=|\+|\/|\*|[.]|\d))(\d[.]\d[|x|\/]?\d[.]\d+)(%|\+|-)"
    ])
```

I've encountered a weird bug: once setInfixPatterns is defined, setContextChars is ignored. Is this the expected behavior?

For example, on the text "in HEENT: Sclerae anicteric" I'm getting these tokens: 'in', 'HEENT:', 'Sclerae', 'anicteric'. HEENT keeps its colon even though the colon is listed in setContextChars.
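To see why the colon survives, here is a plain-`re` sketch (outside Spark NLP, with a trimmed unit list and the `+`/`*` metacharacters escaped, so this is illustrative rather than the exact pipeline pattern): the first infix pattern above only fires on measurement-like tokens, so it has nothing to say about `HEENT:`, and once the infix patterns alone drive the splitting, the colon is never peeled off.

```python
import re

# Trimmed, escaped version of the first infix pattern (illustration only)
pattern = re.compile(
    r"\b(?<!(?:-|=|\+|/|\*|[.]|\d))"  # not preceded by an operator, dot, or digit
    r"(\d[.]\d[x/]?\d[.]\d+)"         # a measurement like 2.5x3.0
    r"(cm|mm|mg|kg)"                  # trimmed unit list
    r"(?!(?:-|=|\+|/|\*))\b"          # not followed by an operator
)

print(pattern.search("2.5x3.0cm"))  # matches: number and unit are split apart
print(pattern.search("HEENT:"))     # None: pattern never fires, colon stays attached
```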

danilojsl commented 1 year ago

Hi @reuth-goldstein

The parameter setContextChars is ignored when setInfixPatterns is used, as stated in the ScalaDoc: Spark NLP 4.4.3 ScalaDoc - com.johnsnowlabs.nlp.annotators.Tokenizer.
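A consequence of the ScalaDoc note is that, when supplying custom infix patterns, any context-char splitting has to be expressed as an extra pattern in the same list. The sketch below shows the splitting rule in plain regex terms; `CONTEXT_CHARS`, `context_pattern`, and `split_context` are illustrative names, not Spark NLP API.

```python
import re

# The context chars from the pipeline above
CONTEXT_CHARS = ".,;:!?()\"'+%-="

# A pattern that peels one trailing context character off a token;
# added to setInfixPatterns, a rule like this restores the colon split
context_pattern = rf"([^\s{re.escape(CONTEXT_CHARS)}]+)([{re.escape(CONTEXT_CHARS)}])"

def split_context(token):
    """Split a trailing context char from a token, if present."""
    m = re.fullmatch(context_pattern, token)
    return list(m.groups()) if m else [token]

print(split_context("HEENT:"))     # ['HEENT', ':']
print(split_context("anicteric"))  # ['anicteric'] (no context char to split)
```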

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.