aphp / edsnlp

Modular, fast NLP framework, compatible with Pytorch and spaCy, offering tailored support for French clinical notes.
https://aphp.github.io/edsnlp/
BSD 3-Clause "New" or "Revised" License
113 stars 29 forks

EDSTokenizer: split on non-breaking spaces and don't split float numbers #141

Closed: percevalw closed this pull request 2 years ago

percevalw commented 2 years ago

Description

Fixes the tokenizer so that it splits on non-breaking spaces and tabs, as mentioned in #140, and no longer splits floating-point numbers into several tokens.
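To illustrate the intended behavior, here is a minimal standalone sketch. It is not the code in `edsnlp/language.py`: the pattern and the `tokenize` helper are hypothetical, and accepting a decimal comma in addition to a decimal point is an assumption made for French text. Unlike a spaCy tokenizer, this sketch simply drops whitespace instead of recording it.

```python
import re

# Hypothetical pattern for illustration only; not the one used by EDSTokenizer.
TOKEN_PATTERN = re.compile(
    r"""
      \d+(?:[.,]\d+)?   # integer or decimal number kept as one token (3, 72.5, 3,5)
    | [^\W\d_]+         # run of letters, accented characters included
    | [^\w\s]           # any single punctuation or symbol character
    | _                 # underscore, which \w would otherwise hide
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Return the non-whitespace tokens of `text`.

    For str patterns, `\\s` covers Unicode whitespace, so tabs and
    non-breaking spaces (U+00A0) separate tokens like regular spaces.
    """
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Poids\u00a0:\t72.5 kg"))
```

A trailing period is still split off (`"12."` gives `"12"` and `"."`), because the decimal part is only consumed when a digit follows the separator.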

Checklist

codecov-commenter commented 2 years ago

Codecov Report

Base: 94.32% // Head: 94.32% // No change to project coverage :thumbsup:

Coverage data is based on head (22be54d) compared to base (366931d). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master     #141   +/-   ##
=======================================
  Coverage   94.32%   94.32%
=======================================
  Files         164      164
  Lines        4704     4704
=======================================
  Hits         4437     4437
  Misses        267      267
```

| [Impacted Files](https://codecov.io/gh/aphp/edsnlp/pull/141?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp) | Coverage Δ | |
|---|---|---|
| [edsnlp/language.py](https://codecov.io/gh/aphp/edsnlp/pull/141/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=aphp#diff-ZWRzbmxwL2xhbmd1YWdlLnB5) | `97.87% <100.00%> (ø)` | |
