explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
29.82k stars, 4.37k forks

Text tokenizer is classifying the letter "O" as punctuation. #13221

Closed by dorncg18 9 months ago

dorncg18 commented 9 months ago

I am using the following code:

doc = nlp(text)
for token in doc:
    if token.pos_ == 'PUNCT':
        text = text.replace(token.text, '')

The input is the following raw text, read from a PDF using pyPDF:

"with a proven track record of delivering strategic financial solutions for clients. Highly accomplished"

It is being converted to:

"with a prven track recrd f delivering strategic financial slutins fr clients Highly accmplished"

I reported this behavior to the creator of the package I am using, Resume Matcher. In the meantime, I can keep the letter "O" in the output with this workaround:

doc = nlp(text)
for token in doc:
    if token.pos_ == 'PUNCT' and token.text != 'o':
        text = text.replace(token.text, '')

The problem may lie in how the text is read in from pyPDF, but the text returned by the pyPDF function looks correct when I inspect it.
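One possible explanation, offered as an assumption rather than a confirmed diagnosis: if a single stray "o" token (for example a bullet glyph extracted from the PDF) is tagged PUNCT, then text.replace(token.text, '') removes every "o" in the string, not only that token. The sketch below reproduces this without spaCy. `Tok` is a hypothetical stand-in for a spaCy token, and the sample text and tags are invented for illustration. It also shows an alternative that removes only the flagged token's own character span, using the token's offset (spaCy exposes this as `token.idx`).

```python
from typing import List, NamedTuple


class Tok(NamedTuple):
    """Hypothetical stand-in for a spaCy token (text, pos_, idx)."""
    text: str
    pos_: str
    idx: int  # character offset of the token in the original text


def strip_by_replace(text: str, tokens: List[Tok]) -> str:
    # Pattern from the report: str.replace removes EVERY occurrence
    # of the substring, not just the token that was tagged PUNCT.
    for tok in tokens:
        if tok.pos_ == "PUNCT":
            text = text.replace(tok.text, "")
    return text


def strip_by_offset(text: str, tokens: List[Tok]) -> str:
    # Alternative: cut out only the character span of each flagged token.
    pieces = []
    cursor = 0
    for tok in tokens:
        if tok.pos_ == "PUNCT":
            pieces.append(text[cursor:tok.idx])
            cursor = tok.idx + len(tok.text)
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "o proven track record for clients."
tokens = [
    Tok("o", "PUNCT", 0),  # assumed: stray bullet glyph tagged as punctuation
    Tok("proven", "ADJ", 2),
    Tok("track", "NOUN", 9),
    Tok("record", "NOUN", 15),
    Tok("for", "ADP", 22),
    Tok("clients", "NOUN", 26),
    Tok(".", "PUNCT", 33),
]

print(repr(strip_by_replace(sample, tokens)))
print(repr(strip_by_offset(sample, tokens)))
```

With real spaCy tokens, the same offset-based approach should work by reading `token.idx` and `len(token.text)` from each token in `doc`. Whether a stray "o" is actually present in the pyPDF output would still need to be checked by printing the tokens tagged PUNCT.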

Info about spaCy

Python 3.9.0 Windows 10

svlandeg commented 9 months ago

Hi! Let me transfer this to the discussion forum and answer you there.