OmkarPathak / pyresparser

A simple resume parser used for extracting information from resumes
GNU General Public License v3.0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 0: invalid continuation byte #32

Closed (by puttapraneeth, 3 years ago)

puttapraneeth commented 4 years ago

This is absolutely great.

Using the pyresparser package I am able to extract the fields from a resume. To check the implementation, I downloaded the code and did the setup as documented. When I executed it with the same resume, it ended with an error; details are below. The resume used for this doesn't contain any images, and it works with the installed pyresparser package.

Command: python resume_parser.py

```
Traceback (most recent call last):
  File "resume_parser.py", line 133, in <module>
    data = ResumeParser('OmkarResume.pdf').get_extracted_data()
  File "resume_parser.py", line 20, in __init__
    custom_nlp = spacy.load(os.path.dirname(os.path.abspath(__file__)))
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 133, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 173, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\language.py", line 791, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 630, in from_disk
    reader(path / key)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\language.py", line 781, in <lambda>
    deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
  File "tokenizer.pyx", line 391, in spacy.tokenizer.Tokenizer.from_disk
  File "tokenizer.pyx", line 432, in spacy.tokenizer.Tokenizer.from_bytes
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 606, in from_bytes
    msg = srsly.msgpack_loads(bytes_data)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\srsly\_msgpack_api.py", line 29, in msgpack_loads
    msg = msgpack.loads(data, raw=False, use_list=use_list)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\srsly\msgpack\__init__.py", line 60, in unpackb
    return _unpackb(packed, **kwargs)
  File "_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 0: invalid continuation byte
```

I am unable to understand why it is failing and need your help in resolving this.

Thanks, Praneeth

OmkarPathak commented 4 years ago

That's weird. Are you sure you used only the `python resume_parser.py` command? Were any other parameters passed while executing it?

puttapraneeth commented 4 years ago

Exactly the same; I did not add any parameters.

Nothing gets executed after this line: `custom_nlp = spacy.load(os.path.dirname(os.path.abspath(__file__)))`

I tried with both PDF and DOCX file types, which work with the installed pyresparser, but not here.

I am using Windows 10. They have explained this issue here
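(Editorial note: one possible cause of this msgpack `UnicodeDecodeError` on Windows is that the binary model files bundled with the repo were corrupted in transit, for example by git's CRLF line-ending conversion misfiring on a binary file. A minimal, hypothetical sketch for locating such a file is to hash every file in the suspect model directory and compare against a known-good copy, such as the one installed by pip; the helper names here are illustrative, not part of pyresparser.)

```python
import hashlib
from pathlib import Path

def dir_hashes(root):
    """Map each file's path (relative to root) to a SHA-256 digest of its bytes."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def diff_dirs(good, suspect):
    """Return relative paths whose contents differ, or that exist on one side only."""
    a, b = dir_hashes(good), dir_hashes(suspect)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

Any path reported by `diff_dirs` (for instance the `tokenizer` file, which is where the traceback fails) would point at the corrupted file; re-copying it in binary-safe fashion, or re-cloning with `git config core.autocrlf false`, may resolve the load error.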

puttapraneeth commented 4 years ago

So should I train it again on Windows 10 and try as explained here?

OmkarPathak commented 4 years ago

@puttapraneeth I have already tested the same on Windows; it has no such problems. There is no need to re-train it on Windows.

OmkarPathak commented 4 years ago

Can you please paste here the code you are using in `__main__` in the resume_parser.py file?

puttapraneeth commented 4 years ago

Definitely.

```python
if __name__ == '__main__':

    pool = mp.Pool(mp.cpu_count())

    resumes = []
    data = []
    for root, directories, filenames in os.walk('resumes/'):
        for filename in filenames:
            file = os.path.join(root, filename)
            resumes.append(file)

    results = [
        pool.apply_async(
            resume_result_wrapper,
            args=(x,)
        ) for x in resumes
    ]

    results = [p.get() for p in results]

    pprint.pprint(results)
```
puttapraneeth commented 4 years ago

Error:

```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\pyresparser\pyresparser\resume_parser.py", line 131, in resume_result_wrapper
    parser = ResumeParser(resume)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\pyresparser\pyresparser\resume_parser.py", line 20, in __init__
    custom_nlp = spacy.load(os.path.dirname(os.path.abspath(__file__)))
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 133, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 173, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\language.py", line 791, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 630, in from_disk
    reader(path / key)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\language.py", line 781, in <lambda>
    deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
  File "tokenizer.pyx", line 391, in spacy.tokenizer.Tokenizer.from_disk
  File "tokenizer.pyx", line 432, in spacy.tokenizer.Tokenizer.from_bytes
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 606, in from_bytes
    msg = srsly.msgpack_loads(bytes_data)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\srsly\_msgpack_api.py", line 29, in msgpack_loads
    msg = msgpack.loads(data, raw=False, use_list=use_list)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\srsly\msgpack\__init__.py", line 60, in unpackb
    return _unpackb(packed, **kwargs)
  File "_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 0: invalid continuation byte
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "resume_parser.py", line 156, in <module>
    results = [p.get() for p in results]
  File "resume_parser.py", line 156, in <listcomp>
    results = [p.get() for p in results]
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 0: invalid continuation byte
```

puttapraneeth commented 4 years ago

I tried this as well, which didn't work:

```python
if __name__ == '__main__':
    data = ResumeParser('OmkarResume.pdf').get_extracted_data()
    print(data)
```

Command: `python resume_parser.py`

Error:

```
Traceback (most recent call last):
  File "resume_parser.py", line 138, in <module>
    data = ResumeParser('OmkarResume.pdf').get_extracted_data()
  File "resume_parser.py", line 20, in __init__
    custom_nlp = spacy.load(os.path.dirname(os.path.abspath(__file__)))
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 133, in load_model
    return load_model_from_path(Path(name), **overrides)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 173, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\language.py", line 791, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 630, in from_disk
    reader(path / key)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\language.py", line 781, in <lambda>
    deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(p, exclude=["vocab"])
  File "tokenizer.pyx", line 391, in spacy.tokenizer.Tokenizer.from_disk
  File "tokenizer.pyx", line 432, in spacy.tokenizer.Tokenizer.from_bytes
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\spacy\util.py", line 606, in from_bytes
    msg = srsly.msgpack_loads(bytes_data)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\srsly\_msgpack_api.py", line 29, in msgpack_loads
    msg = msgpack.loads(data, raw=False, use_list=use_list)
  File "C:\Users\Praneeth\Anaconda3\envs\pyparser\lib\site-packages\srsly\msgpack\__init__.py", line 60, in unpackb
    return _unpackb(packed, **kwargs)
  File "_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position 0: invalid continuation byte
```

OmkarPathak commented 4 years ago

I see. In this case we need to try re-training.

puttapraneeth commented 4 years ago

I am getting the issue below when I start training:

```
ValueError: [E103] Trying to set conflicting doc.ents: '(1155, 1199, 'Email Address')' and '(1143, 1240, 'Links')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
```

How do I resolve this overlap issue?
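(Editorial note: the E103 error means two annotated spans in the training data overlap; here the 'Email Address' span at characters 1155–1199 sits inside the 'Links' span at 1143–1240. One common workaround is to drop the shorter of any two overlapping annotations before training. The sketch below is a plain-Python illustration with a hypothetical helper name; on `Span` objects, spaCy ships `spacy.util.filter_spans` for the same job.)

```python
def drop_overlapping(entities):
    """Keep longer spans first; drop any span overlapping an already-kept one.

    `entities` is a list of (start_char, end_char, label) tuples, the entity
    format used in spaCy v2 training data.
    """
    kept = []
    # Sort longest-first so the shorter of two overlapping spans is dropped.
    for start, end, label in sorted(entities, key=lambda e: (e[0] - e[1], e[0])):
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# The two conflicting spans from the E103 message above:
ents = [(1155, 1199, 'Email Address'), (1143, 1240, 'Links')]
print(drop_overlapping(ents))  # only the longer 'Links' span survives
```

Running the training annotations through such a filter (or fixing the overlapping labels in the annotation tool) should clear the E103 error.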

puttapraneeth commented 4 years ago

The first time, it failed at

`assert nlp2.get_pipe("ner").move_names == move_names`

'O' is present in the move names of the new model but not in the original move names, hence the mismatch and the failure.

When I ran it again I got the overlap issue. I searched about the overlap issue but didn't understand it. This might be a silly issue, but I am a newbie; could you help me resolve it?

Thanks.
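(Editorial note: `move_names` is just a list of the NER component's transition names, and the strict `==` in that assertion, copied from spaCy's training example script, fails on any difference at all. A quick way to see exactly which names differ is a set difference; the lists below are hypothetical, illustrating the extra outside marker 'O' described in the comment above.)

```python
# Hypothetical move-name lists reproducing the reported mismatch:
# the reloaded model carries an extra outside marker 'O'.
original_moves = ["B-Name", "I-Name", "L-Name", "U-Name"]
reloaded_moves = ["O"] + original_moves

extra = sorted(set(reloaded_moves) - set(original_moves))
missing = sorted(set(original_moves) - set(reloaded_moves))
print("extra:", extra, "missing:", missing)  # extra: ['O'] missing: []
```

If the only difference is the 'O' marker, the two models recognize the same entity labels, and comparing `set(move_names)` instead of the exact lists may be enough to get past the assertion; whether that is safe for a given spaCy version is an assumption worth verifying.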

OmkarPathak commented 4 years ago

@puttapraneeth I'm not sure about this error. Can you please try raising this issue on spaCy's issue tracker?

puttapraneeth commented 3 years ago

I raised this on spaCy's issue tracker but didn't receive any response from them, hence closing this one.

Thanks, Praneeth

Datapirate-98 commented 1 year ago

Hi @puttapraneeth,

Did you find a solution to this error?

Datapirate-98 commented 1 year ago

> The first time, it failed at
>
> `assert nlp2.get_pipe("ner").move_names == move_names`
>
> 'O' is present in the move names of the new model but not in the original move names, hence the mismatch and the failure.
>
> When I ran it again I got the overlap issue. I searched about the overlap issue but didn't understand it. This might be a silly issue, but I am a newbie; could you help me resolve it?
>
> Thanks.

Hi @puttapraneeth, can you please paste the corrected moves file content here?

Or

Can you please just specify how I can resolve this error: `assert nlp2.get_pipe("ner").move_names == move_names`