OmkarPathak / pyresparser

A simple resume parser used for extracting information from resumes
GNU General Public License v3.0

Extraction of Names is not always correct #23

Closed arvindrkrishnen closed 4 years ago

arvindrkrishnen commented 4 years ago

I have run this module over a batch of 1,000+ resumes, and name extraction is not always correct; it succeeded on barely 200 of them. Have you validated this code against a variety of resume samples?

OmkarPathak commented 4 years ago

@arvindrkrishnen yes, I am aware of it. The thing is, resumes do not follow a specific format, and it is really hard to train a model for such a variety of layouts. But for many of the "standard" resumes, I think name extraction is fairly good. Please suggest better methods if you have any 😄

arvindrkrishnen commented 4 years ago

Could you create a custom NER model trained to recognize Indian names? Download datasets from here - http://au-kbc.org/nlp/NER-FIRE2013/

Additional datasets available at https://github.com/piyusharma95/NER-for-Hindi
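For reference, spaCy (which pyresparser uses for its NER) expects custom training examples as `(text, {"entities": [(start, end, label)]})` tuples. A small helper to build such examples from a list of known names might look like the sketch below; the sentences are hypothetical and this is not tied to the FIRE 2013 dataset format:

```python
def make_ner_examples(sentences, names, label="PERSON"):
    """Build spaCy-style training tuples:
    (text, {"entities": [(start, end, label)]}).
    Only exact, case-sensitive matches are annotated; a real pipeline
    would need proper tokenization and overlap handling."""
    examples = []
    for text in sentences:
        entities = []
        for name in names:
            start = text.find(name)
            if start != -1:
                entities.append((start, start + len(name), label))
        examples.append((text, {"entities": entities}))
    return examples

# Hypothetical sentences; names would come from a dataset such as NER-FIRE2013
data = make_ner_examples(
    ["Rajesh Kumar worked at Infosys.", "The report was filed by Priya Sharma."],
    ["Rajesh Kumar", "Priya Sharma"],
)
# data[0] → ("Rajesh Kumar worked at Infosys.",
#            {"entities": [(0, 12, "PERSON")]})
```

Once the data is in this shape, it can be fed to spaCy's standard NER training loop.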

OmkarPathak commented 4 years ago

@arvindrkrishnen thanks for the links. Sure, I will try to train an NER model on the above-mentioned datasets. Will let you know once done.

arvindrkrishnen commented 4 years ago

What worked well for me is using Apache Tika to extract cleaner text from PDF and Word documents. Just add this component to your code:

import tika
from tika import parser

# Download and install the Tika server from
# https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-RunningtheTikaServerasaJarfile
# Start the Tika server before running the lines below
file_data = parser.from_file(file, 'http://localhost:9998/tika')
text = file_data['content']

OmkarPathak commented 4 years ago

@arvindrkrishnen thanks for the help. I did not find Tika to be very effective for text extraction. It is fairly similar to pdfminer, and hence we have decided not to include it in pyresparser, as it adds one more layer of dependency.