bjherger / ResumeParser

A framework to parse resumes, extract contact & other information, and check for required terms

Resumes not able to be parsed #29

Closed arjunalal closed 5 years ago

arjunalal commented 5 years ago

Hey there, I'm running ResumeParser on Ubuntu 18.04.1 in a VirtualBox and I am running into issues when trying to parse through a set of nine resumes that I have obtained.

My environment sets up fine and the code runs, but when I look at the output .csv file, I'm finding that only three of the resumes are actually able to be parsed, while the rest have 'NOT FOUND' in the text field and blanks for all the skills that I have defined to be extracted in my configuration .yaml file.

When I try the sample resumes provided in the repository, the code parses them perfectly and is able to find relevant text, even with my specified configurations. The code seems to break when I try to use a different set of resumes.

I'm wondering if this is an issue with the resumes that I am using or if perhaps I can change configurations or something else to make the code parse them? Let me know your thoughts. Thanks!

bjherger commented 5 years ago

Interesting. It looks like there might be an issue with either the config file or the way the resumes are stored.

Could you send a copy of your config file?

BH

arjunalal commented 5 years ago

Hey, thanks for getting back so quickly. GitHub doesn't support attaching .yaml files, so I changed it to a .txt file before attaching it to this message. Let me know if there are any issues and I can email you the original config.yaml file.

config.txt

bjherger commented 5 years ago

This might be an issue with the data that is being read in: the resumes may be corrupted, may not be OCR-able, or may be in a non-English language. Could you check one of the resumes that isn't parsing, and confirm that you're able to copy and paste text from it, and / or that it is readable?
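To pick out which resumes to check by hand, the output .csv can be filtered for rows whose text field is blank or 'NOT FOUND'. This is only a minimal sketch, not part of ResumeParser: the column names `file_path` and `text` and the helper `unparsed_rows` are assumptions made for illustration, so adjust them to match the header of your actual output file.

```python
import csv
import io


def unparsed_rows(csv_file, text_column="text", marker="NOT FOUND"):
    """Return the rows whose extracted-text field is empty or equals the marker.

    text_column and marker are assumptions; match them to your own output file.
    """
    reader = csv.DictReader(csv_file)
    failed = []
    for row in reader:
        value = (row.get(text_column) or "").strip()
        if value == "" or value == marker:
            failed.append(row)
    return failed


# Tiny in-memory stand-in for the output CSV (column names are hypothetical).
sample = io.StringIO(
    "file_path,text\n"
    "a.pdf,Jane Doe Python SQL\n"
    "b.pdf,NOT FOUND\n"
    "c.pdf,\n"
)
failed_paths = [row["file_path"] for row in unparsed_rows(sample)]
print(failed_paths)
```

For a real run, pass an open file handle instead of the in-memory sample, then try copying text out of each listed file in a PDF viewer.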

arjunalal commented 5 years ago

Thank you for your insight. It does indeed seem to be the case that I am unable to copy and paste text from the resumes in my set that failed extraction.

I am unclear as to why this is, since there are no security features on the PDF files and I'm relatively certain that they were created from text, not scanned in. Would you be able to speculate on any other reasons why certain PDFs would be unable to have text extracted?

bjherger commented 5 years ago

That's a great question. It depends a lot on the software that's rendering text to PDF. By default, most will include text and markup, but some programs can't (or don't by default) include this information, and instead only include images.

Images are sometimes usable via OCR, and ResumeParser does its best to use OCR to extract text from images. However, this process has a failure rate of roughly 15%, which is about as good as you can do without expensive software.
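As a rough way to spot the image-only PDFs described above without opening each one, the raw bytes can be scanned for font and image markers. This is a heuristic sketch under stated assumptions, not how ResumeParser works: `looks_image_only` is a hypothetical helper, and because newer PDFs can store these dictionaries inside compressed object streams, a file with real text may still be flagged. Treat a `True` result as a hint to check the file by hand.

```python
def looks_image_only(pdf_bytes):
    """Heuristic: True if the raw bytes mention images but no font resources.

    PDFs with selectable text normally reference fonts (/Font), while
    image-only pages reference image XObjects (/Image). Compressed object
    streams can hide these names, so this is a hint, not a guarantee.
    """
    has_font = b"/Font" in pdf_bytes
    has_image = b"/Image" in pdf_bytes
    return has_image and not has_font


# Synthetic fragments standing in for real file contents (illustrative only).
text_pdf = b"%PDF-1.4 << /Type /Page /Resources << /Font << /F1 5 0 R >> >> >>"
scan_pdf = b"%PDF-1.4 << /Type /XObject /Subtype /Image /Width 1700 >>"

print(looks_image_only(text_pdf))
print(looks_image_only(scan_pdf))
```

To use it on a real file, read the file in binary mode (`open(path, "rb")`) and pass the bytes to the function.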

arjunalal commented 5 years ago

Thanks so much for all your help, I really appreciate the hard work you've put into this project and the advice you've given me. I'm gonna go ahead and close this issue. Cheers!