Discussion: Parsing of documents/Link scraping.

K05730 commented 5 years ago

As part of making signup as easy as possible, we are wondering whether there would be any benefit in adding parsing of uploaded documents and the ability to add a link to be scraped and imported. We are not sure. What do you think?

nocategory commented 5 years ago

I think it's a great idea. Everyone who is looking for a job surely has a CV. It would simplify the sign-up process, which can sometimes be so tedious that the user gives up on the platform (I've personally felt this on other platforms where I had to, for example, add all my skills).

ivan-kolmychek-devv commented 5 years ago

@DevvKris @nocategory I have moved this issue to the feedback repo, as it would require changes on the API side.

We can keep discussing it here. :)

ivan-kolmychek-devv commented 5 years ago

Let me post a small update here with more information. :slightly_smiling_face:

After a few internal discussions, we figured out that not only is this possible without modifying the API, but we would probably prefer to do it that way.

The current API already exposes mutations to change profile data (like skills and such), so document parsing/profile import can be done either completely in the browser (if it's practical) or as a separate small service which we can host.

A big benefit of both approaches is that either can be done as an open-source project.

So, if anyone is willing to help with this, please let us know which way you would prefer.

Links to any resources for doing it on the browser side are also welcome, as well as any thoughts on the security and performance side of it. I think we would like to avoid a solution that compromises the security of community members or makes the browser slow your device to a halt. :slightly_smiling_face:

K05730 commented 5 years ago

That’s great news!

lilsweetcaligula commented 5 years ago

I am more of a back-end person, so I am a bit biased on this matter. I am more inclined in favor of a separate microservice, but I'm not sure about the security aspect of this decision.

The reason I favor this approach at this point is that it separates the concerns of the components inside the system: candidates-web handles presentation, while the back-end API handles the core business logic.

I have, however, coded up a small example for front-end file upload in vanilla JS (har-har) to prove the concept. It expects text files, but I know for certain this can be extended to images. Given an appropriate parsing library, it can be extended to PDF files as well:

https://codepen.io/lilsweetcaligula/pen/WmRKVy

P.S. Do I understand correctly we can use this on frontend with something like webpack?

https://www.npmjs.com/package/pdf-parse

ivan-kolmychek-devv commented 5 years ago

> I am more of a back-end person, so I am a bit biased on this matter. I am more inclined in favor of a separate microservice

That's our point of view right now as well, though I think it really depends on the community. Also, given the links that you provided, I have the feeling that it will be easier and faster to try implementing it in the browser first. If that does not work out, we can fall back to a backend implementation.

> but I'm not sure about the security aspect of this decision.

If we go with the in-browser implementation, my guess is that the big question will be "can a malicious PDF hurt the user in any way?". On the other hand, Firefox uses a similar approach and considers it secure, and we do not expect users to try to parse random PDFs from malicious actors. (Should we? :slightly_smiling_face: )

If we go with the backend implementation, then it's mostly about trusting that the backend does what is expected with the PDF and nothing more. If the service is open source, everyone can see what it does with the provided data. Then it mostly comes down to trusting that we haven't made any malicious adjustments while deploying it.

I am not aware of any interest on our part in hoarding PDFs, and given the direction the EU is taking on protection of personal data (see GDPR, for example), my guess is that it is in our best interest not to store them.

> https://codepen.io/lilsweetcaligula/pen/WmRKVy

Yep, FileReader can be used to read a file that the user provides.

But instead of fileReader.readAsText(file) we can use fileReader.readAsBinaryString(file) or fileReader.readAsArrayBuffer(file), depending on which works best for getting data to pass to pdf-parse.
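For illustration, here is a minimal sketch of the readAsArrayBuffer variant wrapped in a Promise. The helper name readFileAsArrayBuffer and the FileReaderImpl parameter are made up for this example; the parameter exists only so the helper can be tried outside a browser with a stub.

```javascript
// Read a user-provided File (or Blob) into an ArrayBuffer.
// `FileReaderImpl` is a hypothetical parameter: it defaults to the browser's
// FileReader and is only there so a stub can be injected for testing.
function readFileAsArrayBuffer(file, FileReaderImpl = globalThis.FileReader) {
  return new Promise((resolve, reject) => {
    const reader = new FileReaderImpl();
    // Attach handlers before starting the read.
    reader.onload = () => resolve(reader.result);
    reader.onerror = () => reject(reader.error);
    reader.readAsArrayBuffer(file);
  });
}

// Browser usage (assumes an <input type="file" id="cv"> on the page):
// document.getElementById('cv').addEventListener('change', async (event) => {
//   const buffer = await readFileAsArrayBuffer(event.target.files[0]);
//   console.log('read', buffer.byteLength, 'bytes');
// });
```

The resulting ArrayBuffer can be wrapped in a Uint8Array if the parsing library expects typed-array input.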

> P.S. Do I understand correctly we can use this on frontend with something like webpack?
>
> https://www.npmjs.com/package/pdf-parse

It looks to me like we can. It uses pdf.js, which should work in browsers (Firefox uses it).

If anyone could try it out and provide some kind of proof of concept of extracting text from a PDF read via FileReader, we would be grateful. :slightly_smiling_face:

lilsweetcaligula commented 5 years ago

@ivan-kolmychek-devv I tried to get a proof of concept and sort of succeeded. I successfully set up the upload form, selected a .pdf file and had its output logged to the console.
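For readers following along, the flow described here might look roughly like the sketch below. It is not the code from the proof of concept, just an illustration under assumptions: the function name extractPdfText is made up, and pdfjsLib is passed in as a parameter (in the app it would be the pdfjs-dist module) so the sketch stays self-contained.

```javascript
// Extract the text of every page from PDF bytes using the pdf.js API.
// `pdfjsLib` is injected (assumption for this sketch); `data` is a typed
// array or ArrayBuffer with the PDF contents, e.g. read via FileReader.
async function extractPdfText(pdfjsLib, data) {
  // getDocument returns a loading task; its `promise` resolves to the document.
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  // pdf.js page numbers start at 1.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each text item carries its string in `str`.
    pages.push(content.items.map((item) => item.str).join(' '));
  }
  return pages.join('\n');
}

// Usage in the browser (sketch):
// const text = await extractPdfText(pdfjsLib, new Uint8Array(arrayBuffer));
// console.log(text);
```

Joining items with a space is a simplification; pdf.js returns positioned text fragments, so the original layout is not preserved.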

However, there's one problem that I may need help with. The pdf.js package, which is the basis for all of the PDF libraries, requires a worker, to which it sends GET requests, e.g.:

express:router dispatching GET /static/pdf.worker.js
express:router dispatching GET /0.js

Two issues:

1. The worker is served as text/html with the Vue app's scaffold. The pdf.js package complains because it expects application/javascript. Surprisingly, it still parses the PDF file and logs output.
2. The 0.js script is apparently generated by the pdf.js package. I have not figured out what 0.js does exactly, or how to customize its generation path. Nothing is created in the root directory. Although it does not seem to cause issues, I'd rather know what it is.
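One way to confirm the first issue is to request the worker URL directly and look at the Content-Type header. A small sketch follows; the helper names are hypothetical, and the set of accepted MIME types is an assumption made for this example.

```javascript
// Return true if a Content-Type header value denotes JavaScript.
// Parameters such as "; charset=utf-8" are ignored.
function isJavaScriptContentType(contentType) {
  if (!contentType) return false;
  const mime = contentType.split(';')[0].trim().toLowerCase();
  return mime === 'application/javascript' || mime === 'text/javascript';
}

// Fetch the worker URL and report whether it is served as JavaScript.
// `fetchImpl` defaults to the global fetch; it is a parameter only so a
// stub can be injected outside the browser.
async function checkWorkerUrl(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  return response.ok && isJavaScriptContentType(response.headers.get('content-type'));
}

// In the browser console:
// checkWorkerUrl('/static/pdf.worker.js').then(console.log);
```

A result of false while the page itself loads fine suggests the server is answering with its HTML fallback instead of the worker script.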

I coded the logic dependent directly on the pdf.js interface because packages like pdf-parse and pdf2json caused issues with setting them up. It should not be a problem to resolve the issues with the packages, however.

I have not submitted a PR because of the outlined issues.

ivan-kolmychek-devv commented 5 years ago

Interesting.

I am not sure I have answers to these issues right now, as I don't have any experience with these libs. I hope someone who does will pop up in the thread, but until then let's try to sort it out. :slightly_smiling_face:

> I have not submitted a PR because of the outlined issues.

Not all contributions have to be in the form of PRs. You have spent your time researching the issue and written up your results here. To me, this already sounds like a sizable contribution. :slightly_smiling_face:

> I coded the logic dependent directly on the pdf.js interface because packages like pdf-parse and pdf2json caused issues with setting them up. It should not be a problem to resolve the issues with the packages, however.

Can you share which issues you stumbled upon? That way, either someone can share a way to resolve them, or at least everyone else who wants to take a shot at this will be warned about them. :slightly_smiling_face:

> However, there's one problem that I may need help with. The pdf.js package, which is the basis for all of the PDF libraries, requires a worker, to which it sends GET requests ...

Just to clarify: a worker as in a separate process running on the user's machine? Or a Service Worker? Or a Web Worker?

> The worker is served as text/html with the Vue app's scaffold. The pdf.js package complains because it expects application/javascript. Surprisingly, it still parses the PDF file and logs output.

Do I understand correctly that there is an issue with how webpack-dev-server serves the files from pdf.js? Or are we talking about some other (separate) server that you have to run?

Please let me know if I misunderstand something. I am pretty sure your knowledge and experience on this right now are greater than mine. :slightly_smiling_face:

lilsweetcaligula commented 5 years ago

Meh, the culprit was a typo in the worker's path.

HTML was returned because, when no matching path was found, the app replied with a basic HTML scaffold (not the 404 page). I solved (?) the problem by requiring pdfjs-dist/webpack so that the worker is configured automatically.

If needed, you can actually copy the worker over to the static folder and point the library to it. I explained how in a comment in the source code of the CvUpload component.

By the way, the mysterious GET request to 0.js disappeared once I fixed the typo, whether or not the worker is auto-configured.
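The two options can be sketched as follows. This is an illustration rather than the code from the CvUpload component: configureWorker is a made-up helper, and the path is the one that appears in the router log earlier in the thread.

```javascript
// Point pdf.js at an explicitly hosted worker script.
// `pdfjsLib` is the pdf.js module object; it is passed in here so the
// sketch does not depend on how the app imports it.
function configureWorker(pdfjsLib, workerSrc) {
  pdfjsLib.GlobalWorkerOptions.workerSrc = workerSrc;
  return pdfjsLib;
}

// Option 1: let the webpack entry point set up the worker automatically.
// const pdfjsLib = require('pdfjs-dist/webpack');

// Option 2: copy pdf.worker.js to the static folder and set the path by hand.
// const pdfjsLib = configureWorker(require('pdfjs-dist'), '/static/pdf.worker.js');
```

With option 2, a typo in the path fails quietly in this setup, because the dev server answers unknown paths with the HTML scaffold.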

P. S.: I found the Express router debug output to be quite useful. Just add DEBUG=express:router to the candidates script in package.json. Additionally, you can define a separate script for convenience:

"scripts": {
  "candidates-debug": "cross-env APP_NAME=candidates DEBUG=express:router webpack-dev-server --inline --progress --config build/webpack.dev.conf.js"
}

ivan-kolmychek-devv commented 5 years ago

The parsing from https://github.com/DevvJobs/candidates-web/pull/34 (thanks @lilsweetcaligula) is now enabled in the development version. Right now it extracts all the text from the PDF.

We have briefly discussed internally how we should proceed. We think it's best to start with skills, and that we should have an extra step before submitting the results, so that it's easy to see exactly which skills will be created, delete what you don't want to be sent, fill in any missing information, fix anything that's wrong and so on.

Our suggestion would be to do it in these steps:

A. Add extraction of skills from the PDF text; they can be put into state. Display them on the page as static text, for debugging purposes and so that other community members can jump in, test it out and help.
B. Add code that will try to find those skills by making API queries; the result can be put into state. Display the found skills as well, for debugging purposes.
C. Add the ability to fill in missing data for each skill. For example, while it is possible to detect years of experience in a particular skill, we think it's better to leave that for later, so for now we can ask for it to be filled in manually.
D. Add code to send the result (the found skills with all the required data) to the server.
E. Improve from there.
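To make step A more concrete, here is a minimal sketch of one possible approach: matching the extracted text against a list of known skill names. Everything in it is an assumption for illustration, including the KNOWN_SKILLS list, the function names and the yearsOfExperience field; the thread does not specify where the skill vocabulary comes from or what shape the API expects.

```javascript
// Hypothetical vocabulary; in practice it could come from the API (step B).
const KNOWN_SKILLS = ['JavaScript', 'TypeScript', 'Java', 'Node.js', 'Vue', 'C', 'C++', 'SQL'];

// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the known skills that occur in the text as whole tokens,
// case-insensitively. \b is avoided because names such as "C++" or
// "Node.js" contain non-word characters.
function extractSkills(text, knownSkills = KNOWN_SKILLS) {
  return knownSkills.filter((skill) => {
    const pattern = new RegExp('(^|[^a-z0-9])' + escapeRegExp(skill) + '(?![a-z0-9+#])', 'i');
    return pattern.test(text);
  });
}

// Draft entries for the review step; years of experience are left
// empty so that they can be filled in manually (step C).
function toDraftSkills(names) {
  return names.map((name) => ({ name, yearsOfExperience: null }));
}

// Example:
// toDraftSkills(extractSkills('Five years of JavaScript and Vue, some C++.'));
```

Plain list matching will miss synonyms and misspellings, which is one more reason to keep the manual review step before anything is sent to the server.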

Feedback and suggestions are welcome.

ivan-kolmychek-devv commented 5 years ago

Created https://github.com/DevvJobs/candidates-web/issues/37 for step A of the plan.
