Open K05730 opened 5 years ago
I think it's a great idea: everyone who is looking for a job surely has a CV. It would simplify the sign-up process, which can sometimes be so tedious that the user gives up on the platform (I've personally felt this on other platforms where I had to, for example, add all my skills).
@DevvKris @nocategory I have moved this issue to the feedback repo, as it would require changes on the API side.
We can keep discussing it here. :)
Let me post a small update here with more information. :slightly_smiling_face:
After a few internal discussions we worked out that not only is this possible without modifying the API, but we would probably prefer to do it that way.
The current API already exposes mutations to change profile data (like skills and such), so document parsing/profile import can be done either completely in the browser (if it's practical) or as a separate small service which we can host.
A big benefit of both approaches is that both can be done as open-source projects.
So, if anyone is willing to help with this, please let us know which way you would prefer.
Links to any resources for doing it on the browser side are also welcome, as well as any thoughts on the security and performance side of it. I think we would like to avoid a solution that compromises the security of community members or makes the browser slow your device to a halt. :slightly_smiling_face:
That’s great news!
I am more of a back-end person, so I am a bit biased on this matter. I am more inclined in favor of a separate microservice, but I'm not sure about the security aspect of this decision.
The reason I favor this approach at this point is that it separates concerns of components inside the system: candidate-web provides the service of representation, the back-end API provides the service of the core business logic.
I have, however, coded up a small example for front-end file upload in vanilla JS (har-har) to prove the concept. It expects text files, but I know for certain this can be extended to images. Given an appropriate parsing library, it can be extended to PDF files as well:
https://codepen.io/lilsweetcaligula/pen/WmRKVy
P.S. Do I understand correctly we can use this on frontend with something like webpack?
> I am more of a back-end person, so I am a bit biased on this matter. I am more inclined in favor of a separate microservice
That's our point of view right now as well, yet I think it really depends on the community. Also, given the links that you provided, I have the feeling that it will be easier and faster to try implementing it in the browser first. If that does not work out, we can fall back to a backend implementation.
> but I'm not sure about the security aspect of this decision.
If we go with the implementation in the browser, my guess is that the big question would be "can a malicious PDF hurt the user in any way?". But, on the other hand, Firefox uses a similar approach and considers it secure, and we do not expect users to try to parse random PDFs from malicious actors. (Should we? :slightly_smiling_face: )
If we go with the implementation on the backend, then it's mostly about trusting that the backend does what is expected with the PDF and nothing more. If the service is open source, everyone can see what it is doing with the provided data. Then it's mostly a matter of trusting that we haven't made any malicious adjustments while deploying it.
I am not aware of any interest on our part in hoarding PDFs, and given the whole direction the EU is taking with the protection of personal data (see GDPR, for example), my guess is that it is in our best interest not to store them.
Yep, FileReader can be used to read a file that the user provides.
But instead of fileReader.readAsText(file) we can do fileReader.readAsBinaryString(file) or fileReader.readAsArrayBuffer(file), depending on what works best for getting the data to pass to pdf-parse.
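To make the FileReader suggestion above concrete, here is a minimal sketch, assuming a browser environment and a File object taken from a file input. The helper names readFileAsArrayBuffer and looksLikePdf are made up for illustration; only FileReader and readAsArrayBuffer are standard browser APIs. The header check relies on the fact that well-formed PDF files normally start with the bytes "%PDF-".

```javascript
// Wrap the callback-based FileReader in a Promise (browser only).
function readFileAsArrayBuffer(file) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(reader.result); // an ArrayBuffer
    reader.onerror = () => reject(reader.error);
    reader.readAsArrayBuffer(file);
  });
}

// Cheap sanity check before handing the bytes to a parser:
// compare the first five bytes with the ASCII string "%PDF-".
function looksLikePdf(arrayBuffer) {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  const length = Math.min(magic.length, arrayBuffer.byteLength);
  const bytes = new Uint8Array(arrayBuffer, 0, length);
  return bytes.length === magic.length && magic.every((b, i) => bytes[i] === b);
}

// Hypothetical usage in a page with <input type="file" id="cv">:
// document.getElementById('cv').addEventListener('change', async (event) => {
//   const buffer = await readFileAsArrayBuffer(event.target.files[0]);
//   if (looksLikePdf(buffer)) { /* pass new Uint8Array(buffer) to the parser */ }
// });
```

An ArrayBuffer is probably the most convenient of the three options here, since it can be wrapped in a Uint8Array without any string conversion.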
> P.S. Do I understand correctly we can use this on frontend with something like webpack?
To me, right now it looks like we can. It uses pdf-js, which should work in browsers (Firefox uses it).
If anyone tries it out and provides some kind of proof of concept of extracting text from a PDF uploaded via FileReader, we would be grateful. :slightly_smiling_face:
@ivan-kolmychek-devv I tried to get a proof of concept and sort of succeeded. I successfully set up the upload form, selected a .pdf file and had its output logged to the console.
However, there's one problem that I may need help with. The pdf.js package, which is the basis for all of the PDF libraries, requires a worker, to which it sends GET requests, e.g.:
express:router dispatching GET /static/pdf.worker.js
express:router dispatching GET /0.js
Two issues:
1. The worker is served as text/html with the Vue app's scaffold. The pdf.js package complains because it expects application/javascript. Surprisingly, it still parses the PDF file and logs the output.
2. The 0.js script is apparently generated by the pdf.js package. I have not figured out what 0.js does exactly, or how to customize its generation path. Nothing is spawned in the root directory. While it does not seem to cause issues, I'd rather know what it is.
I coded the logic directly against the pdf.js interface because packages like pdf-parse and pdf2json caused issues during setup. It should not be a problem to resolve the issues with those packages, however.
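For anyone following along, coding directly against the pdf.js interface might look roughly like the sketch below. It is not the code from the proof of concept; the function names extractPdfText and joinTextItems are hypothetical. It relies on the pdf.js calls getDocument(...).promise, getPage and getTextContent, and the library object is passed in as a parameter so the caller decides how it is loaded (for example via pdfjs-dist).

```javascript
// Join the text items of one page into a single string.
// pdf.js text items carry their text in the `str` property.
function joinTextItems(items) {
  return items.map((item) => item.str).join(' ');
}

// Extract the text of every page, one line per page.
// `pdfjsLib` is the loaded pdf.js library; `data` is a Uint8Array or
// ArrayBuffer holding the PDF bytes (for example from FileReader).
async function extractPdfText(pdfjsLib, data) {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  // pdf.js page numbers start at 1.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber += 1) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    pages.push(joinTextItems(content.items));
  }
  return pages.join('\n');
}

// Hypothetical usage:
// extractPdfText(pdfjsLib, new Uint8Array(buffer)).then((text) => console.log(text));
```

Joining items with a single space is a simplification: it ignores the layout information pdf.js provides, so columns and tables in a CV may come out in an odd order.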
I have not submitted a PR because of the outlined issues.
Interesting.
I am not sure I have answers to these issues right now, as I don't have any experience with these libs. I hope someone who does will pop up in the thread, but until then let's try to sort it out. :slightly_smiling_face:
> I have not submitted a PR because of the outlined issues.
Not all contributions have to be in the form of PRs. You have spent your time researching the issue and wrote up the results here. To me this sounds like a sizable contribution already. :slightly_smiling_face:
> I coded the logic dependent directly on the pdf.js interface because packages like pdf-parse and pdf2json caused issues with setting them up. It should not be a problem to resolve the issues with the packages, however.
Can you share which issues you stumbled upon? That way either someone can share a way to resolve them, or at least everyone else who wants to take a shot at this issue will be warned about them. :slightly_smiling_face:
> However, there's one problem that I may need help with. The pdf.js package which is the basis for all of the PDF libraries, requires a worker, to which it sends GET requests ...
Just to clarify - worker as in a separate process running on the user's machine? Or ServiceWorkers? Or WebWorkers?
> the worker is served as text/html with the Vue's app scaffold. The pdf.js package complains because it expects application/javascript. Surprisingly it still parses the pdf file and logs output.
Do I understand correctly that there is an issue with how the webpack-dev-server serves the files from pdf.js? Or are we talking about some other (separate) server that you have to run?
Please let me know if I misunderstand something; I am pretty sure your knowledge and experience on this are greater than mine right now. :slightly_smiling_face:
M'eh, the culprit was a typo in the worker's path.
HTML was rendered because when no path was found, the app would reply with a basic HTML scaffold (not the 404 page). I solved (?) the problem by requiring pdfjs-dist/webpack to have the worker configured automatically.
If needed, you can actually copy the worker over to the static folder and point the library to it. I explained how in the comment in the source code inside the CvUpload component.
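A rough sketch of the manual variant and of a check that would have caught the typo, for reference. This is not the code from the CvUpload component; the helper names are hypothetical, and the /static/pdf.worker.js path is taken from the router log earlier in the thread. GlobalWorkerOptions.workerSrc is the pdf.js setting that holds the worker URL, and fetch is assumed to be available in the browser.

```javascript
// True when a Content-Type header value denotes JavaScript rather than,
// for example, the HTML scaffold the app returns for unknown paths.
function isJavaScriptContentType(contentType) {
  const mime = (contentType || '').split(';')[0].trim().toLowerCase();
  return mime === 'application/javascript' || mime === 'text/javascript';
}

// Browser-side check: resolves to true only if the worker URL exists
// and is served with a JavaScript MIME type.
async function checkWorkerUrl(workerUrl) {
  const response = await fetch(workerUrl);
  return response.ok && isJavaScriptContentType(response.headers.get('Content-Type'));
}

// Point pdf.js at a worker script that the app serves itself.
function configureWorker(pdfjsLib, workerUrl) {
  pdfjsLib.GlobalWorkerOptions.workerSrc = workerUrl;
  return pdfjsLib;
}

// Hypothetical usage:
// configureWorker(pdfjsLib, '/static/pdf.worker.js');
// checkWorkerUrl('/static/pdf.worker.js').then((ok) => {
//   if (!ok) console.warn('pdf.js worker is missing or served with the wrong type');
// });
```

Because the app answers unknown paths with the HTML scaffold and a successful status, checking the Content-Type is what distinguishes a mistyped worker path from a working one here.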
By the way, the mysterious GET request to 0.js disappeared once I fixed the typo, whether or not the worker is auto-configured.
P. S.: I found the Express router debug output to be quite useful. Just add DEBUG=express:router to the candidates script in package.json. Additionally, you can define a separate script for convenience:
"scripts": {
  "candidates-debug": "cross-env APP_NAME=candidates DEBUG=express:router webpack-dev-server --inline --progress --config build/webpack.dev.conf.js"
}
Parsing from https://github.com/DevvJobs/candidates-web/pull/34 (thanks @lilsweetcaligula) is now enabled in the development version. Right now it extracts all the text from the PDF.
We have briefly discussed internally how we should proceed. We think it's best to proceed with skills, and that we should have an extra step before submitting results, so that it's easy to see exactly which skills will be created, delete what you don't want to be sent, fill in any missing information, fix anything that's wrong, and so on.
Our suggestion would be to do it in these steps:
Feedback and suggestions are welcome.
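To give the discussion something concrete, here is one possible way to turn the extracted text into a list of skill candidates for the review step. It is only a sketch under assumptions: the function names are hypothetical, and it assumes a list of known skill names is available on the client, which has not been decided anywhere in this thread.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the known skills mentioned in the text, in the order of knownSkills.
// Matching is case-insensitive and requires a non-alphanumeric character
// (or the start/end of the text) on both sides, so "Java" does not match
// inside "JavaScript".
function findSkillCandidates(text, knownSkills) {
  return knownSkills.filter((skill) => {
    const pattern = new RegExp(
      '(^|[^A-Za-z0-9])' + escapeRegExp(skill) + '($|[^A-Za-z0-9])',
      'i'
    );
    return pattern.test(text);
  });
}

// Example:
// findSkillCandidates('Worked with JavaScript, Vue and C++ for 3 years.',
//                     ['Java', 'JavaScript', 'Vue', 'C++', 'Go']);
// → ['JavaScript', 'Vue', 'C++']
```

The result would only pre-fill the review step; nothing is sent until the user has removed, added or corrected entries.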
Created https://github.com/DevvJobs/candidates-web/issues/37 for step A of the plan.
As part of making signup as easy as possible, we are wondering whether there would be any benefit in adding parsing of an uploaded document and the ability to add a link to be sourced and imported. We are not sure; what do you think?