
codeforboston / windfall-elimination

Windfall Elimination Provision Awareness Project: improving the experience of retirees around an obscure Social Security rule affecting 8+ states (previously https://ssacalculator.org)

https://windfall-develop.netlify.app/

MIT License

24 stars, 45 forks

PDF Parser #12

Open nvanwitt opened 5 years ago

nvanwitt commented 5 years ago

We need a functioning PDF parser for an individual's earnings record.

Status? If we have working code:
- Test with a large dataset (the team's earnings records).
- Add to the React app.

If not:
- Write a PDF parser (Tesseract?).
- Test sets: individuals' earnings records (get from MySSA).
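Whichever tool ends up extracting the text (a PDF library or OCR such as Tesseract), the step after it is turning raw text into year/earnings pairs. A minimal sketch of that step follows. It assumes each relevant line starts with a four-digit year followed by a dollar amount; the actual layout of the MySSA earnings record may differ. `EarningsRow` and `parseEarningsText` are hypothetical names, not existing project code.

```typescript
// Hypothetical shape for one row of an earnings record.
interface EarningsRow {
  year: number;
  earnings: number;
}

// Assumption: each relevant line of extracted text looks like "1998 $12,345".
// Lines that do not match (headers, footers, OCR noise) are skipped.
function parseEarningsText(text: string): EarningsRow[] {
  const rowPattern = /^\s*((?:19|20)\d{2})\s+\$?\s*([\d,]+(?:\.\d{1,2})?)/;
  const rows: EarningsRow[] = [];
  for (const line of text.split(/\r?\n/)) {
    const match = rowPattern.exec(line);
    if (match === null) continue;
    rows.push({
      year: Number(match[1]),
      // Strip thousands separators before converting to a number.
      earnings: Number(match[2].replace(/,/g, "")),
    });
  }
  return rows;
}
```

Keeping this as a pure function of text means it can be tested against the team's records regardless of which extraction tool is chosen.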

dylanesque commented 5 years ago

https://www.gatsbyjs.org/packages/gatsby-transformer-pdf/?=pdf

dylanesque commented 5 years ago

The question I have for this is, do we know yet where data from uploaded files like PDFs or XML files is going to live in terms of file structure? Once we figure that out, parsing those values should be fairly simple using the corresponding transformer plugins.

nvanwitt commented 5 years ago

@dylanesque Good question. I think we were leaning towards in-browser storage to avoid storing personal data on a server/filesystem, but that may change in the future if we add more functionality to the site. Although, if we're going to store anything for future use (i.e., a user returns to pick up where they left off), we might be able to narrow it down to the important info without having to store the records themselves.
What do you think?

Side note: if you're looking to test things out for this, I have a branch going with file upload, XML parsing, and Redux storage. Feel free to play with it.

dylanesque commented 5 years ago

Sure thing. My concern here is that Gatsby is not great at instant updates based on data passed in (to the back-end), and yes, it doesn't make sense to store user data on the site in any way here. Worst case scenario, we have to shift to using React or React with Next.js if technical difficulties occur with Gatsby, which isn't too terrible. I'm curious as to what Alex thinks about this.

thadk commented 5 years ago

Thanks for all your thinking on this everyone! I hope to see you on Tuesday.

My initial impression is that we need to start with the example we have. We might need optical character recognition more than PDF support if some users just take photos of the page(s). You can find an example a bit like that in the project's shared folder.

With a picture, it could be more like a JPG than a PDF.
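One way to handle both cases is to inspect the first few bytes of the upload and route photos to OCR. The sketch below checks the standard file signatures for PDF, JPEG, and PNG. `UploadKind`, `detectUploadKind`, and `needsOcr` are hypothetical names for illustration. Note that a scanned PDF may contain only page images, so a "pdf" result does not guarantee extractable text.

```typescript
type UploadKind = "pdf" | "jpeg" | "png" | "unknown";

// Classify an upload by its leading bytes (file signature).
function detectUploadKind(bytes: Uint8Array): UploadKind {
  // True when the file begins with the given byte sequence.
  // Files shorter than the signature simply fail the comparison.
  const startsWith = (signature: number[]): boolean =>
    signature.every((value, index) => bytes[index] === value);

  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "pdf"; // "%PDF"
  if (startsWith([0xff, 0xd8, 0xff])) return "jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "png"; // 0x89 "PNG"
  return "unknown";
}

// Assumption: image uploads always go to OCR; PDFs are tried as text first.
function needsOcr(kind: UploadKind): boolean {
  return kind === "jpeg" || kind === "png";
}
```

In the browser, the bytes could come from `new Uint8Array(await file.arrayBuffer())` on the uploaded `File`.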