Open alanjosephwilliams opened 10 years ago
This is one of those ideas that are equally mad and intriguing. I would love to see an MVP of this.
@alanjosephwilliams I can help spelunk! Do you have a repo or Dropbox folder for this?
Code for DC's "District Housing" project ( https://github.com/codefordc/districthousing ) takes a similar approach to this, specifically for Section 8-eligible housing applications. cc @jrunningen @jposi who seem to be active in that project.
cc @mlouie
I can give a synopsis of how the District Housing Rails app does it.
We have a database schema that models the information. The structure of District Housing's web form very closely matches the structure of this schema.
We have a standard naming convention for PDF field names. If you're editing a PDF with Acrobat, and name the fillable fields according to this standard, then District Housing can compute the information that belongs in that field from the contents of the database.
The code doing the translation from the database to PDF can be found in the #value_for_field methods of our various models. It's generally a giant case statement that uses string and regex matching to find the right answer, or delegate it to another model. For example, see the Person model.
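To make the pattern concrete, here is a minimal sketch of the same idea in Python rather than the app's Ruby. The field names (`FirstName`, `DOB`, the `Spouse` prefix) and the `Person` attributes are invented for illustration and are not District Housing's actual naming convention.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    first_name: str
    last_name: str
    dob: str  # ISO date string, e.g. "1980-02-29"
    spouse: Optional["Person"] = None

    def value_for_field(self, field_name: str) -> str:
        """Return the text that belongs in a PDF field, or "" if unknown."""
        # Delegate "Spouse<Something>" to the spouse record, if there is one.
        match = re.fullmatch(r"Spouse(\w+)", field_name)
        if match:
            if self.spouse is None:
                return ""
            return self.spouse.value_for_field(match.group(1))
        if field_name == "FirstName":
            return self.first_name
        if field_name == "LastName":
            return self.last_name
        if field_name == "FullName":
            return f"{self.first_name} {self.last_name}"
        if field_name == "DOB":
            return self.dob
        # Unrecognized field names are left blank rather than raising.
        return ""
```

The point of the convention is that a form author only has to name a field `SpouseLastName` in Acrobat; no per-form code is needed as long as the name follows the rules the model understands.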
The District Housing standard naming convention for PDF field names fills me with hope for a better future! The field names in PDFs I've seen have generally been pretty bizarre/arbitrary.
If we're trying to collect a (likely enormous) existing set of PDFs for analysis, how about writing a scraper that grabs all fillable PDFs from .gov sites and maybe also records their field names?
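A rough sketch of the first half of such a scraper, using only the Python standard library: given a page's HTML, collect the absolute URLs of linked PDFs. The class and function names are made up for this example. Fetching pages, crawling politely, and reading field names out of each downloaded file (which would need a PDF library) are left out.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check the path only, so query strings like ?v=2 do not hide a PDF.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            self.pdf_urls.append(absolute)


def find_pdf_links(html: str, base_url: str) -> List[str]:
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls
```

This only finds candidate PDFs; whether a given file is actually fillable can only be decided after downloading and inspecting it.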
We did something like this for an MVP of Parks and Rec in the Greater Las Vegas area (because we have 4 parks and rec departments). Sadly, most of the forms weren't editable, so we converted the PDFs to images, mapped where each field needed to be filled in, and used ImageMagick to add the text in the right places. Then we let people print the results, since the agencies required the forms on paper, not online.
It's definitely more work per form, but it doesn't require any assistance from agencies that may not be willing, and it doesn't limit you to editable PDFs.
Note: this is a suggestion in addition to, not in place of, the current ones.
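For anyone curious what the image-based approach looks like, here is a small sketch that builds an ImageMagick `convert` command from a table of field positions. The coordinates, field keys, and file names are placeholders, not values from the Las Vegas project, and the function only builds the argument list; it does not run anything.

```python
from typing import Dict, List, Tuple

# Hypothetical pixel coordinates (x, y) of each blank on a scanned form page.
FIELD_POSITIONS: Dict[str, Tuple[int, int]] = {
    "first_name": (120, 310),
    "last_name": (420, 310),
}


def annotate_command(src: str, dest: str, answers: Dict[str, str],
                     pointsize: int = 14) -> List[str]:
    """Build an ImageMagick `convert` argument list that draws each answer."""
    args = ["convert", src, "-pointsize", str(pointsize), "-fill", "black"]
    for field, (x, y) in FIELD_POSITIONS.items():
        if field in answers:
            # -annotate takes an offset geometry followed by the text to draw.
            args += ["-annotate", f"+{x}+{y}", answers[field]]
    args.append(dest)
    return args
```

Passing the list to `subprocess.run(..., check=True)` would run it without going through a shell, which avoids quoting problems with names that contain spaces or apostrophes.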
@daguar and I have been discussing a related MVP that would be useful in a current CfA project.
It would be a redeployable web app that would enable the following set of actions:
https://www.pdffiller.com/ did this
@daguar your comment just reminded me that I ended up making this as a little open source experiment: https://pdfhook.herokuapp.com
BLUF:
In the course of our work on Clean we have started thinking more generically about the process of collecting key information about an individual through a web form, and then using that data to populate one or more existing paper/PDF forms. The idea: if we map the location of the "first name" field on each of these forms, data could be submitted once and written to many forms using PDFtk.
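As a sketch of the "submit once, write to many forms" step: keep one mapping per form from a canonical key to that PDF's own field name, then generate an XFDF file per form for PDFtk's `fill_form` operation. The form file names, canonical keys, and PDF field names below are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical per-form mappings: canonical key -> that PDF's own field name.
FORM_FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "form_a.pdf": {"first_name": "FirstName", "last_name": "LastName"},
    "form_b.pdf": {"first_name": "applicant_fname"},
}


def to_xfdf(form: str, answers: Dict[str, str]) -> str:
    """Render the answers relevant to one form as an XFDF document."""
    root = ET.Element("xfdf", xmlns="http://ns.adobe.com/xfdf/")
    fields = ET.SubElement(root, "fields")
    for key, pdf_field in FORM_FIELD_MAPS[form].items():
        if key in answers:
            field = ET.SubElement(fields, "field", name=pdf_field)
            # ElementTree escapes characters such as & and < for us.
            ET.SubElement(field, "value").text = answers[key]
    return ET.tostring(root, encoding="unicode")
```

Each result can be saved to a file and applied with something like `pdftk form_a.pdf fill_form data.xfdf output filled_a.pdf`. Adding a new form then means adding one mapping, not new code.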
So, how about we collect all the PDF forms that one could potentially need to fill out with that personal information? In other words, let's try to collect all government PDF/paper forms available on the internet. Maybe we could start with a single state, like California. Let's learn whether obvious taxonomies for forms already exist, or whether we could craft some lightweight categories.
This idea is inspired by the spirit and tactical approach of OpenAddresses. OpenAddresses collects address data in any format as long as it has a stable URL on the web. The URL is housed within a JSON blob containing other metadata such as its origins, the relevant location, and the type of data. Anybody can submit a link to OpenAddresses. The project leads then write handlers to convert the diverse data into a common format and schema for use.
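To illustrate what a submitted record for a form might look like under this model, here is a sketch of a JSON blob and a tiny loader that checks it. The key names (`url`, `jurisdiction`, `agency`, `form_type`) are assumptions for discussion, not OpenAddresses' actual schema, and the URL is a placeholder.

```python
import json
from typing import Any, Dict

# Hypothetical minimum metadata for one form source.
REQUIRED_KEYS = {"url", "jurisdiction", "agency", "form_type"}

EXAMPLE_SOURCE = """
{
  "url": "https://example.com/forms/benefits-application.pdf",
  "jurisdiction": "us/ca",
  "agency": "Example County Human Services",
  "form_type": "benefits-application",
  "fillable": true
}
"""


def load_source(text: str) -> Dict[str, Any]:
    """Parse one source record and check it has the required metadata."""
    record = json.loads(text)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not record["url"].startswith(("http://", "https://")):
        raise ValueError("url must be an http(s) link")
    return record
```

Keeping one small file per source would let anybody submit a link through a pull request, with the handlers that normalize each form written later.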
A bit of a h/t to @daguar on this idea. I've asked him to come in and edit the body of this idea to flesh out the approach a bit more. @lippytak was involved too; he might also have color to add.