FormFyxer

A Python package with a collection of functions for learning about and pre-processing PDF forms and their associated form fields. This processing is done with an eye towards interoperability with the Suffolk LIT Lab's Document Assembly Line Project.

This repository is the engine for RateMyPDF. It is described in a paper published in the proceedings of ICAIL '23; see the preferred citation at the end of this README.

Installation and updating

Use the package manager pip to install FormFyxer. Rerun this command to check for and install updates directly from GitHub.

pip install git+https://github.com/SuffolkLITLab/FormFyxer

If you are on Mac or Windows, you'll need to install poppler for your respective platform. If you use Anaconda, simply run conda install poppler. Otherwise, follow the poppler installation instructions for your platform.

Testing

TOOLS_TOKEN=<your_token_here> ISUNITTEST=True python -m unittest formfyxer.tests.cluster_test

You should test with and without TOOLS_TOKEN, and make sure that both pass.

Functions

Functions from pdf_wrangling are found on our documentation site.

formfyxer.re_case(text)

Reformats snake_case, camelCase, and similarly-formatted text into individual words.
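
To illustrate the kind of transformation described here, the following is a minimal sketch using only Python's standard library. It is not FormFyxer's implementation; the function name split_identifier and the exact splitting rules are assumptions made for this example.

```python
import re

def split_identifier(text: str) -> str:
    """Split snake_case, kebab-case, and camelCase text into space-separated words.

    Illustrative only: this is not the code behind formfyxer.re_case().
    """
    # Treat underscores and hyphens as word separators.
    text = re.sub(r"[_\-]+", " ", text)
    # Insert a space at lower-to-upper boundaries: "userName" -> "user Name".
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Separate an acronym from a following capitalized word: "PDFFile" -> "PDF File".
    text = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", text)
    # Collapse any repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(split_identifier("users_first_name"))
print(split_identifier("plaintiffMailingAddress"))
```

The sketch preserves the original capitalization; whether re_case() does the same is not stated in this README.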

Parameters:

formfyxer.regex_norm_field(text)

Given an auto-generated field name (e.g., one applied by a PDF editor's find form fields function), this function uses regular expressions to replace common auto-generated field names with the corresponding names from our standard field names.
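
The general pattern of rule-based replacement can be sketched as an ordered list of regular expressions, each paired with a replacement name. The rules and target names below (users1_name, users1_birthdate, docket_number) are hypothetical and chosen only for illustration; FormFyxer maintains its own rule set, which is not reproduced here.

```python
import re

# Hypothetical rules for illustration only; these are not FormFyxer's rules.
RULES = [
    (re.compile(r"^(your|my)\s+name$", re.IGNORECASE), "users1_name"),
    (re.compile(r"^(date\s+of\s+birth|dob)$", re.IGNORECASE), "users1_birthdate"),
    (re.compile(r"^(case|docket)\s+(number|no\.?)$", re.IGNORECASE), "docket_number"),
]

def apply_rules(text: str) -> str:
    """Return the replacement for the first matching rule, or the cleaned text."""
    cleaned = text.strip()
    for pattern, replacement in RULES:
        if pattern.search(cleaned):
            return replacement
    # No rule matched: hand the text back unchanged so a later step can handle it.
    return cleaned

print(apply_rules("Date of Birth"))
print(apply_rules("Case No."))
print(apply_rules("Text12"))
```

Keeping the rules in an ordered list means more specific patterns can be placed ahead of more general ones.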

Parameters:

formfyxer.reformat_field(text,max_length=30)

Given a string of words, this function provides a summary of the string's semantic content by boiling it down to a few words. It then reformats these keywords into snake_case.
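
As a rough sketch of the idea, the example below drops common filler words, joins the remaining keywords with underscores, and stops adding words once max_length would be exceeded. The stop-word list and the function name to_snake_keywords are assumptions for this example; reformat_field() itself may select keywords quite differently.

```python
import re

# A tiny, hypothetical stop-word list; a real summarizer would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "are", "is", "of", "or", "the", "to",
    "you", "your", "please", "what",
}

def to_snake_keywords(text: str, max_length: int = 30) -> str:
    """Reduce text to a snake_case string of keywords no longer than max_length."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    keywords = [word for word in words if word not in STOP_WORDS]
    result = ""
    for word in keywords:
        candidate = f"{result}_{word}" if result else word
        # Stop before the name grows past the allowed length.
        if len(candidate) > max_length:
            break
        result = candidate
    return result

print(to_snake_keywords("What is the name of your landlord?"))
print(to_snake_keywords("Please enter the mailing address of the defendant", max_length=15))
```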

Parameters:

formfyxer.normalize_name(jur,group,n,per,last_field,this_field)

This function uses the above functions to produce a field name conforming to the format of our standard field names. It first applies re_case() to the text of a field, then applies regex_norm_field(). If a standard field name is NOT found, it uses a machine learning model we have trained to classify the text as one of our standard field names. If the model is confident in a classification, it changes the text to that field name. If it is uncertain, it applies reformat_field(). The end result is that you can feed in a field name and receive output that has been converted into either one of our standard fields or a string of similar formatting.
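
The cascade described above (rules first, then a classifier if it is confident, then a generic reformat) can be sketched as follows. Everything here is a stand-in: the function names, the 0.5 confidence threshold, and the demo components are assumptions for illustration and do not reflect the internals of normalize_name().

```python
from typing import Callable, Optional, Tuple

def normalize_with_fallbacks(
    text: str,
    rule_lookup: Callable[[str], Optional[str]],
    classify: Callable[[str], Tuple[str, float]],
    fallback: Callable[[str], str],
    threshold: float = 0.5,  # assumed cutoff, not FormFyxer's actual value
) -> Tuple[str, str]:
    """Return (field_name, source), where source names the step that produced it."""
    # Step 1: a rule-based match wins outright.
    standard = rule_lookup(text)
    if standard is not None:
        return standard, "rule"
    # Step 2: accept the classifier's label only when it is confident enough.
    label, confidence = classify(text)
    if confidence >= threshold:
        return label, "model"
    # Step 3: otherwise fall back to a generic reformatting of the text.
    return fallback(text), "fallback"

# Stand-in components, invented for this example.
def demo_rules(text: str) -> Optional[str]:
    return {"date of birth": "users1_birthdate"}.get(text.lower())

def demo_classifier(text: str) -> Tuple[str, float]:
    # Pretend the model is confident only when it sees the word "name".
    return ("users1_name", 0.9) if "name" in text.lower() else ("users1_name", 0.1)

def demo_fallback(text: str) -> str:
    return "_".join(text.lower().split())

for raw in ["Date of Birth", "Your name", "Favorite color"]:
    print(raw, "->", normalize_with_fallbacks(raw, demo_rules, demo_classifier, demo_fallback))
```

Returning the source alongside the name makes it easy to audit how often each step is responsible for the final field name.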

Parameters:

formfyxer.vectorize(text,normalize=0)

A simple wrapper for spaCy's word2vec vectorization of a string.

Parameters:


formfyxer.spot(text,lower=0.25,pred=0.5,upper=0.6,verbose=0)

A simple wrapper for the LIT Lab's NLP issue spotter Spot. In order to use this feature you must edit the spot_token.txt file found in this package to contain your API token. You can sign up for an account and get your token on the Spot website.

Given a string, this function will return a list of LIST entities/issues found in the text. Items are filtered by estimates of how likely they are to be present. The values dictating this filtering are controlled by the optional lower, pred, and upper parameters. These refer to the lower bound of the predicted likelihood that an entity is present, the predicted likelihood that it is present, and the upper bound of this prediction, respectively.

Parameters:

formfyxer.spot("my landlord kicked me out", verbose=1)

{'build': 9,
 'query-id': '1efa5a098bc24f868684339f638ab7eb',
 'text': 'my landlord kicked me out',
 'save-text': 0,
 'cutoff-lower': 0.25,
 'cutoff-pred': 0.5,
 'cutoff-upper': 0.6,
 'labels': [{'id': 'HO-00-00-00-00',
   'name': 'Housing',
   'lower': 0.6614134886446631,
   'pred': 0.7022160833303629,
   'upper': 0.7208275781222152,
   'children': [{'id': 'HO-02-00-00-00',
     'name': 'Eviction from a home',
     'lower': 0.4048013980740931,
     'pred': 0.5571460102525152,
     'upper': 0.6989976788434928},
    {'id': 'HO-05-00-00-00',
     'name': 'Problems with living conditions',
     'lower': 0.3446066253503793,
     'pred': 0.5070074487913626,
     'upper': 0.6326627767849852},
    {'id': 'HO-06-00-00-00',
     'name': 'Renting or leasing a home',
     'lower': 0.6799417713794678,
     'pred': 0.8984004824420323,
     'upper': 0.9210222500232965,
     'children': [{'id': 'HO-02-00-00-00',
       'name': 'Eviction from a home',
       'lower': 0.4048013980740931,
       'pred': 0.5571460102525152,
       'upper': 0.6989976788434928}]}]}]}
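
To show how the three cutoffs interact, here is a small sketch that walks a label tree shaped like the verbose response above and keeps the ids whose lower, pred, and upper values all meet their cutoffs. This is an assumption about how such filtering could be done on the client side; the Spot service applies its own filtering, and the helper name collect_labels is invented for this example. The sample data is a trimmed copy of the response shown above.

```python
from typing import Dict, List

def collect_labels(labels: List[Dict], lower: float = 0.25,
                   pred: float = 0.5, upper: float = 0.6) -> List[str]:
    """Return the ids of all labels (at any depth) that meet every cutoff."""
    found: List[str] = []
    for label in labels:
        meets_cutoffs = (
            label["lower"] >= lower
            and label["pred"] >= pred
            and label["upper"] >= upper
        )
        if meets_cutoffs and label["id"] not in found:
            found.append(label["id"])
        # Children are checked independently of their parent in this sketch.
        for child_id in collect_labels(label.get("children", []), lower, pred, upper):
            if child_id not in found:
                found.append(child_id)
    return found

# Trimmed copy of the sample response's labels.
sample_labels = [
    {"id": "HO-00-00-00-00", "name": "Housing",
     "lower": 0.6614134886446631, "pred": 0.7022160833303629, "upper": 0.7208275781222152,
     "children": [
         {"id": "HO-02-00-00-00", "name": "Eviction from a home",
          "lower": 0.4048013980740931, "pred": 0.5571460102525152, "upper": 0.6989976788434928},
         {"id": "HO-05-00-00-00", "name": "Problems with living conditions",
          "lower": 0.3446066253503793, "pred": 0.5070074487913626, "upper": 0.6326627767849852},
     ]},
]

print(collect_labels(sample_labels))             # default cutoffs
print(collect_labels(sample_labels, pred=0.55))  # stricter predicted likelihood
```

Raising any one cutoff narrows the result; for example, a higher pred drops labels whose point estimate is only slightly above one half.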



formfyxer.guess_form_name(text)

An OpenAI-enabled tool that will guess the name of a court form given the full text of the form. In order to use this feature you must edit the openai_org.txt and openai_key.txt files found in this package to contain your OpenAI credentials. You can sign up for an account and get your token on the OpenAI signup page.

Given a string containing the full text of a court form, this function will return its best guess for the name of the form.

Parameters:

formfyxer.plain_lang(text)

An OpenAI-enabled tool that will rewrite a text into a plain language draft. In order to use this feature you must edit the openai_org.txt and openai_key.txt files found in this package to contain your OpenAI credentials. You can sign up for an account and get your token on the OpenAI signup page.

Given a string, this function will return its attempt at rewriting the string in plain language.

Parameters:

formfyxer.describe_form(text)

An OpenAI-enabled tool that will write a draft plain language description for a form. In order to use this feature you must edit the openai_org.txt and openai_key.txt files found in this package to contain your OpenAI credentials. You can sign up for an account and get your token on the OpenAI signup page.

Given a string containing the full text of a court form, this function will return a draft description of the form written in plain language.

Parameters:

formfyxer.parse_form(fileloc,title=None,jur=None,cat=None,normalize=1,use_spot=0,rewrite=0)

Reads in a PDF with pre-existing form fields, pulls out basic stats, attempts to normalize its field names, and rewrites the file with the new fields (if rewrite=1).

Parameters:

formfyxer.cluster_screens(fields,damping=0.7)

This function will take a list of snake_case field names and group them by semantic similarity.
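
The sketch below illustrates grouping snake_case names, but with a deliberately simpler, swapped-in technique: greedy grouping by token overlap (Jaccard similarity) rather than semantic similarity. It does not use the damping parameter and is not how cluster_screens() works; the function name group_fields and the 0.3 threshold are assumptions for this example.

```python
from typing import List

def group_fields(fields: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedily group snake_case names that share enough underscore-separated tokens."""
    groups: List[List[str]] = []
    for field in fields:
        tokens = set(field.split("_"))
        best_group, best_score = None, 0.0
        for group in groups:
            # Compare against the union of tokens already in the group.
            group_tokens = set().union(*(set(name.split("_")) for name in group))
            score = len(tokens & group_tokens) / len(tokens | group_tokens)
            if score > best_score:
                best_group, best_score = group, score
        if best_group is not None and best_score >= threshold:
            best_group.append(field)
        else:
            # Nothing similar enough yet: start a new group.
            groups.append([field])
    return groups

fields = [
    "users1_name_first",
    "users1_name_last",
    "landlord_address_city",
    "landlord_address_zip",
]
for group in group_fields(fields):
    print(group)
```

Token overlap only catches names that literally share words, so it would miss pairs a semantic model could relate, such as a phone field and an email field.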

Parameters:

formfyxer.get_sensitive_data_types(fields, fields_old)

Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive fields grouped by type. A list of the old field names can also be provided. These fields should be in the same order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value will not contain the old field name, only the corresponding field name from the first parameter.

The sensitive field types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security Number.
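
A minimal sketch of the input and output shapes described above follows. The keyword patterns and the function name find_sensitive are hypothetical; the matching used by get_sensitive_data_types() is not documented in this README and may differ. The sketch checks each new field name together with its old name at the same position, and records only the new name in the result.

```python
import re
from typing import Dict, List, Optional

# Hypothetical keyword patterns, one per sensitive type listed above.
PATTERNS = {
    "Bank Account Number": re.compile(r"bank.*account", re.IGNORECASE),
    "Credit Card Number": re.compile(r"credit.*card", re.IGNORECASE),
    "Driver's License Number": re.compile(r"driver.*licen[sc]e", re.IGNORECASE),
    "Social Security Number": re.compile(
        r"social.*security|(?<![a-z])ssn(?![a-z])", re.IGNORECASE
    ),
}

def find_sensitive(fields: List[str],
                   old_fields: Optional[List[str]] = None) -> Dict[str, List[str]]:
    """Group sensitive field names by type, using old names as extra evidence."""
    old_fields = old_fields or [""] * len(fields)
    found: Dict[str, List[str]] = {}
    # Both lists are assumed to be in the same order, so zip pairs them up.
    for new_name, old_name in zip(fields, old_fields):
        haystack = f"{new_name} {old_name}"
        for label, pattern in PATTERNS.items():
            if pattern.search(haystack):
                # Only the new field name is recorded, never the old one.
                found.setdefault(label, []).append(new_name)
                break
    return found

print(find_sensitive(
    ["users1_name", "users1_ssn", "id_number"],
    ["Name", "SSN", "Driver's License No."],
))
```

In this example the third field is only recognized because its old name mentions a driver's license, which shows why passing the old names can improve matching.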

Parameters:

License

MIT

Preferred citation format

Please cite this repository as follows:

Quinten Steenhuis, Bryce Willey, and David Colarusso. 2023. Beyond Readability with RateMyPDF: A Combined Rule-based and Machine Learning Approach to Improving Court Forms. In Proceedings of International Conference on Artificial Intelligence and Law (ICAIL 2023). ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3594536.3595146

Bibtex format:

@article{Steenhuis_Willey_Colarusso_2023,
  title={Beyond Readability with RateMyPDF: A Combined Rule-based and Machine Learning Approach to Improving Court Forms},
  DOI={https://doi.org/10.1145/3594536.3595146},
  journal={Proceedings of International Conference on Artificial Intelligence and Law (ICAIL 2023)},
  author={Steenhuis, Quinten and Willey, Bryce and Colarusso, David},
  year={2023},
  pages={287–296}
}