SuffolkLITLab / RateMyPDF

RateMyPDF is a website that helps paper form authors (particularly for court forms) improve the usability of their forms for self-represented litigants. It uses the FormFyxer library to deliver its insights.
https://ratemypdf.com
MIT License

Add a section that warns about some sensitive fields #25

Open nonprofittechy opened 1 year ago

nonprofittechy commented 1 year ago

Some fields on the PDF are going to raise concerns for the litigant. We should pull those out and highlight them in one of the accordions. For example:

KindBill commented 5 months ago

From the other ticket: (including synonyms)

From 5/22/24 call:

  1. Detection isn't implemented yet.
  2. Detection would need to be added to the FormFyxer library, or it could be a simple keyword match.
  3. The match is against the label of the field (flag the field if the keywords appear in the document).
  4. We should also create synonyms for the keywords.
  5. Output would go under Suggestions as a separate entry (Quinten can provide copy) to highlight the fields that involve sensitive data.
  6. Quinten recommends making the change in FormFyxer (referenced below).
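
The keyword-match option from the notes above could look roughly like the sketch below. This is only an illustration: the keyword table, its synonyms, and the function name `flag_sensitive_labels` are assumptions, not part of FormFyxer or RateMyPDF.

```python
import re

# Hypothetical keyword table: canonical label -> synonyms to look for.
# These entries are placeholders, not FormFyxer's actual list.
SENSITIVE_KEYWORDS = {
    "Social Security number": ["social security", "ssn"],
    "Credit card": ["credit card", "card number"],
}


def flag_sensitive_labels(labels):
    """Return {canonical label: [field labels that matched a synonym]}."""
    flagged = {}
    for label in labels:
        # Lowercase and collapse underscores/punctuation into single spaces
        text = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
        for canonical, synonyms in SENSITIVE_KEYWORDS.items():
            # Whole-word match so "ssn" does not fire inside longer words
            if any(re.search(rf"\b{re.escape(s)}\b", text) for s in synonyms):
                flagged.setdefault(canonical, []).append(label)
    return flagged
```

Calling `flag_sensitive_labels(["SSN", "Print name", "Credit Card Number"])` would flag the first and last labels and leave "Print name" alone.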

Reference:

  1. FormFyxer's generated: https://github.com/SuffolkLITLab/FormFyxer/blob/fce844760001c8998a3d40765bf894cf4f7d50cd/formfyxer/lit_explorer.py#L1212

codestronger commented 4 months ago

I think I've gotten enough of a handle on RateMyPDF and FormFyxer to implement this correctly now. Will work on the real implementation next.

Attached below is a mockup created w/ placeholder data. Let me know if you have any suggestions about the wording or relative positioning of the suggestion within the list.

sensitive-fields-mockup2

codestronger commented 4 months ago

Hmm, the form fields in the stats seem to be normalized identifiers (e.g. cid_0credit_card). If I use the fields, then the list displayed isn't very human-readable. I'm thinking instead of checking the fields, I can run the detection against the extracted text of the form and assume any matches are related to fields. This would result in more false positives. For example, it would flag SSN if the text contained "An SSN is not required." On the other hand, the fields listed in the suggestion would be much easier to read: Credit Card instead of cid_0credit_card.
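
To make the trade-off concrete, here is a small sketch that runs the same detection over field names and over extracted text. The patterns, the sample strings, and the `detect` helper are assumptions made up for illustration; they are not the real keyword list.

```python
import re

# Hypothetical patterns keyed by a human-readable label.
PATTERNS = {
    "SSN": re.compile(r"\b(ssn|social security)\b", re.IGNORECASE),
    "Credit Card": re.compile(r"credit[ _]?card", re.IGNORECASE),
}


def detect(strings):
    """Return the sorted readable labels whose pattern matches any string."""
    return sorted(
        label
        for label, pattern in PATTERNS.items()
        if any(pattern.search(s) for s in strings)
    )


# Made-up sample data: normalized field names vs. extracted form text
field_names = ["cid_0credit_card", "print_name"]
form_text = ["An SSN is not required.", "Credit Card number:"]

from_fields = detect(field_names)  # only the real credit card field
from_text = detect(form_text)      # also flags SSN, a false positive here
```

In this sample, the text-based run reports SSN even though the form only says an SSN is not required, which is the false-positive case described above.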

nonprofittechy commented 4 months ago

@codestronger I agree with your idea to search the full text of the form, not just the normalized field names.

codestronger commented 4 months ago

This idea didn't work out as well as I thought. Too many false positives. What I'm going to try next is go back to looking at the field names, but instead of displaying the field name in the UI, we'll group them into categories and display the category name.

nonprofittechy commented 4 months ago

Another thought: we can check the field names prior to normalization.

codestronger commented 4 months ago

Good idea! I'll take another look at those. I vaguely remember that they didn't look much different from the normalized names.

codestronger commented 4 months ago

Ended up using both the original field names and normalized field names. The normalized field names are generally more consistent to match against, but sometimes the rewrites are bad. Here's an example of one I ran into during my testing:

original: ['DOCKET NUMBER', 'COURT NAME ADDRESS', 'I am attorney of record for', 'plaintiff', 'defendant in the aboveentitled matter', 'Signature', 'Print name', 'Address 1', 'Address 2', 'Address 3', 'BBO', 'Text3', 'Text4', 'Text5', 'County', 'Clerks']

new_names: ['docke_number', 'cour_nam_e_address', 'attorney_record', 'plaintiff_name', 'defendant_aboveentitled_matter', '*users1_signature', 'print_name', 'address__1', 'address__2', 'address__3', 'bbo', 'text__1', 'text__2', 'text__3', 'county', 'clerks']

I used simple regex heuristics and bucketed matching fields into the set of:

Let me know if we want to adjust any of those names.
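
For anyone following along, checking both name forms and bucketing by category could be sketched like this. The category names and patterns below are placeholders chosen for the example, not the ones used in the PR, and the sketch assumes the original and normalized lists are in the same order.

```python
import re
from collections import defaultdict

# Placeholder categories and patterns, for illustration only.
CATEGORY_PATTERNS = {
    "Docket number": re.compile(r"docket", re.IGNORECASE),
    "Address": re.compile(r"address", re.IGNORECASE),
}


def bucket_fields(original, normalized):
    """Group fields by category; a field counts if either name form matches."""
    buckets = defaultdict(list)
    # Assumes original[i] and normalized[i] describe the same field
    for orig, norm in zip(original, normalized):
        for category, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(orig) or pattern.search(norm):
                # Report the original name, since it is easier to read
                buckets[category].append(orig)
    return dict(buckets)
```

With `bucket_fields(["DOCKET NUMBER", "Address 1"], ["docke_number", "address__1"])`, the docket field is still caught through its original name even though the normalized rewrite dropped the "t".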

codestronger commented 4 months ago

Example of the current sensitive fields suggestions:

sensitive_fields

Source Form:

test_form_sensitive_fields

codestronger commented 4 months ago

Adding this for future reference: https://www.mass.gov/doc/financial-statement-of-judgment-debtor/download

The PDF contains sensitive fields, but we are unable to detect them because the field recognition wasn't able to capture the information needed. Here's a dump of all the current field info:

#[Type: text, Name: division, tooltip: , X: 527, Y: 718, font_size: 7, Configs: {'fieldFlags': 'doNotScroll', 'width': 32.256000000000085, 'height': 7.775999999999954},
Type: text, Name: cid_2__cid_3_boston_municipal_court_cid_2__cid_3_district_court_cid_2__cid_3_housing_court, tooltip: , X: 489, Y: 706, font_size: 7, Configs: {'fieldFlags': 'doNotScroll', 'width': 31.96799999999996, 'height': 7.775999999999954},
Type: text, Name: page_0_field_2, tooltip: , X: 490, Y: 694, font_size: 7, Configs: {'fieldFlags': 'doNotScroll', 'width': 32.25600000000003, 'height': 7.775999999999954}]

codestronger commented 4 months ago

I attached a screenshot of how the sensitive data types suggestion looks, with each data type and its field names listed underneath. This corresponds to the latest code in the PR. Let me know if we want to make any tweaks!

sensitive-data-types