Open nonprofittechy opened 1 year ago
From the other ticket: (including synonyms)
From 5/22/24 call:
Reference:
I think I've gotten enough of a handle on RateMyPDF and FormFyxer to implement this correctly now. Will work on the real implementation next.
Attached below is a mockup created w/ placeholder data. Let me know if you have any suggestions about the wording or relative positioning of the suggestion within the list.
Hmm, the form fields in the stats seem to be normalized identifiers (e.g. cid_0credit_card). If I use the fields, the list we display won't be very human-readable. Instead of checking the fields, I could run the detection against the extracted text of the form and assume any matches are related to fields. This would produce more false positives: for example, it would flag SSN if the text contained "An SSN is not required." On the other hand, the fields listed in the suggestion would be much easier to read: Credit Card instead of cid_0credit_card.
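To make the tradeoff concrete, here is a minimal sketch of the two approaches. The pattern and the function names are hypothetical, made up for illustration; they are not the detection rules used in RateMyPDF or FormFyxer.

```python
import re

# Hypothetical pattern for illustration; the real detection rules are not shown in this thread.
SSN_PATTERN = re.compile(r"\b(ssn|social security)\b", re.IGNORECASE)


def flag_in_text(text: str) -> bool:
    """Full-text approach: flag the form if the pattern appears anywhere in its text."""
    return bool(SSN_PATTERN.search(text))


def flag_in_fields(field_names: list[str]) -> bool:
    """Field-name approach: flag the form only if a field name matches."""
    # Normalized names use underscores, so treat them as word separators.
    return any(SSN_PATTERN.search(name.replace("_", " ")) for name in field_names)


# Made-up inputs: the form text mentions SSN, but no field asks for one.
form_text = "An SSN is not required."
field_names = ["users1_name", "docket_number"]

print(flag_in_text(form_text))      # the full-text check flags this form (a false positive)
print(flag_in_fields(field_names))  # the field-name check does not
```

The full-text check is easy to read back to the user but cannot tell an instruction from a field, which is the false-positive risk described above.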
@codestronger I agree with your idea to search the full text of the form, not just the normalized field names.
This idea didn't work out as well as I'd hoped: too many false positives. Next I'll go back to matching on the field names, but instead of displaying each field name in the UI, we'll group the matches into categories and display the category name.
Another thought: we can check the field names prior to normalization.
Good idea! I'll take another look at those. I vaguely remember that they didn't look much different from the normalized names.
Ended up using both the original field names and normalized field names. The normalized field names are generally more consistent to match against, but sometimes the rewrites are bad. Here's an example of one I ran into during my testing:
original:
['DOCKET NUMBER', 'COURT NAME ADDRESS', 'I am attorney of record for', 'plaintiff', 'defendant in the aboveentitled matter', 'Signature', 'Print name', 'Address 1', 'Address 2', 'Address 3', 'BBO', 'Text3', 'Text4', 'Text5', 'County', 'Clerks']
new_names:
['docke_number', 'cour_nam_e_address', 'attorney_record', 'plaintiff_name', 'defendant_aboveentitled_matter', '*users1_signature', 'print_name', 'address__1', 'address__2', 'address__3', 'bbo', 'text__1', 'text__2', 'text__3', 'county', 'clerks']
I used simple regex heuristics and bucketed matching fields into the set of:
Let me know if we want to adjust any of those names.
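For anyone following along, the bucketing approach can be sketched roughly as below. This is an illustration only, not the code in the PR: the category names, the regex patterns, and the function name bucket_sensitive_fields are all assumptions, and the sample field names are made up. It checks both the original and the normalized name for each field, so a bad rewrite on one side can still be caught by the other.

```python
import re
from collections import defaultdict

# Hypothetical categories and patterns, for illustration only.
CATEGORY_PATTERNS = {
    "Social Security Number": re.compile(
        r"(^|[^a-z])(ssn|social[ _]*security)", re.IGNORECASE
    ),
    "Credit Card": re.compile(r"credit[ _]*card", re.IGNORECASE),
    "Bank Account": re.compile(r"bank[ _]*account|routing[ _]*number", re.IGNORECASE),
}


def bucket_sensitive_fields(original_names, normalized_names):
    """Return {category: sorted original field names} for fields where
    either the original or the normalized name matches a category pattern."""
    buckets = defaultdict(set)
    for original, normalized in zip(original_names, normalized_names):
        for category, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(original) or pattern.search(normalized):
                buckets[category].add(original)
    return {category: sorted(names) for category, names in buckets.items()}


# Made-up sample: the second field is only recognizable from its normalized name.
original = ["Credit Card No", "Text3", "DOCKET NUMBER"]
normalized = ["cid_0credit_card", "social_security_number", "docke_number"]

print(bucket_sensitive_fields(original, normalized))
```

Displaying the category as the heading keeps the suggestion readable even when the matching field name itself (like Text3 here) is not.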
Example of the current sensitive fields suggestions:
Source Form:
Adding this for future reference: https://www.mass.gov/doc/financial-statement-of-judgment-debtor/download
The PDF contains sensitive fields, but we are unable to detect them because the field recognition wasn't able to capture the information needed. Here's a dump of all the current field info:
#[Type: text, Name: division, tooltip: , X: 527, Y: 718, font_size: 7, Configs: {'fieldFlags': 'doNotScroll', 'width': 32.256000000000085, 'height': 7.775999999999954}, Type: text, Name: cid_2__cid_3_boston_municipal_court_cid_2__cid_3_district_court_cid_2__cid_3_housing_court, tooltip: , X: 489, Y: 706, font_size: 7, Configs: {'fieldFlags': 'doNotScroll', 'width': 31.96799999999996, 'height': 7.775999999999954}, Type: text, Name: page_0_field_2, tooltip: , X: 490, Y: 694, font_size: 7, Configs: {'fieldFlags': 'doNotScroll', 'width': 32.25600000000003, 'height': 7.775999999999954}]
I attached a screenshot of how the sensitive data types suggestion looks with the data type + field names under them. This will correspond to the latest code in the PR. Let me know if we want to make any tweaks!
Some fields on the PDF are going to raise concerns for the litigant. We should pull those out and highlight them in one of the accordions. For example: