SuffolkLITLab / RateMyPDF

RateMyPDF is a website that helps paper form authors (particularly for court forms) improve the usability of their forms for self-represented litigants. It uses the FormFyxer library to deliver its insights.
https://ratemypdf.com
MIT License

Don't rate text that isn't in English if the form is bilingual #28

Open colarusso opened 10 months ago

colarusso commented 10 months ago

Some users have forms where English is presented alongside translations to other languages. The presence of the other language, however, throws off the evaluation. As a workaround they edited their forms to block out the translations. However, it was suggested that if this could be done automatically, it would be a nice feature. Note: I can see a possible path to this using LLMs.

nonprofittechy commented 4 months ago

This one doesn't seem very plausible yet, but maybe? There are some Python libraries that detect language:

https://pypi.org/project/langdetect/

I guess we'd also need to "chunk" the form's text so we detect on a per-paragraph basis.
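The per-paragraph chunking idea can be sketched roughly as follows. This is only an illustration, not FormFyxer's code: `split_paragraphs` and `tag_paragraphs` are hypothetical names, splitting on blank lines is an assumption about how the extracted text is laid out, and the detector is passed in as a plain callable so any library can be plugged in.

```python
import re
from typing import Callable, List, Tuple


def split_paragraphs(text: str) -> List[str]:
    """Split extracted form text into paragraphs on blank lines (assumed layout)."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def tag_paragraphs(text: str, detect: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Return (language_code, paragraph) pairs, using 'und' when detection fails."""
    tagged = []
    for paragraph in split_paragraphs(text):
        try:
            lang = detect(paragraph)
        except Exception:
            # Detectors may raise on text without usable letters (digits, punctuation).
            lang = "und"
        tagged.append((lang, paragraph))
    return tagged
```

With langdetect installed, `from langdetect import detect` followed by `tag_paragraphs(form_text, detect)` would plug in the real detector.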

nonprofittechy commented 3 months ago

@KindBill it would be good to get Kind's help at least with setting up the UI for this feature

KindBill commented 3 months ago

Use Cases:

  1. English only forms
  2. English + non-English language(s) forms
  3. Non-English language forms only

Outcomes:

  1. Disclaimer on the results page that non-English languages were detected (e.g. "This form has these non-English languages included: "). The disclaimer should be broad enough to also handle forms that are 100% non-English.
  2. The "Percentage of difficult words" and "reading grade level" scores do not include non-English language sentences.
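A minimal sketch of the two outcomes above, assuming paragraphs have already been tagged with a language code. `summarize_languages` is a hypothetical helper, and the sentence appended for forms with no English text is invented wording, not something agreed on in this thread.

```python
from typing import Iterable, List, Tuple


def summarize_languages(tagged: Iterable[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (english_text, disclaimer) from (language_code, paragraph) pairs."""
    english: List[str] = []
    others = set()
    for lang, paragraph in tagged:
        if lang == "en":
            english.append(paragraph)
        else:
            others.add(lang)

    disclaimer = ""
    if others:
        disclaimer = (
            "This form has these non-English languages included: "
            + ", ".join(sorted(others))
        )
        if not english:
            # Assumed wording for the 100% non-English case.
            disclaimer += ". No English text was found to score."

    # Only the English paragraphs would be passed on to the readability scores.
    return "\n\n".join(english), disclaimer
```

The returned English text is what would feed the "Percentage of difficult words" and "reading grade level" calculations, while the disclaimer goes to the results page.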

References:

  1. Changes most likely will be done here https://github.com/SuffolkLITLab/FormFyxer/blob/fce844760001c8998a3d40765bf894cf4f7d50cd/formfyxer/lit_explorer.py#L1178

Notes:

  1. Language detections without LLMs: ( langid / Lingua / langdetect ); can start with langdetect and see if it works
  2. If we know which language sets we can remove, that'll help reduce processing time on langdetect. If we do so, should definitely note down which ones were removed.
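
As a rough illustration of note 2, candidate languages can be narrowed after detection by dropping codes we never expect and renormalizing the rest. This is a sketch under assumptions: `restrict_languages` is a hypothetical helper, and post-filtering like this does not by itself reduce processing time. Restricting the detector up front (langid, for example, offers `langid.set_languages([...])`) is the part that would save time.

```python
from typing import Dict, Iterable


def restrict_languages(scores: Dict[str, float], allowed: Iterable[str]) -> Dict[str, float]:
    """Keep only allowed language codes and renormalize their probabilities."""
    allowed_set = set(allowed)
    kept = {lang: prob for lang, prob in scores.items() if lang in allowed_set}
    total = sum(kept.values())
    if total == 0:
        # None of the expected languages were among the candidates.
        return {}
    return {lang: prob / total for lang, prob in kept.items()}
```

Whatever list of allowed (or removed) languages is used should be written down, as the note says.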

KindBill commented 3 months ago

We'll start on the original list first and see where we're at before looking at this one: https://github.com/SuffolkLITLab/RateMyPDF/issues?q=is%3Aissue+is%3Aopen+label%3Akind

codestronger commented 2 months ago

Here's a summary of our research into language detection:

The experimental branches are pushed up to RateMyPDF and FormFyxer. They should not be merged in their current condition, but they can be useful for refining the approach and gathering better data on how to configure paragraph chunking.

RateMyPDF: https://github.com/SuffolkLITLab/RateMyPDF/tree/do_not_merge_langdetect_experiment
FormFyxer: https://github.com/SuffolkLITLab/FormFyxer/tree/do_not_merge_langdetect_experiment

Some source docs that we tested:

ENV Configuration

export LANGUAGE_DETECTION_PRIMARY_LIBRARY=langdetect # langdetect, langid, lingua
export DEBUG_LANGUAGE_DETECTION_PRINT_ALL=FALSE
export USE_LANGUAGE_DETECTION=TRUE
export DEBUG_LANGUAGE_DETECTION=TRUE
export LANGUAGE_DETECTION_PARAGRAPH_MIN_LINES
export LANGUAGE_DETECTION_PARAGRAPH_MIN_CHARS
export LANGUAGE_DETECTION_THRESHOLD_PERCENTAGE=0.1
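
For illustration, settings like these could be read in Python along the following lines. This is a sketch, not the code in the experimental branches: `LangDetectConfig` and its helpers are hypothetical, the fallback values are assumptions, and the two paragraph minimums are left as `None` when unset because no values are given above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _flag(raw: Optional[str], default: bool = False) -> bool:
    """Parse TRUE/FALSE style values; unset or empty falls back to the default."""
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def _optional_int(raw: Optional[str]) -> Optional[int]:
    """Return None for unset or empty values instead of guessing a number."""
    return int(raw) if raw and raw.strip() else None


@dataclass
class LangDetectConfig:
    primary_library: str
    enabled: bool
    debug: bool
    debug_print_all: bool
    paragraph_min_lines: Optional[int]
    paragraph_min_chars: Optional[int]
    threshold_percentage: float

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "LangDetectConfig":
        # Defaults below are assumptions for this sketch.
        return cls(
            primary_library=env.get("LANGUAGE_DETECTION_PRIMARY_LIBRARY", "langdetect"),
            enabled=_flag(env.get("USE_LANGUAGE_DETECTION")),
            debug=_flag(env.get("DEBUG_LANGUAGE_DETECTION")),
            debug_print_all=_flag(env.get("DEBUG_LANGUAGE_DETECTION_PRINT_ALL")),
            paragraph_min_lines=_optional_int(env.get("LANGUAGE_DETECTION_PARAGRAPH_MIN_LINES")),
            paragraph_min_chars=_optional_int(env.get("LANGUAGE_DETECTION_PARAGRAPH_MIN_CHARS")),
            threshold_percentage=float(env.get("LANGUAGE_DETECTION_THRESHOLD_PERCENTAGE", "0.1")),
        )
```

Calling `LangDetectConfig.from_env()` with no argument reads the process environment; passing a dict makes it easy to try different settings in tests.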

Examples illustrating the various edge cases/issues/weirdness mentioned in the intro. Note that these were captured on an older iteration that only used langdetect. Current debug output will look different.

Example 1

===== Start Paragraph len: 78
GOVERNMENTAL AGENCY (under Fam. Code, §§ 17400 and 17406):

FOR COURT USE ONLY
===== End Paragraph
lang: de
confidences: [de:0.9999957743680956]

Example 2

===== Start Paragraph len: 30
Modify Order

Beginning Date
===== End Paragraph
lang: de
confidences: [de:0.9999935651105715]

Example 3

===== Start Paragraph len: 36

BRANCH NAME:

PETITIONER/PLAINTIFF:
===== End Paragraph
lang: vi
confidences: [vi:0.9999958307268626]

Bad Paragraph Break

Because our paragraph-splitting logic is too simple, a bad break lets Spanish sneak past.

===== Start Paragraph len: 256
pay or other property without further notice. See the attached statement of your rights and responsibilities for more information.

La agencia local que vigila la manutención de menores ha registrado la presente demanda contra usted. Esta demanda dice que
===== End Paragraph
lang: en
confidences: [en:0.9999962147279318]
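
One possible guard against a bad paragraph break like this is to re-check a paragraph sentence by sentence and flag it when the sentences disagree. The sketch below is an assumption-laden illustration: `sentence_languages` and `is_mixed` are hypothetical names, the sentence splitter is a naive regex, and, as the other examples show, short sentences are themselves unreliable inputs for detection.

```python
import re
from typing import Callable, Set


def sentence_languages(paragraph: str, detect: Callable[[str], str]) -> Set[str]:
    """Detect a language for each sentence (naive split on ., ! or ?)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    return {detect(sentence) for sentence in sentences}


def is_mixed(paragraph: str, detect: Callable[[str], str]) -> bool:
    """True when the sentences of one paragraph are not all the same language."""
    return len(sentence_languages(paragraph, detect)) > 1
```

A paragraph flagged as mixed could then be split further before filtering, rather than being kept or dropped as a whole.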

Chinese as English

===== Start Paragraph len: 60

(键入或打印姓名)

(本地子女抚养机构辩护律师)

FL-600 C [Rev. January 1, 2020]
===== End Paragraph
lang: en
confidences: [en:0.857138603438145, zh-cn:0.14285660702774805]
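
A case like this could be caught without a language detector at all by looking at the script of the letters. The sketch below uses only the standard library; `non_latin_letter_share` is a hypothetical helper, and any cutoff applied to its result would be a tuning assumption.

```python
import unicodedata


def non_latin_letter_share(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name is not LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = [
        ch for ch in letters
        if not unicodedata.name(ch, "").startswith("LATIN")
    ]
    return len(non_latin) / len(letters)
```

A paragraph where most letters are non-Latin could be excluded from English scoring regardless of what the detector returns. Accented Spanish letters still count as Latin, so this check does not help with Spanish.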

Spanish as Romanian

===== Start Paragraph len: 39
Imprimir formulario

Guardar formulario
===== End Paragraph
lang: ro
confidences: [ro:0.9999965586847656]

Spanish as German

===== Start Paragraph len: 30

Fecha:

FUNCIONARIO JUDICIAL
===== End Paragraph
lang: de
confidences: [de:0.8571398401729963, pt:0.1428571115230109]
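
Most of the misclassified paragraphs above are very short, so one option is to trust a detection only when the paragraph is long enough and the top candidate is confident enough. This is a sketch: `trusted_language` is a hypothetical helper and both default thresholds are assumptions that would need tuning against real forms. High confidence alone is clearly not sufficient, since several wrong answers above come back at 0.9999.

```python
from typing import List, Optional, Tuple


def trusted_language(
    paragraph: str,
    candidates: List[Tuple[str, float]],
    min_chars: int = 100,          # assumed cutoff, not from the thread
    min_confidence: float = 0.9,   # assumed cutoff, not from the thread
) -> Optional[str]:
    """Return the top language only for long, confidently detected paragraphs."""
    if len(paragraph.strip()) < min_chars or not candidates:
        return None
    lang, prob = max(candidates, key=lambda pair: pair[1])
    return lang if prob >= min_confidence else None
```

A `None` result means "undetermined"; the caller would decide whether undetermined paragraphs stay in the English scoring or not.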

Different Lang, Same Doc

Detection seems unstable. These paragraphs are from the same document but were classified differently for some reason. We also don't understand how the confidence can be 71% German while the returned language is English. You would expect the detect and detect_langs methods to use the same underlying logic, but apparently they don't. We use the detect response for filtering; the detect_langs output is purely informational.

===== Start Paragraph len: 37

RESPONDENT/DEFENDANT:

OTHER PARENT:
===== End Paragraph
lang: de
confidences: [de:0.5714273010344291, en:0.42857232144158824]

===== Start Paragraph len: 37

RESPONDENT/DEFENDANT:

OTHER PARENT:
===== End Paragraph
lang: en
confidences: [de:0.7142828187149544, en:0.2857171599270042]
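
The langdetect README describes its algorithm as non-deterministic on short or ambiguous text and suggests setting `DetectorFactory.seed` before the first detection. Separate calls to `detect` and `detect_langs` each run their own pass, which likely explains a label and a confidence list that disagree. A hedged sketch of one way around this is to seed the factory and take the label from the same `detect_langs` result that is logged; `label_and_confidences` is a hypothetical helper.

```python
from typing import Callable, List, Optional, Tuple

try:
    from langdetect import DetectorFactory, detect_langs
    DetectorFactory.seed = 0  # makes repeated runs return the same result
except ImportError:
    detect_langs = None  # langdetect not installed; a detector must be passed in


def label_and_confidences(
    text: str,
    detect_langs_fn: Optional[Callable] = None,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Derive the label and the confidences from a single detection pass."""
    fn = detect_langs_fn or detect_langs
    if fn is None:
        raise RuntimeError("langdetect is not installed and no detector was supplied")
    # langdetect returns objects with .lang and .prob attributes.
    candidates = [(c.lang, c.prob) for c in fn(text)]
    if not candidates:
        return None, []
    top = max(candidates, key=lambda pair: pair[1])[0]
    return top, candidates
```

Because the label is the top entry of the same list that gets printed, the debug output can no longer show one language as the result and another as the most confident candidate.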

codestronger commented 2 months ago

Example of how this might get displayed on the RateMyPDF website. It's mostly invisible, and the stats, suggestions, etc. are calculated from the English text only. A single suggestion will be added to the accordion to inform the user that we excluded the non-English text from those calculations.

[Image: langdetect_ui_example]