colarusso commented 12 months ago

Some users have forms where English is presented alongside translations to other languages. The presence of the other language, however, throws off the evaluation. As a workaround they edited their forms to block out the translations. However, it was suggested that if this could be done automatically, it would be a nice feature. Note: I can see a possible path to this using LLMs.

nonprofittechy commented 5 months ago

This one doesn't seem very plausible yet, but maybe? There are some Python libraries that detect language, for example langdetect: https://pypi.org/project/langdetect/

I guess we'd also need to "chunk" the form's text so we detect on a per-paragraph basis.
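A minimal sketch of what per-paragraph detection could look like, assuming paragraphs are separated by blank lines. The function name `label_paragraphs` is hypothetical, and the detector is passed in as a plain callable so any library can be plugged in (for langdetect, its `detect` function returns a language code such as "en").

```python
import re
from typing import Callable, List, Tuple


def label_paragraphs(text: str, detect: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Split text on blank lines and pair each paragraph with a language code.

    `detect` is any callable that maps a string to a language code.
    Assumption: blank lines are a good enough paragraph boundary.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(detect(p), p) for p in paragraphs]


# Possible wiring with langdetect (not run here):
#   from langdetect import detect
#   labelled = label_paragraphs(form_text, detect)
```

Passing the detector as an argument keeps the chunking logic testable without installing any detection library.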

nonprofittechy commented 5 months ago

@KindBill it would be good to get Kind's help at least with setting up the UI for this feature

KindBill commented 5 months ago

Use Cases:

English only forms
English + non-English language(s) forms
Non-English language forms only

Outcomes:

Disclaimer on the results page that non-English languages were detected (e.g. "This form has these non-English languages included: "); this disclaimer should also be broad enough to handle forms that are 100% non-English.
The "Percentage of difficult words" and "reading grade level" scores do not include non-English language sentences.
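One way the disclaimer outcome could be sketched, assuming each chunk of the form has already been labelled with a language code. The function name `language_disclaimer` and the wording for the 100% non-English case are hypothetical; only the first message comes from the outcome described above.

```python
from typing import Iterable


def language_disclaimer(detected_codes: Iterable[str], english: str = "en") -> str:
    """Build the results-page disclaimer from per-chunk language codes.

    Returns an empty string when only English was detected.
    """
    codes = list(detected_codes)
    others = sorted({code for code in codes if code != english})
    if not others:
        return ""
    listing = ", ".join(others)
    if english not in codes:
        # Hypothetical wording for forms with no English text at all.
        return f"This form does not appear to contain English text. Languages detected: {listing}"
    return f"This form has these non-English languages included: {listing}"
```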

References:

Changes most likely will be done here https://github.com/SuffolkLITLab/FormFyxer/blob/fce844760001c8998a3d40765bf894cf4f7d50cd/formfyxer/lit_explorer.py#L1178

Notes:

Language detection without LLMs: langid / Lingua / langdetect; can start with langdetect and see if it works
If we know which language sets we can remove, that'll help reduce processing time on langdetect. If we do so, should definitely note down which ones were removed.

KindBill commented 5 months ago

We'll start on the original list first and see where we're at before looking at this one: https://github.com/SuffolkLITLab/RateMyPDF/issues?q=is%3Aissue+is%3Aopen+label%3Akind

codestronger commented 4 months ago

Here's a summary of our research into language detection:

Pretty good on all non-Latin alphabet forms (e.g. Chinese)
Langdetect is poor on bilingual Spanish forms. A lot of English is misclassified as German or Vietnamese.
The following libraries were tested. Should be pretty easy to plug in others for testing.
- langdetect - https://pypi.org/project/langdetect/
- langid.py - https://github.com/saffsd/langid.py
- Lingua - https://github.com/pemistahl/lingua-py
Lingua had the best performance based on anecdotal testing
All the libraries had issues
Hard to find good universal config settings for chunking the text
Ended up w/ a simple paragrapher that has configurable minimum lines and minimum characters. Default is 3 and 30.
Some docs contain no non-English text, but false positives would still trigger the suggestion, so a minimum threshold % is useful. The default is 5%: the alternate (English-only) text is used only when at least 5% of the text is detected as non-English
PoC strips out non-English paragraphs and passes that through the stats calc
- Didn't affect the complexity or time to complete in many cases
- Suggestions improved for difficult words & gender neutral terms.
- Detailed stats were more accurate as the non-English parts were not considered.
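The chunking, threshold, and stripping steps above can be sketched roughly as follows. This is not the code from the experimental branches: the function names are hypothetical, the chunking rule (accumulate lines until both minimums are met) is only one reading of "configurable minimum lines and minimum characters", and measuring the non-English share by character count is an assumption. The `detect` argument stands in for whichever library is selected.

```python
from typing import Callable, List


def chunk_paragraphs(text: str, min_lines: int = 3, min_chars: int = 30) -> List[str]:
    """Group consecutive lines into chunks of at least min_lines lines and min_chars characters."""
    chunks: List[str] = []
    buffer: List[str] = []
    for line in text.splitlines():
        buffer.append(line)
        if len(buffer) >= min_lines and len("\n".join(buffer).strip()) >= min_chars:
            chunks.append("\n".join(buffer))
            buffer = []
    if "".join(buffer).strip():
        # Leftover lines are too short to stand alone; attach them to the last chunk.
        if chunks:
            chunks[-1] += "\n" + "\n".join(buffer)
        else:
            chunks.append("\n".join(buffer))
    return chunks


def english_only_text(
    text: str,
    detect: Callable[[str], str],
    threshold: float = 0.05,
    min_lines: int = 3,
    min_chars: int = 30,
) -> str:
    """Return text with non-English chunks removed, or text unchanged below the threshold."""
    chunks = chunk_paragraphs(text, min_lines, min_chars)
    total = sum(len(c) for c in chunks)
    if total == 0:
        return text
    english = [c for c in chunks if detect(c) == "en"]
    non_english_share = 1 - sum(len(c) for c in english) / total
    if non_english_share < threshold:
        # Likely false positives only: keep the original text for the stats.
        return text
    return "\n".join(english)
```

The returned string is what would be passed on to the stats calculation in place of the full form text.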

The experimental branches are pushed up to RateMyPDF and FormFyxer. They should not be merged in their current condition, but can be useful for refining the approach and gathering better data on how to configure paragraph chunking.
RateMyPDF: https://github.com/SuffolkLITLab/RateMyPDF/tree/do_not_merge_langdetect_experiment
FormFyxer: https://github.com/SuffolkLITLab/FormFyxer/tree/do_not_merge_langdetect_experiment

Some source docs that we tested:

ENV Configuration

export LANGUAGE_DETECTION_PRIMARY_LIBRARY=langdetect # langdetect, langid, lingua
export DEBUG_LANGUAGE_DETECTION_PRINT_ALL=FALSE
export USE_LANGUAGE_DETECTION=TRUE
export DEBUG_LANGUAGE_DETECTION=TRUE
export LANGUAGE_DETECTION_PARAGRAPH_MIN_LINES
export LANGUAGE_DETECTION_PARAGRAPH_MIN_CHARS
export LANGUAGE_DETECTION_THRESHOLD_PERCENTAGE=0.1
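A rough sketch of how these variables might be read on the Python side. The variable names come from the list above; everything else is an assumption: the `LanguageDetectionConfig` class and `load_config` function are hypothetical, the fallbacks of 3 lines, 30 characters, and 5% are taken from the defaults mentioned in the research summary, the threshold is treated as a fraction (0.1 meaning 10%), and detection is assumed to be off when `USE_LANGUAGE_DETECTION` is unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

SUPPORTED_LIBRARIES = {"langdetect", "langid", "lingua"}


@dataclass(frozen=True)
class LanguageDetectionConfig:
    enabled: bool
    library: str
    min_lines: int
    min_chars: int
    threshold: float


def _flag(value: str) -> bool:
    """Interpret TRUE/FALSE style environment values."""
    return value.strip().lower() in {"1", "true", "yes"}


def load_config(env: Optional[Mapping[str, str]] = None) -> LanguageDetectionConfig:
    """Read language detection settings, falling back to assumed defaults."""
    env = os.environ if env is None else env
    library = env.get("LANGUAGE_DETECTION_PRIMARY_LIBRARY", "langdetect").strip().lower()
    if library not in SUPPORTED_LIBRARIES:
        raise ValueError(f"Unsupported language detection library: {library}")
    return LanguageDetectionConfig(
        enabled=_flag(env.get("USE_LANGUAGE_DETECTION", "FALSE")),
        library=library,
        # `or` also covers variables that are exported but empty.
        min_lines=int(env.get("LANGUAGE_DETECTION_PARAGRAPH_MIN_LINES") or 3),
        min_chars=int(env.get("LANGUAGE_DETECTION_PARAGRAPH_MIN_CHARS") or 30),
        threshold=float(env.get("LANGUAGE_DETECTION_THRESHOLD_PERCENTAGE") or 0.05),
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes the loader easy to exercise with different settings while tuning the chunking parameters.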

Examples illustrating the various edge cases/issues/weirdness mentioned in the intro. Note that these were captured on an older iteration that only used langdetect. Current debug output will look different.

Example 1
===== Start Paragraph len: 78
GOVERNMENTAL AGENCY (under Fam. Code, §§ 17400 and 17406):

FOR COURT USE ONLY
===== End Paragraph lang: de confidences: [de:0.9999957743680956]

Example 2
===== Start Paragraph len: 30
Modify Order

Beginning Date
===== End Paragraph lang: de confidences: [de:0.9999935651105715]

Example 3
===== Start Paragraph len: 36

BRANCH NAME:

PETITIONER/PLAINTIFF:
===== End Paragraph lang: vi confidences: [vi:0.9999958307268626]

Bad paragraph break: our chunking logic is too simple, which lets Spanish sneak past.
===== Start Paragraph len: 256
pay or other property without further notice. See the attached statement of your rights and responsibilities for more information.

La agencia local que vigila la manutención de menores ha registrado la presente demanda contra usted. Esta demanda dice que
===== End Paragraph lang: en confidences: [en:0.9999962147279318]

Chinese as English
===== Start Paragraph len: 60

（键入或打印姓名）

（本地子女抚养机构辩护律师）

FL-600 C [Rev. January 1, 2020]
===== End Paragraph lang: en confidences: [en:0.857138603438145, zh-cn:0.14285660702774805]

Spanish as Romanian
===== Start Paragraph len: 39
Imprimir formulario

Guardar formulario
===== End Paragraph lang: ro confidences: [ro:0.9999965586847656]

Spanish as German
===== Start Paragraph len: 30

Fecha:

FUNCIONARIO JUDICIAL
===== End Paragraph lang: de confidences: [de:0.8571398401729963, pt:0.1428571115230109]

Different Lang, Same Doc
Detection seems unstable. These paragraphs are from the same document but are classified differently for some reason. We also don't understand how the confidence can be 71% German while the returned language is English. You would expect the detect and detect_langs methods to use the same underlying logic, but apparently they don't. We use the detect response for filtering; the detect_langs output is purely informational.

===== Start Paragraph len: 37

RESPONDENT/DEFENDANT:

OTHER PARENT:
===== End Paragraph lang: de confidences: [de:0.5714273010344291, en:0.42857232144158824]

===== Start Paragraph len: 37

RESPONDENT/DEFENDANT:

OTHER PARENT:
===== End Paragraph lang: en confidences: [de:0.7142828187149544, en:0.2857171599270042]
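A possible contributor to the instability: the langdetect README describes its algorithm as non-deterministic and recommends setting `DetectorFactory.seed` to get consistent results, so separate detect and detect_langs calls on the same text may disagree. Whether that fully explains the output above has not been verified here. One way to keep the label and the confidences consistent is to derive both from a single call, sketched below. The `classify` function is hypothetical, and the detector is passed in as a callable returning (language, probability) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float]


def classify(
    text: str, candidates_for: Callable[[str], Sequence[Candidate]]
) -> Tuple[str, List[Candidate]]:
    """Return the top language and the ranked candidates from one detector call."""
    ranked = sorted(candidates_for(text), key=lambda c: c[1], reverse=True)
    if not ranked:
        return "unknown", []
    return ranked[0][0], ranked


# Possible wiring with langdetect (not run here):
#   from langdetect import DetectorFactory, detect_langs
#   DetectorFactory.seed = 0  # fixed seed for repeatable results
#   lang, ranked = classify(
#       paragraph, lambda t: [(c.lang, c.prob) for c in detect_langs(t)]
#   )
```

With this shape, the language used for filtering is always the highest-probability entry in the same list that gets printed in the debug output.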

codestronger commented 4 months ago

Example of how this might be displayed on the RateMyPDF website. It's mostly invisible: the stats, suggestions, etc. are calculated from the English text only. A single suggestion will be added to the accordion to inform the user that we excluded the non-English text from those calculations.

SuffolkLITLab / RateMyPDF

Don't rate text that isn't in English if the form is bilingual #28
