Open colarusso opened 12 months ago
This one doesn't seem very plausible yet, but maybe? There are some Python libraries that detect language:
I guess we'd also need to "chunk" the form's text so we detect on a per-paragraph basis.
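A minimal sketch of what per-paragraph chunking could look like, assuming paragraphs are separated by blank lines. The names `chunk_paragraphs`, `detect_per_paragraph`, and the `min_chars` cutoff are hypothetical and not taken from FormFyxer; `detect` stands in for whichever library call ends up being used.

```python
import re
from typing import Callable, List, Optional, Tuple


def chunk_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def detect_per_paragraph(
    text: str,
    detect: Callable[[str], str],
    min_chars: int = 40,  # assumed cutoff; would need tuning on real forms
) -> List[Tuple[str, Optional[str]]]:
    """Run detect() on each paragraph; very short chunks are left undetected."""
    results = []
    for chunk in chunk_paragraphs(text):
        # Short fragments carry too little signal, so report None instead of guessing.
        lang = detect(chunk) if len(chunk) >= min_chars else None
        results.append((chunk, lang))
    return results
```

Passing the detector in as a callable keeps the chunking logic independent of any one library, so different detectors can be swapped in and compared on the same paragraphs.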
@KindBill it would be good to get Kind's help at least with setting up the UI for this feature
Use Cases:
Outcomes:
References:
Notes:
We'll start on the original list first and see where we're at before looking at this one: https://github.com/SuffolkLITLab/RateMyPDF/issues?q=is%3Aissue+is%3Aopen+label%3Akind
Here's a summary of our research into language detection:
The experimental branches are pushed up to RateMyPDF and FormFyxer. They should not be merged in their current condition, but they can be useful for refining the approach and gathering better data on how to configure paragraph chunking.
RateMyPDF: https://github.com/SuffolkLITLab/RateMyPDF/tree/do_not_merge_langdetect_experiment
FormFyxer: https://github.com/SuffolkLITLab/FormFyxer/tree/do_not_merge_langdetect_experiment
Some source docs that we tested:
export LANGUAGE_DETECTION_PRIMARY_LIBRARY=langdetect # langdetect, langid, lingua
export DEBUG_LANGUAGE_DETECTION_PRINT_ALL=FALSE
export USE_LANGUAGE_DETECTION=TRUE
export DEBUG_LANGUAGE_DETECTION=TRUE
export LANGUAGE_DETECTION_PARAGRAPH_MIN_LINES
export LANGUAGE_DETECTION_PARAGRAPH_MIN_CHARS
export LANGUAGE_DETECTION_THRESHOLD_PERCENTAGE=0.1
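For illustration, the variables above could be read into a single settings object along these lines. This is only a sketch: `LangDetectConfig` and `load_config` are hypothetical names, the defaults are assumptions, and the two paragraph minimums stay `None` when unset because no values are given for them here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

SUPPORTED_LIBRARIES = ("langdetect", "langid", "lingua")


def _as_bool(value: Optional[str], default: bool = False) -> bool:
    """Interpret TRUE/FALSE-style strings; fall back to default when unset."""
    if value is None or not value.strip():
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _as_int(value: Optional[str]) -> Optional[int]:
    """Return an int, or None when the variable is unset or empty."""
    if value is None or not value.strip():
        return None
    return int(value)


@dataclass(frozen=True)
class LangDetectConfig:  # hypothetical container, not part of either repo
    enabled: bool
    primary_library: str
    debug: bool
    debug_print_all: bool
    paragraph_min_lines: Optional[int]
    paragraph_min_chars: Optional[int]
    threshold: float


def load_config(env: Mapping[str, str] = os.environ) -> LangDetectConfig:
    """Build the config from an environment-like mapping."""
    library = env.get("LANGUAGE_DETECTION_PRIMARY_LIBRARY", "langdetect").strip().lower()
    if library not in SUPPORTED_LIBRARIES:
        raise ValueError(f"unsupported language detection library: {library!r}")
    return LangDetectConfig(
        enabled=_as_bool(env.get("USE_LANGUAGE_DETECTION")),  # assumed off by default
        primary_library=library,
        debug=_as_bool(env.get("DEBUG_LANGUAGE_DETECTION")),
        debug_print_all=_as_bool(env.get("DEBUG_LANGUAGE_DETECTION_PRINT_ALL")),
        paragraph_min_lines=_as_int(env.get("LANGUAGE_DETECTION_PARAGRAPH_MIN_LINES")),
        paragraph_min_chars=_as_int(env.get("LANGUAGE_DETECTION_PARAGRAPH_MIN_CHARS")),
        threshold=float(env.get("LANGUAGE_DETECTION_THRESHOLD_PERCENTAGE", "0.1")),
    )
```

Accepting a mapping rather than reading `os.environ` directly makes it easy to try different settings in a test without touching the real environment.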
Examples illustrating the various edge cases/issues/weirdness mentioned in the intro. Note that these were captured on an older iteration that only used langdetect. Current debug output will look different.
Example 1
===== Start Paragraph len: 78
GOVERNMENTAL AGENCY (under Fam. Code, §§ 17400 and 17406):
FOR COURT USE ONLY
===== End Paragraph lang: de confidences: [de:0.9999957743680956]
Example 2
===== Start Paragraph len: 30
Modify Order
Beginning Date
===== End Paragraph lang: de confidences: [de:0.9999935651105715]
Example 3
===== Start Paragraph len: 36
BRANCH NAME:
PETITIONER/PLAINTIFF:
===== End Paragraph lang: vi confidences: [vi:0.9999958307268626]
Bad Paragraph Break
Our paragraph-splitting logic is too simple, which lets Spanish sneak past.
===== Start Paragraph len: 256
pay or other property without further notice. See the attached statement of your rights and responsibilities for more information.
La agencia local que vigila la manutención de menores ha registrado la presente demanda contra usted. Esta demanda dice que
===== End Paragraph lang: en confidences: [en:0.9999962147279318]
Chinese as English
===== Start Paragraph len: 60
(键入或打印姓名)
(本地子女抚养机构辩护律师)
FL-600 C [Rev. January 1, 2020]
===== End Paragraph lang: en confidences: [en:0.857138603438145, zh-cn:0.14285660702774805]
Spanish as Romanian
===== Start Paragraph len: 39
Imprimir formulario
Guardar formulario
===== End Paragraph lang: ro confidences: [ro:0.9999965586847656]
Spanish as German
===== Start Paragraph len: 30
Fecha:
FUNCIONARIO JUDICIAL
===== End Paragraph lang: de confidences: [de:0.8571398401729963, pt:0.1428571115230109]
Different Lang, Same Doc
Detection seems unstable. The two paragraphs below are from the same document but were classified differently for no obvious reason. It is also unclear how the confidence can be 71% German while the returned language is English. You would expect the detect and detect_langs methods to use the same underlying logic, but apparently they do not. We use the detect response for filtering; the detect_langs output is purely informational.
===== Start Paragraph len: 37
RESPONDENT/DEFENDANT:
OTHER PARENT:
===== End Paragraph lang: de confidences: [de:0.5714273010344291, en:0.42857232144158824]
===== Start Paragraph len: 37
RESPONDENT/DEFENDANT:
OTHER PARENT:
===== End Paragraph lang: en confidences: [de:0.7142828187149544, en:0.2857171599270042]
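One possible way to avoid the label and the confidences disagreeing is to derive the label from the confidence list itself and to return no answer below a cutoff. The sketch below assumes the confidences are available as `(language, probability)` pairs; `pick_language` and `min_confidence` are hypothetical names. Separately, the langdetect README describes its algorithm as non-deterministic on short or ambiguous text and suggests setting `DetectorFactory.seed` before the first detection, which may explain the instability seen here.

```python
from typing import List, Optional, Tuple


def pick_language(
    confidences: List[Tuple[str, float]],
    min_confidence: float = 0.9,  # assumed cutoff, not a value from the experiment
) -> Optional[str]:
    """Return the most probable language, or None if nothing clears the cutoff."""
    if not confidences:
        return None
    lang, prob = max(confidences, key=lambda pair: pair[1])
    return lang if prob >= min_confidence else None


# Confidences copied from the second debug paragraph above.
ambiguous = [("de", 0.7142828187149544), ("en", 0.2857171599270042)]
print(pick_language(ambiguous))
```

A cutoff like this only filters out the low-confidence cases. Short all-caps fragments that are reported as German with near-certainty would still get through, so a minimum paragraph length is probably needed as well.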
Example of how this might be displayed on the RateMyPDF website. The feature is mostly invisible: the stats, suggestions, etc. are calculated from the English text only. A single suggestion is added to the accordion to inform the user that the non-English text was excluded from those calculations.
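The filtering step described above might be sketched as follows, assuming each paragraph has already been labeled with a language code (or `None` when detection was skipped). The function name `english_only` and the notice wording are hypothetical, and keeping undetected paragraphs is a design assumption rather than the behavior of the experimental branch.

```python
from typing import List, Optional, Tuple


def english_only(
    paragraphs: List[Tuple[str, Optional[str]]],
) -> Tuple[str, Optional[str]]:
    """Return the text to score plus an optional notice for the suggestions list."""
    kept: List[str] = []
    excluded = 0
    for text, lang in paragraphs:
        # Keep English and undetected paragraphs; drop anything labeled otherwise.
        if lang is None or lang == "en":
            kept.append(text)
        else:
            excluded += 1
    notice = None
    if excluded:
        notice = (
            f"{excluded} non-English paragraph(s) were excluded "
            "from the statistics and suggestions."
        )
    return "\n\n".join(kept), notice
```

Returning the notice alongside the filtered text lets the caller decide where to show it, for example as one extra item in the accordion.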
Some users have forms where English is presented alongside translations to other languages. The presence of the other language, however, throws off the evaluation. As a workaround they edited their forms to block out the translations. However, it was suggested that if this could be done automatically, it would be a nice feature. Note: I can see a possible path to this using LLMs.