Closed ghost closed 6 years ago
Description
Fix the regexes used in the machine learning container so that they match more reliably.
Scope of Work
Bugs found
1. https://github.com/Cyberjusticelab/JusticeAI/blob/759b92171d116c7a1b12b53f9f53dec96dc3d323/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L10 "de" needs to be included for months starting with a consonant so that the regex properly matches sentences such as https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L423
2. Please also investigate sentences like: https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L423 The verb phrase "n'a pas paye" is not included there although it is included in other places, which raises a red flag to me.
3. Here, "de" should not be included, as it should be taken care of by the DATE_REGEX: https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L431 Otherwise the regex will attempt to match "de d'aout".
4. On lines https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L7 and https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L8 the regex matches the plural forms of the nouns locateurs/locatrices/locataires, but the third-person plural verb conjugations are missing in many places throughout the file. Take line https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L427 as an example: "Les locataires n'ont pas paye le loyer d'octobre" will never be matched because that line statically contains only "n'a pas", which is third-person singular.
This mistake is often repeated throughout the regex file.
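Points 1, 3 and 4 above can be illustrated with a minimal sketch. The names `MONTHS_SKETCH`, `DATE_SKETCH`, `TENANT_SKETCH`, `NOT_PAID_SKETCH` and `matches_unpaid_rent` are hypothetical and are not taken from regex_lib.py; the real DATE_REGEX and TENANT_REGEX may be defined differently. The sketch assumes unaccented input text, as in the examples quoted in this issue.

```python
import re

# Hypothetical stand-ins; the real patterns in regex_lib.py may differ.
MONTHS_SKETCH = (
    r"(?:janvier|fevrier|mars|avril|mai|juin|juillet|aout|"
    r"septembre|octobre|novembre|decembre)"
)
# The date pattern owns its preposition: "d'" (before a vowel) or "de "
# (before a consonant), so callers must not prepend their own "de".
DATE_SKETCH = r"(?:d'|de )" + MONTHS_SKETCH
# Singular and plural noun.
TENANT_SKETCH = r"locataires?"
# Third-person singular ("n'a") and plural ("n'ont") of the negated verb.
NOT_PAID_SKETCH = r"n'(?:a|ont) pas paye"

UNPAID_RENT_SKETCH = re.compile(
    r"(?:le|la|les) " + TENANT_SKETCH + r" " + NOT_PAID_SKETCH
    + r" le loyer " + DATE_SKETCH,
    re.IGNORECASE,
)


def matches_unpaid_rent(sentence):
    """Return True if the sentence contains the unpaid-rent pattern."""
    return UNPAID_RENT_SKETCH.search(sentence) is not None
```

With this shape, `matches_unpaid_rent("Les locataires n'ont pas paye le loyer d'octobre")` succeeds, and so does the singular form with "de janvier". The sketch is deliberately permissive: it does not check that "d'" is only followed by a vowel, nor that the article agrees in number with the noun.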
https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L492 The feminine version "a la locatrice" is missing.
https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L539 "la TENANT_REGEX" is missing.
https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L544 "aux locateurs" is used instead of LANDLORD_REGEX with le/les/a la.
https://github.com/Cyberjusticelab/JusticeAI/blob/1f56afdb8c4786b723777649f1b74c004f1267a5/src/ml_service/feature_extraction/post_processing/regex/regex_lib.py#L592 "a preprendre possesion du lieux": the singular version of "lieux" (lieu) is missing. I understand that if "du" is used, they will most likely write "logement", but this is a "just in case" regex.
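The gender, number and article gaps listed above could be closed along these lines. This is only a sketch: `LANDLORD_NOUN_SKETCH`, `TO_LANDLORD_SKETCH` and `PREMISES_SKETCH` are hypothetical names, and the real LANDLORD_REGEX in regex_lib.py may already cover part of this.

```python
import re

# Hypothetical stand-in for the landlord noun (masculine/feminine,
# singular/plural); the real LANDLORD_REGEX may differ.
LANDLORD_NOUN_SKETCH = r"(?:locateurs?|locatrices?)"
# Indirect-object forms: "au locateur", "a la locatrice",
# "aux locateurs", "aux locatrices".
TO_LANDLORD_SKETCH = re.compile(r"(?:au|a la|aux) " + LANDLORD_NOUN_SKETCH)
# Optional "x" accepts both the singular "lieu" and the plural "lieux".
PREMISES_SKETCH = re.compile(r"(?:du|des) lieux?")
```

For example, `TO_LANDLORD_SKETCH.fullmatch("a la locatrice")` and `PREMISES_SKETCH.fullmatch("du lieu")` both return a match object. As written, the patterns do not enforce agreement between article and noun, so a form such as "au locatrice" would also be accepted.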
@mihaiqc can you please have another look at the regex_lib file to see if there are any other obvious mistakes?
Understood, I'll attempt to spot all possible mistakes.