Closed. cverluise closed this issue 3 years ago.
Implemented in patcit@nightly using pycld2 (based on CLD2, which is itself derived from the Chromium Compact Language Detector project).
Note for dev: I chose CLD2 rather than CLD3 because CLD2 guarantees text preprocessing (URL cleaning, etc.) while CLD3 does not, which can cause strange errors.
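For reference, a minimal sketch of what per-citation detection with pycld2 can look like. This is not the patcit code: the helper names `top_language_code` and `detect_language_code` are made up for illustration, and the only library call used is `pycld2.detect`, which returns a reliability flag, the number of text bytes found, and a tuple of guesses.

```python
from typing import Sequence, Tuple

# One CLD2 guess: (language name, language code, percent of text, score).
Guess = Tuple[str, str, int, float]


def top_language_code(details: Sequence[Guess]) -> str:
    """Return the code of the first (most likely) guess, e.g. "en" or "un"."""
    return details[0][1]


def detect_language_code(text: str) -> str:
    """Detect the language of `text` with pycld2 (must be installed)."""
    import pycld2 as cld2  # lazy import: only needed when detection runs

    _is_reliable, _bytes_found, details = cld2.detect(text)
    return top_language_code(details)
```

When CLD2 cannot decide, the first guess carries the name `Unknown` and the code `un`, so callers can treat that code as its own category.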
Unknown seems to be mainly very short NPL, in particular bibliographical references with many abbreviations, so they should be kept.
Addressed in v03 🎉.
The npl_cat classifier was trained on examples in English (and unknown) only. An npl_cat_flag bool was added in v03.
npl_cat_flag
:
npl_cat_flag=True
.
Closing this issue, feel free to reopen.
Due to the limited abilities of the labelers (including me), the classification model was trained only on examples in English (and some other Latin-based languages). Hence, citations written in non-Latin scripts mess up the classification out of sample.
Proposal
Add a language detection pipeline (e.g. spaCy-langdetect or spaCy-cld), then either exclude non-English citations or put them in a specific subset.
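A rough sketch of the "specific subset" option, assuming some detector that maps a citation to a language code. The names `split_citations` and `KEPT_CODES` are hypothetical, and the detector is passed in as a plain callable so any backend can be plugged in. Keeping `un` alongside `en` is an assumption of this sketch (a choice not to drop citations the detector cannot place), not something the proposal states.

```python
from typing import Callable, Iterable, List, Tuple

# Codes sent to the classifier: English plus "un" (language not detected).
KEPT_CODES = frozenset({"en", "un"})


def split_citations(
    citations: Iterable[str],
    detect: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Split citations into (kept, other) according to the detected code."""
    kept: List[str] = []
    other: List[str] = []
    for citation in citations:
        # Route each citation by its detected language code.
        if detect(citation) in KEPT_CODES:
            kept.append(citation)
        else:
            other.append(citation)
    return kept, other
```

The `other` list is the non-English subset: it can be dropped, or kept aside for a model trained on those languages later.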