biolab / orange3-text

🍊 📄 Text Mining add-on for Orange3

Preprocess text, Corpus or new separate widget: provide tool to convert British English to American or vice versa #1078

Open wvdvegte opened 3 months ago

wvdvegte commented 3 months ago

Is your feature request related to a problem? Please describe.

When I'm working with a corpus that mixes documents in American and British English spelling, the two spellings of the same word (e.g., behavior and behaviour) can distort analyses such as clustering because they may be treated as different words. Stemming might help in some cases, but it is hard to tell when it works and when it doesn't. As an example, I had a case where, in Annotated Corpus Map, both "organize" and "organise" were identified as keywords within a cluster. It would be better if only one spelling were identified, as an even more significant keyword.

Describe the solution you'd like

It would be better to have an option that automatically normalizes all documents so that they are analyzed as if written in a single variety of English. I'm not sure whether this should be an option in Corpus (where the language is selected first), in Preprocess Text (although this widget may be skipped if Document Embedding is used, as suggested here), or a separate widget altogether. The conversion can be implemented easily with the code suggested here on Stack Overflow, which relies on a word list that is no longer available at its original location but can still be found in the Web archive here.

Describe alternatives you've considered

In the case I described above, I ended up with a quick fix: going back to the source data (which fortunately was already in a table, not in separate documents), replacing "organis" with "organiz" via find-and-replace, and reloading the data into Orange. But this is not a comprehensive solution to the problem.
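One reason the quick fix does not generalize: a plain substring replacement also rewrites unrelated words that merely contain "organis", such as "organism". The sketch below is only an illustration of that pitfall; the function names and the list of suffixes are hypothetical and not taken from any existing script.

```python
import re

def naive_replace(text: str) -> str:
    """Plain substring find-and-replace, as in the quick fix."""
    return text.replace("organis", "organiz")

def whole_word_replace(text: str) -> str:
    """Replace only selected inflections of "organise", matched as whole words.

    The suffix list is an illustrative assumption, not a complete inventory.
    """
    return re.sub(r"\borganis(e|es|ed|ing|ation|ations)\b", r"organiz\1", text)

sample = "We organise samples by organism."
print(naive_replace(sample))       # "organism" is rewritten as well
print(whole_word_replace(sample))  # "organism" is left alone
```

Matching on word boundaries avoids the accidental rewrite, but it still covers one word family at a time, which is why a full word list is needed for a general solution.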

wvdvegte commented 2 months ago

Addendum: the suggested code has an error in its function definition, and the dictionary is incomplete. There is, of course, another alternative to consider: writing a Python script to do the translation. The Python script in the attached workflow contains the corrected code and the complete dictionary. Nevertheless, it would be nice to have harmonization/harmonisation of English spelling as an easier-to-access option in the Text add-on. Also, this script works on text in a table; it cannot process a corpus (I have no idea how to address a corpus in Python).

UK-US conversion.ows.zip

(edit: code adapted to replace whole words only, and convert to lowercase first. Added remark about corpus as input)
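For readers who do not want to open the workflow, the approach described in the edit note (lowercase first, then replace whole words only) could look roughly like the sketch below. This is not the script from the attachment: the names `UK_TO_US` and `to_american` are hypothetical, and the four-entry mapping is a stand-in for the full word list.

```python
import re

# Hypothetical, deliberately tiny mapping; a real list has many more entries.
UK_TO_US = {
    "behaviour": "behavior",
    "colour": "color",
    "organise": "organize",
    "organised": "organized",
}

# One alternation over all British spellings, anchored on word boundaries
# so that only whole words are matched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, UK_TO_US)) + r")\b")

def to_american(text: str) -> str:
    """Lowercase the text, then replace whole-word British spellings."""
    return _PATTERN.sub(lambda m: UK_TO_US[m.group(1)], text.lower())

print(to_american("Organise the Behaviour study"))
```

Because the text is lowercased before matching, the mapping only needs lowercase keys, at the cost of losing the original capitalization. Words that are not in the mapping, such as "organisers" here, pass through unchanged.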