codebuddies / backend

CodeBuddies back-end
https://codebuddies.org
GNU General Public License v3.0

[Tagging] Slugify Mangling Tags Written in Brahmic scripts / abugida writing systems (Vowels are being stripped) #123

Closed BethanyG closed 4 years ago

BethanyG commented 4 years ago

The expectation of the slugify(allow_unicode=True) call in the tagging/models.py CustomTag() class is to pass any tag written in Unicode characters through mostly untouched, except for:

  1. removal of leading and trailing spaces
  2. replacing spaces between words with the - character
  3. lower-casing the words
  4. ~making sure the resulting tag is "unique"~ (this is actually the job of the taggit manager code)

However, when tags are written in a Brahmic / abugida writing system (examples include Hindi, Telugu, Thai, Malayalam, Tamil, Kannada, and more) this code is mangling the result by removing the diacritical marks and vowels. Going off the Google translations here, as I am not a speaker of any of the example languages.

The slug of "हिंदी में जानकारी" ("Information in Hindi") is being returned as "-जनकर" which isn't a word. Attempting to then slugify "हिंदी-में-जानकारी" ("information-in-Hindi"), I get back "हद-म-जनकर" ("half-dead").

A similar thing seems to be happening with the Telugu language - "స్వయంచాలక" ("automated") becomes "సవయచలక" -- which isn't a word.

Additional examples:

Kannada: "ಡೇಟಾಬೇಸ್ ನಿರ್ವಹಣಾ ವ್ಯವಸ್ಥೆ" becomes "ಡಟಬಸ-ನರವಹಣ-ವಯವಸಥ"

Malayalam: "ഡാറ്റാബേസ് മാനേജുമെന്റ് സിസ്റ്റം" becomes "ഡററബസ-മനജമനറ-സസററ"

Thai: "ฐานข้อมูล" becomes "ฐานขอมล"

Burmese: "ဒေတာဘေ့စစီမံခန့်ခွဲမှုစနစ်" becomes "ဒတဘစစမခနခမစနစ"

The real kicker here is that none of these languages have a "lower case" vs "upper case" distinction at all.

However, slugifying the Hebrew "מערכת ניהול מסדי נתונים" ("database management system") results in the expected "מערכת-ניהול-מסדי-נתונים".

And slugifying the Arabic "قاعدة البيانات" ("Database") results in "قاعدة-البيانات" ("Database").

Tests with traditional & simplified Chinese characters, Korean, and multiple Japanese variants are also fine, as is Persian.
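One likely explanation for the pattern above, sketched under an assumption: as far as I can tell, Django's slugify() drops every character that does not match the regex class [\w\s-], and Python's \w does not match Unicode combining marks (categories Mn and Mc). Vowel signs in the Brahmic scripts are combining marks, while the Hebrew, Arabic, and CJK samples here contain none. The helper name dropped_characters is made up for illustration; it only reports what such a filter would remove.

```python
import re
import unicodedata

# Assumption: this mirrors the character filter slugify() is understood to apply.
KEEP = re.compile(r"[\w\s-]")


def dropped_characters(text):
    """Return (char, Unicode category) for each character the filter would remove."""
    text = unicodedata.normalize("NFKC", text)
    return [(ch, unicodedata.category(ch)) for ch in text if not KEEP.match(ch)]


# Print the code point and category of every character that would be lost.
for sample in ("हिंदी", "ฐานข้อมูล", "מערכת"):
    lost = [(f"U+{ord(ch):04X}", cat) for ch, cat in dropped_characters(sample)]
    print(sample, lost)
```

If the assumption holds, every dropped character in the Hindi and Thai samples should report a category starting with "M", and the Hebrew sample should lose nothing.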

We may need to combine Unicode slugification with transliteration, bypass slugify for certain languages, or scrap slugification altogether. Very open to suggestions or discussions on this.

chris48s commented 4 years ago

> 1. removal of leading and trailing spaces
> 2. replacing spaces between words with the - character
> 3. lower-casing the words
> 4. making sure the resulting tag is "unique"

The other key property of a slug (perhaps more important than any of those) is it should contain only characters which are URL-safe or have no special meaning in the context of a URL.

> or scrap slugification altogether

Tbh, I wonder if we're fixating on a feature that may not be that important in the context of this project (see my question in https://github.com/codebuddies/backend/pull/121#discussion_r400403287). Understanding more about what you want to use the slug for will probably help inform the solution, or possibly the lack of need for one.

BethanyG commented 4 years ago

So. Unlike a blog or an article DB, I don't think we're going to have a whole lot of need to have a tag >> slug >> URI for every tag.

This was more a way for us to cut down on a few of the duplicate/overlapping tags we know will be coming our way, and have a "standard" way of referring to/composing names for tags.

To be brutally honest, I was trying to use/repurpose already existing "uniquification" code in taggit, hit some horrid snags around unicode, and then went down a rabbit hole of different character sets. Now here we are.

I'm tempted at this point to write a RegEx (which tells you all you need to know about my state of mind 😱 ) that will just make the transforms to what is stated above. Or not have a slug at all ... except that I do want some sanitizing and uniquification done. So maybe we focus there.
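For the "sanitizing and uniquification" part on its own, here is a minimal sketch that needs no slug at all. It assumes duplicates are tags that differ only in case, Unicode compatibility form, or whitespace; tag_key and dedupe_tags are hypothetical names, not anything from taggit or this repo.

```python
import unicodedata


def tag_key(tag):
    """Build a comparison key: NFKC-normalize, collapse whitespace, casefold."""
    normalized = unicodedata.normalize("NFKC", tag)
    return " ".join(normalized.split()).casefold()


def dedupe_tags(tags):
    """Keep the first spelling seen for each comparison key, in input order."""
    seen = {}
    for tag in tags:
        seen.setdefault(tag_key(tag), tag.strip())
    return list(seen.values())


print(dedupe_tags(["Python", " python ", "PYTHON", "Django"]))
```

Because the key never filters characters, combining marks survive untouched; the trade-off is that punctuation is not sanitized, so that would still need its own rule.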

If we decided that having tag >> slug >> URI was actually needed, we could use iri_to_uri() to make sure that the URIs/URLs produced are "safe". We probably should do that as a matter of course, since we are accepting characters (Korean, Cyrillic, Greek, etc.) that are valid in an IRI but not in a URI.
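To show what that conversion amounts to, here is a standard-library sketch that percent-encodes the UTF-8 bytes of any non-ASCII character. It is an approximation, not Django's code: the safe-character set below is my assumption about what iri_to_uri() leaves alone, and iri_to_uri_sketch is a made-up name. In the project itself, django.utils.encoding.iri_to_uri would be the thing to call.

```python
from urllib.parse import quote

# Assumption: reserved characters that should pass through unescaped.
SAFE = "/#%[]=:;$&()+,!?*@'~"


def iri_to_uri_sketch(iri):
    """Percent-encode non-ASCII characters as UTF-8, leaving URL syntax intact."""
    return quote(iri, safe=SAFE)


print(iri_to_uri_sketch("/tags/café"))
```

Slugs that keep their combining marks encode the same way as any other non-ASCII text, so preserving the vowels does not make the resulting URL any less safe.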

BethanyG commented 4 years ago

Whelp. I did what I said I wasn't going to do, and wrote a regex to identify and exempt the abugidas from Django's slugify(). See attached PR for details.