Error with Chinese title

renyuneyun commented 1 year ago

Basically, if you enter any Chinese title, it will refuse to save. For example, put 土豆 (potato) there and click save. An error will show up:

Something went wrong, but it's not your fault. Try again!

Opening details gives:

Failed getting parent directory from 'solid://recipes/'

... (lots of seemingly stack trace text)

But if the title starts with something else, it will not be a problem. Also, changing the title of an existing recipe to Chinese saves successfully. Chinese in other fields doesn't seem to be a problem, at least for now.

I assume this is due to some mechanism related to determining the ID of the recipe node in RDF?

NoelDeMartin commented 1 year ago

Hey, thanks for opening an issue!

I see. I think the problem is that I have a function to slugify recipe names, and the implementation was very naive: I didn't consider languages that don't use the Latin alphabet. I'm sorry about that >.<.

So for example, if you have "Mojo Picón" as a recipe name, the url would be solid://recipes/mojo-picon. But the implementation just replaces accents and removes any non-word character.
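To make the failure mode concrete, here is a minimal sketch of that kind of slugify. It is an assumption about the general shape, not the actual Umai code, and `naiveSlugify` is a made-up name. Since `\w` only matches ASCII letters, digits and `_`, a purely Chinese title ends up as an empty slug, which would be consistent with the `solid://recipes/` url in the error above.

```typescript
// Hypothetical naive slugify, for illustration only (not the real implementation).
function naiveSlugify(text: string): string {
    return text
        .normalize('NFD')                 // split "ó" into "o" + a combining accent
        .replace(/[\u0300-\u036f]/g, '')  // drop the combining accents
        .toLowerCase()
        .replace(/[^\w\s-]/g, '')         // drop anything that is not [A-Za-z0-9_], whitespace or "-"
        .trim()
        .replace(/\s+/g, '-');            // whitespace runs become dashes
}

console.log(naiveSlugify('Mojo Picón')); // "mojo-picon"
console.log(naiveSlugify('土豆'));        // "" (nothing survives)
```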

In this case, what do you think the slug should be like, how do Chinese apps usually do this? Should the url just be solid://recipes/土豆? And is there also some punctuation that could be removed in other scenarios? I'll have to investigate how to handle this in a better way, any tips would be appreciated :).

renyuneyun commented 1 year ago

Thanks for the explanation. That matches my hypothesis.

Chinese has nothing like accents. It does have punctuation, though, such as the comma (，), the period (。), etc.

Does the URL really matter? I mean, for example, if I changed the title of a recipe, will the URL also change to the new title?

If it won't change, I would expect most people don't care -- as long as it's a URL that works, that's fine.

Speaking of general practice, I can't think of any Chinese apps using RDF. But if considering anything that needs to be mapped, two main practices exist:

Keep the original characters, and encode using (e.g.) UTF-8 when ASCII is required;
Convert the characters to their Pinyin (without tones).

(Both methods have their problems. I'll briefly illustrate later.)

In general, there is always a tradeoff when mapping to Latin characters. Most websites don't bother to deal with this. Of those that do (that I'm aware of), most use the first practice.

But I would say that for most people this is not a big issue, because they don't actually read the URL (partly because they already know that URLs are often incomprehensible when they contain Chinese, see the example below). Therefore, using a random identifier is also completely acceptable.

Illustration of problem for both mapping practices

Use character itself

Wikipedia uses the original characters. For example, https://zh.wikipedia.org/wiki/土豆 is the page for potato. However, not everything accepts the characters directly, so they will often be converted to their percent-encoded UTF-8 form. For example, if you copy the URL directly from the address bar (which shows the original characters in modern browsers) and paste it into this text field, it will be https://zh.wikipedia.org/wiki/%E5%9C%9F%E8%B1%86 . Quite ugly...
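For reference, this conversion can be reproduced with the standard `encodeURIComponent` / `encodeURI` functions. A small sketch (`percentEncodeTitle` is just an illustrative name):

```typescript
// Percent-encode a title the way a browser would when it appears in a URL path.
function percentEncodeTitle(title: string): string {
    return encodeURIComponent(title); // UTF-8 bytes, each written as %XX
}

console.log(percentEncodeTitle('土豆'));
// "%E5%9C%9F%E8%B1%86"

console.log(encodeURI('https://zh.wikipedia.org/wiki/土豆'));
// "https://zh.wikipedia.org/wiki/%E5%9C%9F%E8%B1%86"

console.log(decodeURIComponent('%E5%9C%9F%E8%B1%86'));
// "土豆" (the encoding is reversible)
```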

Use Pinyin

Pelican (a static site generator) converts the original characters to their Pinyin when forming the slug of the page. However, Chinese has many homophones (characters that share the same pronunciation), which therefore have the same Pinyin. And removing tones makes them even harder to distinguish. For example, still consider 土豆:

土 (earth) and 吐 (vomit/spit) both read as tǔ. With the tone removed, it becomes tu, which can collide with more characters, e.g. 凸 (tū; convex);
豆 (bean) and 窦 (antrum) both read as dòu. With the tone removed, it becomes dou, which can collide with others like 抖 (dǒu; shake).

In general, people will be able to understand a sentence written purely in Pinyin (even with tones removed), because such collisions are usually resolved by (trying to) understand the sentence (though not always, even when speaking with others). But reading Pinyin takes more time than reading the characters directly. And don't forget that Pinyin is designed for Mandarin, the official tongue/dialect of Chinese. People with other dialectal backgrounds (e.g. Cantonese) sometimes ask why it should be Pinyin for Mandarin...

NoelDeMartin commented 1 year ago

I see, thanks for all the explanation, that's all very useful and interesting :).

Does the URL really matter? I mean, for example, if I changed the title of a recipe, will the URL also change to the new title?

At the moment, the url is minted the first time a recipe is created and it will never change. It doesn't really matter, but I liked the idea to have "pretty urls" when possible. So for example, if you create two recipes with the same name, the second one will have something appended to the end, like:

recipes/pizza
recipes/pizza-7cc154f1-849a-4608-8072-d296762741e6
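The minting behaviour described above could be sketched roughly like this. All names (`mintUrl`, `existingUrls`, `makeSuffix`) are hypothetical and this is not the actual implementation. The suffix generator is passed in so that the example stays deterministic; in practice it would be a UUID generator.

```typescript
// Hypothetical sketch: use the pretty url when it is free, otherwise append a unique suffix.
function mintUrl(slug: string, existingUrls: Set<string>, makeSuffix: () => string): string {
    const base = `recipes/${slug}`;
    const url = existingUrls.has(base) ? `${base}-${makeSuffix()}` : base;

    existingUrls.add(url); // the url is minted once and never changes afterwards
    return url;
}

const takenUrls = new Set<string>();
const fixedSuffix = () => '7cc154f1-849a-4608-8072-d296762741e6'; // stand-in for a real UUID

console.log(mintUrl('pizza', takenUrls, fixedSuffix)); // "recipes/pizza"
console.log(mintUrl('pizza', takenUrls, fixedSuffix)); // "recipes/pizza-7cc154f1-849a-4608-8072-d296762741e6"
```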

But I was not handling the use-case of having an empty recipe name (because the field is required in the form). So it doesn't only break with Chinese characters, it also fails if you try to create a recipe whose name consists only of punctuation, for example.

Illustration of problem for both mapping practices

Thanks for illustrating both scenarios. I think I like the first one better, but the problem I have is that I'm not sure how to remove non-letter characters. In PHP and other languages you can do /\p{L}/u, but it seems like that isn't so easy in JavaScript.
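For what it's worth, JavaScript engines that support ES2018 or later accept Unicode property escapes such as `\p{L}` when the regex has the `u` flag, so the first option can be sketched as below. This is only an illustration under that assumption, and `unicodeSlugify` is a made-up name. It keeps letters and digits from any script and turns everything else into dashes.

```typescript
// Sketch: keep letters (\p{L}) and digits (\p{N}) from any script. Requires ES2018+.
function unicodeSlugify(text: string): string {
    return text
        .normalize('NFC')                  // make sure accented letters are single code points
        .toLowerCase()
        .replace(/[^\p{L}\p{N}]+/gu, '-')  // every run of other characters becomes one dash
        .replace(/^-+|-+$/g, '');          // trim dashes at both ends
}

console.log(unicodeSlugify('土豆'));         // "土豆"
console.log(unicodeSlugify('番茄，土豆。'));  // "番茄-土豆"
console.log(unicodeSlugify('Mojo Picón!'));  // "mojo-picón"
```

Note that this version keeps accents; whether to strip them is a separate decision.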

The second option is what first came to mind, I found this concept called transliteration and maybe that's the equivalent to translating to Pinyin?

Regardless of which option I use for slugs, I think I'll need to use transliteration in some situations anyways. For example, I have this same problem for search because I do some text normalization before matching strings so that you can search without writing accents.

renyuneyun commented 1 year ago

I see. I also like the idea of "pretty URLs", but yes, it will be hard when considering different factors.

Anyway, I have no real issues with the first option.

non-letter characters

I know there is simple regex to select CJK (Chinese/Japanese/Korean) characters, e.g. https://stackoverflow.com/questions/2718196/find-all-chinese-text-in-a-string-using-python-and-regex . But maybe you also need other ones (e.g. Arabic?)
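In JavaScript (ES2018 or later), something similar can be written with a script property escape instead of hard-coded code point ranges. A minimal sketch, with `containsHan` as an illustrative name:

```typescript
// Sketch: detect Han characters (used by Chinese, and by Japanese as kanji).
const HAN_PATTERN = /\p{Script=Han}/u;

function containsHan(text: string): boolean {
    return HAN_PATTERN.test(text);
}

console.log(containsHan('土豆'));    // true
console.log(containsHan('potato'));  // false
console.log(containsHan('うまい'));   // false (hiragana is a different script)
```

This identifies the script, not the language, so it cannot tell whether a given character is meant as Chinese or as Japanese kanji. Other scripts would need their own names (e.g. `Script=Arabic`).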

transliteration and maybe that's the equivalent to translating to Pinyin?

Indeed. Based on their examples, this seems to be true.

And speaking of CJK, that also leads to potential issues with Japanese. You see, Japanese (writing) is a combination of:

Kanji: Japanese variant of Chinese characters; only small difference;
Kana: the phonetic representation, if I understand correctly.

So when encountering characters that exist in both regular Chinese and Kanji, how would the transliteration work? Japanese speakers pronounce them differently from Chinese speakers (though there are often historical relations...). So will they be converted to the Mandarin Pinyin or to their Japanese pronunciation (Romaji?)?

Maybe your Soukai library is an example of this. I believe you found the word "爽快" from Japanese, but it is also a valid Chinese word, probably meaning similar things -- fluent; fast (responding); refreshing. Soukai is probably the Romaji for Japanese? Its Pinyin is shuang kuai.

NoelDeMartin commented 1 year ago

I know there is simple regex to select CJK (Chinese/Japanese/Korean) characters, e.g. https://stackoverflow.com/questions/2718196/find-all-chinese-text-in-a-string-using-python-and-regex . But maybe you also need other ones (e.g. Arabic?)

Yeah, that may fix CJK but it won't work with other languages. I've been looking into some code I wrote a while ago for searching text in a context with multiple languages and I realized I did it the other way around. Instead of trying to keep known characters, or remove a range of characters, I just removed punctuation (like ,, ., ', etc.) and replaced diacritics (á --> a, ö --> o, etc.). So I think that's what I'll do, both for search and slugs.
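A rough sketch of that approach, shared between search and slugs, might look like the following. It is not the code that was actually published; `normalizeText` and `slugify` are illustrative names, and replacing punctuation with a space (rather than deleting it outright) is an assumption made here so that words separated only by punctuation don't get glued together.

```typescript
// Sketch: replace diacritics and remove punctuation, leaving every other character alone.
function normalizeText(text: string): string {
    return text
        .normalize('NFD')                 // "á" becomes "a" + combining accent
        .replace(/[\u0300-\u036f]/g, '')  // drop combining diacritical marks (á --> a, ö --> o)
        .normalize('NFC')                 // recompose what is left (e.g. kana with dakuten, Hangul)
        .replace(/\p{P}+/gu, ' ')         // punctuation from any script: , . ' ， 。
        .replace(/\s+/g, ' ')
        .trim()
        .toLowerCase();
}

function slugify(text: string): string {
    return normalizeText(text).replace(/ /g, '-');
}

console.log(normalizeText('Crème brûlée!')); // "creme brulee"
console.log(slugify('Mojo Picón'));          // "mojo-picon"
console.log(slugify('土豆'));                 // "土豆"
console.log(slugify('番茄，土豆。'));          // "番茄-土豆"
```

A name made only of punctuation still produces an empty string here, so some fallback (for example a random identifier) would still be needed in that case.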

And speaking of CJK, that also leads to potential issues with Japanese. You see, Japanese (writing) is a combination of:

Kanji: Japanese variant of Chinese characters; only small difference; Kana: the phonetic representation, if I understand correctly.

Yes I know about that, I've actually been trying to learn Japanese for a while so I'm familiar with the basics. That's why many of my projects have Japanese names :). The irony of this app being called "Umai" and failing to accept Kanji is not lost on me xD.

So when encountering the characters that exist in both regular Chinese and Kanji, how would the transliteration work?

I see what you mean and it's probably an issue >.<. I think I'll stay away from transliteration. I'm honestly surprised that it took me this long to realize this, and that the practice to "slugify" text is so widespread without this accompanying disclaimer :/. It reminds me of this blog post: Falsehoods Programmers Believe About Names. Thanks for bringing awareness!

Maybe your Soukai library is an example of this. I believe you found the word "爽快" from Japanese, but it is also a valid Chinese word, probably meaning similar things -- fluent; fast (responding); refreshing. Soukai is probably the Romaji for Japanese? Its Pinyin is shuang kuai.

Yes, that's exactly it :). I found it using Jisho. It's fun to know that it's called shuang kuai in Chinese!

NoelDeMartin commented 1 year ago

Hey, thanks again for reporting this issue. I've published the changes I mentioned last time, so it should work in the live version now. I'm closing the issue, let me know if you find any more problems.

NoelDeMartin / umai