getnikola / nikola

A static website and blog generator
https://getnikola.com/
MIT License

Disable asciination of titles/slugs #1321

Closed gerritsangel closed 10 years ago

gerritsangel commented 10 years ago

Hello,

First of all, of course thanks for the great application!

Currently, Nikola converts every non-ASCII title to an ASCII-only slug. I have a hard time seeing the benefit of that. Most languages suffer quite a bit when they are forced into ASCII. And is this still really necessary? In 2014, most web servers have no problem serving non-ASCII filenames, and for web browsers it is no problem at all. Take a look at Wikipedia: there are non-ASCII file names everywhere.

The other problem is that many languages suffer quite severely during the conversion. Languages written in the Latin alphabet may only look weird, but Japanese, for example, is converted horribly (it is transliterated as if it were Chinese, which is completely incomprehensible to a Japanese reader). The current Japanese conversion effectively gives the same result as if the file name contained only random characters.

For example, if I have a post with the Japanese title 初めての日本語の記事 (hajimete no nihongo no kiji), I get the following slug: posts/chu-metenori-ben-yu-noji-shi.rst
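For the curious, this mangling can be reproduced without Nikola. Transliterators such as Unidecode (which Nikola's slugify relies on) map each kanji to its Mandarin pinyin reading. The following is a hypothetical mini-table covering only the characters in this title, written to mimic that behavior; it is a sketch, not Nikola's actual code:

```python
import re

# Hypothetical mini-table mimicking a pinyin-based transliterator
# (e.g. Unidecode): each kanji gets its Mandarin reading, while the
# kana pass through with their Japanese syllable values.
PINYIN_STYLE = {
    '初': 'chu ', 'め': 'me', 'て': 'te', 'の': 'no',
    '日': 'ri ', '本': 'ben ', '語': 'yu ',
    '記': 'ji ', '事': 'shi ',
}

def ascii_slug(title):
    """Transliterate to ASCII, then collapse everything else to hyphens."""
    ascii_text = ''.join(PINYIN_STYLE.get(ch, ch) for ch in title)
    return re.sub(r'[^a-z0-9]+', '-', ascii_text.lower()).strip('-')

print(ascii_slug('初めての日本語の記事'))  # chu-metenori-ben-yu-noji-shi
```

Because the kanji are read with their Chinese values and the spacing follows the transliteration table rather than Japanese word boundaries, the slug above comes out exactly as garbled as reported.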

I don’t know, this is so wrong it even hurts :D

It would be really good if this feature could be disabled. Readable slugs are, in my opinion, really a must-have feature :)

Thank you :)

Kwpolska commented 10 years ago

Just change the slug yourself. We would have to do magic for all this, because some people (and one of them might be named Windows) might struggle here or there. I am all in favor of the "it's 2014" argument in most cases, but Windows, and some misconfigured Unices, might not let us here.

gerritsangel commented 10 years ago

Yes, I am doing that right now, but well... :D

Would it maybe be possible to include an optional configuration option in conf.py, if it is not too much work? I am not well versed in the Nikola source code, but isn't it possible to do something like this:

if convert_titles:
    slug = convert_slug(title)
else:
    slug = title

?
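Fleshing out the idea above, a minimal self-contained sketch of such a config-gated slugify might look like this. The flag name USE_SLUGIFY and both helper functions are assumptions for illustration; whatever lands in Nikola may use different names and logic:

```python
import re
import unicodedata

def slugify_ascii(title):
    # Force ASCII: NFKD-normalize, then drop anything non-ASCII.
    ascii_text = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore').decode()
    return re.sub(r'[^a-z0-9]+', '-', ascii_text.lower()).strip('-')

def slugify_unicode(title):
    # Keep non-ASCII letters; only replace whitespace and path separators,
    # which cannot appear in a file name / URL path segment.
    return re.sub(r'[\s/\\]+', '-', title.strip())

def make_slug(title, use_slugify=True):
    # use_slugify would come from conf.py (hypothetical USE_SLUGIFY flag).
    return slugify_ascii(title) if use_slugify else slugify_unicode(title)

print(make_slug('初めての日本語の記事', use_slugify=False))  # 初めての日本語の記事
```

With the flag off, the Japanese title survives intact; with it on, behavior stays backwards-compatible.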

Of course, no problem if this is not done, it was just a proposal from my side :)

Kwpolska commented 10 years ago

Certainly doable. And it might be done soon (it's about 5 minutes' work, and that's quite generous).

gerritsangel commented 10 years ago

Wow, great, thank you so much :) I know that ASCII is sometimes necessary, but if the environment fully supports Unicode, I guess it should not be such a great problem.

Thanks!

Kwpolska commented 10 years ago

There, #1330 handles this, though it requires a lot of testing. Also, bonus question:

(it is converted in Chinese style, which is completely incomprehensible to a Japanese)

I thought Chinese and Japanese have separate glyphs. Is that not true?

gerritsangel commented 10 years ago

Well, Japanese also uses Chinese characters, but adds some of its own characters on top. Basically, the "complicated" ones are Chinese and the simple ones are Japanese.

For example in my example: 初 Chinese めての Japanese 日本語 Chinese の Japanese 記事 Chinese

The problem is that the pronunciation/reading of a Chinese character in Japanese is not a single one-to-one mapping. For example, 初 is always chu in Chinese, but in Japanese it is sometimes sho, sometimes hatsu, and in the context above, with めて following, haji (as in haji-mete), etc.
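To make the ambiguity concrete, here is a tiny illustration (the readings are standard dictionary values; the word list is just an example I am adding, not something from Nikola):

```python
# 初 alone has several Japanese readings; which one is correct depends
# on the surrounding word, so a per-character transliterator must guess.
READINGS = ['sho', 'hatsu', 'haji']  # on'yomi, on'yomi, kun'yomi stem

examples = {
    '初めて': 'haji',   # hajimete, "for the first time"
    '最初': 'sho',      # saisho, "the beginning"
    '初雪': 'hatsu',    # hatsuyuki, "first snow"
}

for word, reading in examples.items():
    print(f'{word}: 初 is read "{reading}" here (candidates: {READINGS})')
```

Three common words, three different readings of the same character; nothing at the character level distinguishes them.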

Basically this creates the problem that the converter has to know whether something is Japanese or Chinese. In my example this is clear because it contains Japanese characters. But a valid Japanese sentence (well, mostly captions or titles) might contain no Japanese-only characters at all, only Chinese ones. Or it might be valid both as Chinese and as Japanese. And the transliterations of Chinese and Japanese are always completely different.

There are tools that can generate a more or less good transliteration of Japanese, but even they sometimes get things wrong (personal names are especially difficult). And that is under the assumption that the converter already knows the text is Japanese and hands it to a Japanese-to-Latin converter. Even then, Japanese speakers usually do not read their own language in Latin letters, so the result is really unusable. On top of that, Chinese and Japanese are written without word spacing, but a transcription usually should have word spacing. Inserting it requires a lot of syntactic parsing and usually knowledge of the content itself, so a computer will have many problems with that.

Therefore, the best option is not to convert the sentence at all :)

Kwpolska commented 10 years ago

Thank you for this detailed explanation. As I said before, #1330 is an attempt at this. Please test it and report any bugs you manage to discover.