getpelican / pelican

Static site generator that supports Markdown and reST syntax. Powered by Python.
https://getpelican.com
GNU Affero General Public License v3.0

Unicode in tags, categories #332

Closed neoascetic closed 10 years ago

neoascetic commented 12 years ago

I ran into some errors. If a category name or any tag name contains non-ASCII characters (тесты, for example), the corresponding files are not generated (in fact, they all get written to a file named just .html). I found that this is caused by the slugify function: it removes some Unicode characters (Cyrillic in my case). Since it is based on the Django function, Django has the same bug.
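
A minimal sketch of why the slug comes out empty, assuming slugify follows the older Django approach of normalizing the string and then dropping everything that is not ASCII. The name ascii_slugify is hypothetical and this is not Pelican's actual code:

```python
import re
import unicodedata


def ascii_slugify(value):
    """Hypothetical stand-in for an ASCII-only slugify."""
    # Decompose accented characters, then drop everything that is not ASCII.
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, trim, lowercase, and collapse whitespace into hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)


print(repr(ascii_slugify("Hello World")))      # 'hello-world'
print(repr(ascii_slugify("тесты")))            # ''
print(repr(ascii_slugify("тесты") + ".html"))  # '.html'
```

With an empty slug, every Cyrillic-only tag would map to the same .html filename.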

I found the Unidecode package, which can help handle these filenames by transliterating them before slugification:

import unidecode
value = unidecode.unidecode(value)  # 'тесты' becomes 'testy'

What are your thoughts on this?

I also get errors in the logging module when it processes non-ASCII characters in report.msg, but that is another story.

MeirKriheli commented 12 years ago

Instead of slugifying them (which will probably carry no meaning; I just tried with some Hebrew), wouldn't it be better to keep them as they are and URL-quote them?

This will be better for search engines, as it keeps the original meaning.

plucury commented 12 years ago

I also ran into this problem when using non-ASCII characters in an article title. The filename of the article page is wrong.

almet commented 12 years ago

+1 on the solution you're proposing.

neoascetic commented 12 years ago

Here is another drawback of this approach. For example, for the words тЕсты and тЭсты (note the second letter), unidecode produces the same word, testy. In most cases in Russian the meaning is the same and this is just a typo, but I don't know about other languages.

So Pelican will overwrite the content of one tag with another if they produce the same slug, which I think is not good.
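
A rough sketch of how such collisions could be detected before any file is written. The tiny TRANSLIT table and toy_slug are hypothetical stand-ins for unidecode plus slugify, used only so the example runs without extra packages:

```python
from collections import defaultdict

# Hypothetical, deliberately tiny transliteration table (not unidecode itself).
TRANSLIT = {"т": "t", "е": "e", "э": "e", "с": "s", "ы": "y"}


def toy_slug(name):
    """Lowercase the name and transliterate the characters we know about."""
    return "".join(TRANSLIT.get(ch, ch) for ch in name.lower())


def find_collisions(names, slug_func):
    """Group names by slug and return only the slugs shared by several names."""
    by_slug = defaultdict(list)
    for name in names:
        by_slug[slug_func(name)].append(name)
    return {slug: group for slug, group in by_slug.items() if len(group) > 1}


collisions = find_collisions(["тесты", "тэсты", "python"], toy_slug)
print(collisions)
```

A generator could warn about every entry in the returned dict instead of silently overwriting one tag page with another.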

plucury commented 12 years ago

I also tried this method for Chinese. It transliterated 测试 to Ce Shi (the Chinese phonetic alphabet). But when two or more words have the same pronunciation, this method cannot distinguish them. So I agree with @MeirKriheli.

neoascetic commented 12 years ago

We can use this function as slugify:

import re

from markupsafe import Markup  # assumed import; the original post omitted it


def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    value = Markup(value).striptags()
    # Pass flags by keyword: the fourth positional argument of re.sub is count.
    value = re.sub(r'\s*-\s*', '-', value, flags=re.UNICODE)
    value = re.sub(r'[\s/]', '-', value, flags=re.UNICODE)
    value = re.sub(r'(\d):(\d)', r'\1-\2', value, flags=re.UNICODE)
    value = re.sub(r'[\'"?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', value, flags=re.UNICODE)
    return value.lower()

It keeps and slugifies all non-ASCII characters correctly, so tags work fine. But working with UTF-8 in Pelican is very confusing for me. For example, tags with non-ASCII names are written to correctly named files, while categories are in fact written as UTF-8 files, but with garbage filenames (\321\200\320\276\320\266\320\260.html instead of рожа.html).

Is this because you support Windows too? Maybe it would be good to provide a setting where the user can specify their encoding? (As for me, I think supporting UTF-8 is enough.)
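
The garbage name looks like the UTF-8 bytes of the real name printed as octal escapes, so the bytes on disk may be fine and only their display differs. A small sketch to check that; decode_octal_escapes is a hypothetical helper, not part of Pelican:

```python
def decode_octal_escapes(text):
    """Convert a string of backslash-separated octal byte values into bytes."""
    parts = [part for part in text.split("\\") if part]
    return bytes(int(part, 8) for part in parts)


escaped = r"\321\200\320\276\320\266\320\260"
raw = decode_octal_escapes(escaped)
print(raw.decode("utf-8"))  # рожа
```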

MeirKriheli commented 12 years ago

Looks like it's a problem to urllib.quote() it in slugify, since the % sign is escaped once again upon generation. Here's what I have so far, but it doesn't work yet:

# Python 2 code: relies on the unicode type and urllib.quote.
import re
import urllib

from markupsafe import Markup  # assumed import; the original post omitted it


def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    value = Markup(value).striptags()
    if not isinstance(value, unicode):
        value = value.decode('utf-8')
    value = re.sub(r'[^\w\s-]', '', value, flags=re.U).strip().lower()
    return urllib.quote(re.sub(r'[-\s]+', '-', value, flags=re.U).encode('utf-8'))
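
The double escaping described above can be reproduced in isolation. This sketch uses Python 3's urllib.parse.quote rather than the Python 2 urllib.quote, and assumes that a second quoting pass happens somewhere at generation time:

```python
from urllib.parse import quote, unquote

slug = "тесты"
quoted_once = quote(slug)          # what a quoting slugify would return
quoted_twice = quote(quoted_once)  # what a second quoting pass produces

print(quoted_once)   # %D1%82%D0%B5%D1%81%D1%82%D1%8B
print(quoted_twice)  # every % has become %25
print(unquote(quoted_twice) == quoted_once)  # True
```

Once the % signs are themselves escaped, a browser decoding the link once gets the percent-encoded text back, not the original tag name.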

almet commented 12 years ago

URL quoting is probably worse than converting to some ASCII characters. If you end up with a solution that works for you, please tell me. Whatever works :)

MeirKriheli commented 12 years ago

I don't think we can find a good solution to that problem.

One alternative is to figure out a way to explicitly specify slugs for tags as well. This method already works well for article slugs (when working with multiple languages).

How about optionally separating the tag and its slug with |? Something like:

:tags: פייתון|python, שלום עולם|hello-world

This solves the issue leaving the decision (and power) in the author's hands.
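
A minimal sketch of how such a value could be parsed, assuming a comma-separated list in which | is optional. parse_tags is a hypothetical helper, not an existing Pelican function:

```python
def parse_tags(value):
    """Split 'name|slug, name|slug' into (name, slug) pairs.

    The slug is None when the author did not give one explicitly.
    """
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, slug = item.partition("|")
        pairs.append((name.strip(), slug.strip() if sep else None))
    return pairs


print(parse_tags("тесты|tests, hello world|hello-world, misc"))
```

Tags without an explicit slug would still fall back to whatever slugify produces.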

neoascetic commented 12 years ago

Almost all modern browsers display non-ASCII characters in the address bar nicely.

MeirKriheli commented 12 years ago

It's not a browser (client-side) issue, but a server-side one. The problem is generating filenames that will also be served correctly by various servers (several tests here failed).

Plus, a simple slug is much cleaner and more concise (sending such links via email, for example, shows them in all their ugliness).

utdemir commented 12 years ago

I'm getting an error when specifying category names with Unicode characters. I get the same error with the Markdown metadata field "Category: " and with a directory name containing Unicode characters.

Why couldn't you just use from __future__ import unicode_literals?

MeirKriheli commented 12 years ago

There's another problem: if the same tag is specified in different languages, its page gets overwritten by the articles in the other language.

I would prefer having the tag page generated for each language, with the articles relevant to that language, and have settings like:

TAG_LANG_URL
TAG_LANG_SAVE_AS

That way, one can specify the tags in the way I've described above, with an explicit tag name and slug, and have them generated for each language.
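
To illustrate, a sketch of how the two proposed settings might be used. TAG_LANG_URL and TAG_LANG_SAVE_AS are only proposed in this thread, and both the pattern values and the tag_page_path helper below are assumptions:

```python
# Hypothetical values for the proposed settings (not existing Pelican settings).
TAG_LANG_URL = "tag/{slug}-{lang}.html"
TAG_LANG_SAVE_AS = "tag/{slug}-{lang}.html"


def tag_page_path(slug, lang, pattern=TAG_LANG_SAVE_AS):
    """Fill a per-language tag pattern with a slug and a language code."""
    return pattern.format(slug=slug, lang=lang)


print(tag_page_path("python", "he"))  # tag/python-he.html
print(tag_page_path("python", "en"))  # tag/python-en.html
```

With one output path per language, the same tag in two languages no longer competes for a single file.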

I'll try to whip something up.

almet commented 11 years ago

@MeirKriheli any news on that?

hooli commented 11 years ago

I ran into similar problems, so made some changes at https://github.com/hooli/pelican/tree/dev-utf8 which might help.

Quick summary:

1) Assumed the output file system supports UTF-8. Added a check to catch reserved characters not allowed in the file system. I defaulted this to removing "/" and ":" (for HFS+) from metadata, but added two settings so it can be configured. I think Microsoft also suggests avoiding '*?"<>|', but NTFS can use them.

Files would obviously need to be FTP'ed with UTF-8 encoding on, to preserve the values.
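
A sketch of the kind of reserved-character check described in point 1, with "/" and ":" as the default set. strip_reserved and its reserved parameter are hypothetical names, not the ones used in the branch:

```python
def strip_reserved(value, reserved="/:"):
    """Remove characters that the target file system does not allow in names."""
    return "".join(ch for ch in value if ch not in reserved)


print(strip_reserved("C/C++: notes"))                  # CC++ notes
print(strip_reserved('what?*', reserved='/:*?"<>|'))  # what
```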

2) Split out URL and SAVE_AS, so SAVE_AS is the plain UTF-8 filename and URL gets percent-encoded. This could be made a setting (URI-encoded or IRI Unicode links), since international links don't really need quoting. I tested on Chrome and Safari, and they switch between encoded and Unicode links fine (i.e. %C3%A9 is shown as é). Spaces could be encoded as %20 (the browsers do this anyhow, so the URL is not split on copy and paste). There doesn't seem to be a recommendation either way on what should be quoted for an IRI. Chrome and Safari visually encode different characters in UTF-8 links, but they all work without encoding. HTML5 mentions that the reserved characters '\/:*?"<>|' should be escaped, but I guess that applies to URIs. Escaping them would make pages more backwards compatible, but I'm unsure whether the recommendation also refers to IRIs.
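
The split in point 2 could look roughly like this in Python 3. url_and_save_as is a hypothetical helper, and the .html suffix is an assumption:

```python
from urllib.parse import quote


def url_and_save_as(slug, suffix=".html"):
    """Return (url, save_as): a percent-encoded link and a plain UTF-8 filename."""
    save_as = slug + suffix
    url = quote(save_as)  # encodes as UTF-8, leaves "/" and "." untouched
    return url, save_as


print(url_and_save_as("caf\u00e9"))     # ('caf%C3%A9.html', 'café.html')
print(url_and_save_as("hello world"))  # ('hello%20world.html', 'hello world.html')
```

Because the filename is never quoted, there is nothing for a later step to escape a second time.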

It seems quite easy to accidentally convert a unicode value to a string representation, which then can cause a problem later if you need to output UTF-8, since it then finds a unicode value in an ascii string. I changed some format statements since "x{0}" % u"é" gives a str rather than a unicode value.

3) The Mac uses a case-insensitive file system, so I noticed some tag files getting overwritten by others if only the case differs. Added a setting and a quick hack to make the URLWrapper lookups case-insensitive. The case of the displayed values isn't affected.
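
One way to sketch the case-insensitive lookup from point 3 while keeping the displayed value intact. CaseInsensitiveIndex is a hypothetical class, not the actual URLWrapper change:

```python
class CaseInsensitiveIndex:
    """Look names up without regard to case, keeping the first spelling seen."""

    def __init__(self):
        self._items = {}

    def add(self, name):
        # casefold() is a more aggressive lower() intended for caseless matching.
        return self._items.setdefault(name.casefold(), name)

    def get(self, name):
        return self._items.get(name.casefold())


index = CaseInsensitiveIndex()
print(index.add("Python"))  # Python
print(index.add("python"))  # Python (same entry, original spelling kept)
print(index.get("PYTHON"))  # Python
```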

MeirKriheli commented 11 years ago

@ametaireau : Nope, sorry. Needed more features for multilingual sites so I started my own static generator: https://github.com/MeirKriheli/statirator

Using it to generate my own web site: http://meirkriheli.com/en/

justinmayer commented 11 years ago

Would somebody else in this thread care to work on this?

@neoascetic: What do you think? Would you like to work on this?

neoascetic commented 11 years ago

No, sorry. I have no free time.

Zuckonit commented 11 years ago

I hit these two problems on my Arch Linux system:

  1. Missing dependencies for md
  2. 'ascii' codec can't decode byte 0xe6 in position 63: ordinal not in range(128)
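
One common way an error like the second one arises is decoding UTF-8 bytes with the ASCII codec. This sketch only reproduces the message in Python 3 and does not claim to be the cause in this particular setup:

```python
raw = "测试".encode("utf-8")  # the first byte is 0xe6

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = str(exc)
    print(error)

# Decoding with the right codec works.
print(raw.decode("utf-8"))
```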

russkel commented 11 years ago

#729 is probably related.

justinmayer commented 10 years ago

This issue has been open for two years without significant activity, so I'm closing it. If anyone would like to actively work on it, pull requests are welcome. (^_^)

grtfou commented 9 years ago

Sorry for my English. I have one question. I know the URL is converted to lowercase by the slugify function. But how can I keep the original category string for display?

Category: [Happy Day]
<a href="http://myweb/category/happy-day.html">[happy day]</a>
Category: Happy Day
<a href="http://myweb/category/happy-day.html">Happy day</a>

How can I show the following instead? Thank you.

<a href="http://myweb/category/happy-day.html">[Happy Day]</a>

smartass101 commented 9 years ago

The Category object has a name attribute, so you could use {{category.name}} in your templates. The default CATEGORY_URL uses the slug attribute, which is why the URL is slugified.
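
A plain-Python sketch of the name versus slug distinction. The Category class below is a simplified, hypothetical stand-in for Pelican's own object, and its slug rule is an assumption:

```python
import html
import re


class Category:
    """Simplified stand-in: keeps the original name and derives a slug from it."""

    def __init__(self, name):
        self.name = name

    @property
    def slug(self):
        value = re.sub(r"[^\w\s-]", "", self.name).strip().lower()
        return re.sub(r"[-\s]+", "-", value)


def link(category):
    """Use the slug for the URL and the untouched name for the link text."""
    return '<a href="category/{}.html">{}</a>'.format(
        category.slug, html.escape(category.name)
    )


print(link(Category("[Happy Day]")))
# <a href="category/happy-day.html">[Happy Day]</a>
```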

Next time, please ask support questions on our IRC channel or open separate issues. Asking for unrelated support by commenting on old, closed and unrelated issues is really bad practice.