neoascetic closed this issue 10 years ago
Instead of slugifying them (which will probably have no meaning; I just tried with some Hebrew), wouldn't it be better to keep them as-is and URL-quote them?
This would be better for search engines, since it keeps the original meaning.
I also ran into this problem when I use non-ASCII characters in an article title. The filename of the article page is wrong.
+1 on the solution you're proposing.
Here is another drawback with this. For example, for the words тЕсты and тЭсты (note the second letter), unidecode produces the same word: testy. In most cases in Russian the meaning is the same (it is just a typo), but I don't know about other languages.
So pelican will overwrite the content of one tag with another if they produce the same slug, which is not good, I think.
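The collision can be demonstrated with a toy transliteration table (a tiny stand-in for Unidecode's full tables, just for illustration):

```python
# Toy transliteration table standing in for Unidecode (assumption:
# Unidecode maps both Е and Э to the Latin letter "e").
TRANSLIT = {'т': 't', 'е': 'e', 'э': 'e', 'с': 's', 'ы': 'y'}

def translit_slug(word):
    """Lowercase the word and transliterate it character by character."""
    return ''.join(TRANSLIT.get(ch, ch) for ch in word.lower())

slugs = {}
for tag in ('тЕсты', 'тЭсты'):
    slug = translit_slug(tag)
    if slug in slugs:
        # Two distinct tags landed on the same slug, so one tag page
        # would silently overwrite the other.
        print('collision: %r and %r both map to %r' % (slugs[slug], tag, slug))
    slugs.setdefault(slug, tag)
```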
I also tried this method for Chinese. It translated 测试 to Ce Shi (pinyin, the Chinese phonetic alphabet). But when two or more words have the same pronunciation, this method can't distinguish them. So I agree with @MeirKriheli.
We can use this function as `slugify`:

```python
import re

from markupsafe import Markup


def slugify(value):
    """
    Strips markup, normalizes whitespace and slashes to hyphens, and
    removes punctuation, while keeping non-ASCII characters as they are.
    """
    value = Markup(value).striptags()
    # Note: the flags must be passed via the ``flags`` keyword; as the
    # fourth positional argument, re.UNICODE would be taken as ``count``.
    value = re.sub(r'\s*-\s*', '-', value, flags=re.UNICODE)
    value = re.sub(r'[\s/]', '-', value, flags=re.UNICODE)
    value = re.sub(r'(\d):(\d)', r'\1-\2', value, flags=re.UNICODE)
    value = re.sub(r'[\'"?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', value, flags=re.UNICODE)
    return value.lower()
```
It preserves and slugifies all non-ASCII characters correctly, so tags work fine.
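For illustration, here is a self-contained sketch of the same approach, with a crude regex standing in for `Markup(value).striptags()` (my simplification, not Pelican's actual code):

```python
import re

def keep_unicode_slugify(value):
    # Crude tag stripper standing in for Markup(value).striptags().
    value = re.sub(r'<[^>]*>', '', value)
    value = re.sub(r'\s*-\s*', '-', value, flags=re.UNICODE)
    value = re.sub(r'[\s/]', '-', value, flags=re.UNICODE)
    value = re.sub(r'(\d):(\d)', r'\1-\2', value, flags=re.UNICODE)
    value = re.sub(r'[\'"?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', value, flags=re.UNICODE)
    return value.lower()

print(keep_unicode_slugify('Hello World'))   # hello-world
print(keep_unicode_slugify('тесты'))         # тесты (kept as-is)
```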
But working with utf-8 in pelican is very confusing for me. For example, tags with non-ASCII names are written to files correctly, while categories are in fact written as utf-8 files, but with garbage filenames (`\321\200\320\276\320\266\320\260.html` instead of `рожа.html`).
Is this because you support Windows too? Maybe it would be good to provide a setting where the user can specify his encoding? (As for me, I think supporting utf-8 is enough.)
Looks like it's a problem with urllib.quote() in slugify, since the % sign is escaped once again upon generation. Here's what I have so far, but no go yet:
```python
import re
import urllib

from markupsafe import Markup


def slugify(value):
    """
    Normalizes the string, converts it to lowercase, removes
    non-alphanumeric characters, and converts spaces to hyphens.
    """
    value = Markup(value).striptags()
    if not isinstance(value, unicode):  # Python 2: ensure a unicode object
        value = value.decode('utf-8')
    value = re.sub(r'[^\w\s-]', '', value, flags=re.U).strip().lower()
    return urllib.quote(re.sub(r'[-\s]+', '-', value, flags=re.U).encode('utf-8'))
```
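The double-escaping is easy to reproduce in a few lines (Python 3's `urllib.parse.quote` shown here for brevity; the code above uses Python 2's `urllib.quote`, which behaves the same way):

```python
from urllib.parse import quote

# Percent-encode the UTF-8 bytes of the slug once.
slug = quote('рожа')
print(slug)         # %D1%80%D0%BE%D0%B6%D0%B0

# If the writer quotes the already-quoted slug again, every '%' is
# escaped a second time (%25...) and the generated link breaks.
print(quote(slug))  # %25D1%2580%25D0%25BE%25D0%25B6%25D0%25B0
```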
URL-quoting is probably worse than converting to some ASCII chars. If you come up with a solution that works for you, please tell me. Whatever works :)
I don't think we can find a good solution to that problem.
One alternative is to figure out a way to explicitly specify slugs for tags as well. This method already works well for article slugs (when working with multiple languages).
How about optionally separating the tag and its slug with `|`? Something like:

```
:tags: פייתון|python, שלום עולם|hello-world
```
This solves the issue leaving the decision (and power) in the author's hands.
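A minimal sketch of parsing such a field (the `|` separator and the fall-back to a lowercased tag name are my reading of the proposal, not existing Pelican code):

```python
def parse_tags(field):
    """Split a ':tags:' metadata line into (display name, slug) pairs."""
    pairs = []
    for chunk in field.split(','):
        name, _, slug = chunk.strip().partition('|')
        # If no explicit slug was given, fall back to the lowercased name.
        pairs.append((name.strip(), slug.strip() or name.strip().lower()))
    return pairs

print(parse_tags('פייתון|python, שלום עולם|hello-world'))
```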
Almost all modern browsers display non-ASCII nicely in the address bar.
It's not an issue of browsers (client side), but of the server side. The problem is generating filenames which will also be served correctly by various servers (several tests here failed).
Plus, a simple slug is cleaner and more concise (sending such links via email, for example, shows them in all their ugliness).
I'm getting an error when specifying category names with Unicode characters. Same error with the Markdown tag "Category: " and with a directory name containing Unicode chars.
Why couldn't you just do `from __future__ import unicode_literals`?
There's another problem: if the same tag is specified in different languages, it is overwritten by the articles in the other language.
I would prefer having the tag page generated for each language, with the articles relevant to that language, and having settings like:
`TAG_LANG_URL` and `TAG_LANG_SAVE_AS`
That way, one can specify the tags in the way I've described above, with an explicit tag name and slug, and have them generated for each language.
I'll try to whip something up.
@MeirKriheli any news on that?
I ran into similar problems, so I made some changes at https://github.com/hooli/pelican/tree/dev-utf8 which might help.
Quick summary:
1) Assumed the output file system supports UTF-8. Added a check to catch reserved characters not allowed in the file system. I defaulted this to removing "/" and ":" (for HFS+) from metadata, but added two settings so it can be configured. I think Microsoft also suggests avoiding '*?"<>|', but NTFS can use them.
Files would obviously need to be FTP'd with UTF-8 encoding on, to preserve the values.
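A sketch of that kind of configurable check (the setting names here are illustrative stand-ins, not the ones in the branch above):

```python
import re

# Hypothetical settings: characters the target file system cannot store,
# and what to replace them with.
FILENAME_RESERVED_CHARS = '/:'   # the HFS+ defaults mentioned above
FILENAME_REPLACEMENT = '-'

def safe_filename(name):
    """Replace file-system-reserved characters in a metadata-derived name."""
    pattern = '[%s]' % re.escape(FILENAME_RESERVED_CHARS)
    return re.sub(pattern, FILENAME_REPLACEMENT, name)

print(safe_filename('a/b:c.html'))   # a-b-c.html
```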
2) Split out URL and SAVE_AS, so SAVE_AS is the plain UTF-8 filename and URL gets %-encoded. This could be made a setting (URI-encoded or IRI Unicode links), since international links don't really need quoting. Tested on Chrome and Safari, and they swap between encoded and Unicode links fine (i.e. %C3%A9 will be shown as é). We could also encode spaces as %20 (the browsers do this anyhow, so the URL is not split on copy and paste). There doesn't seem to be a recommendation either way for what should be quoted in an IRI. Chrome and Safari visually encode different characters from UTF-8 links, but they all work without encoding. HTML5 mentions that the reserved characters '\/:*?"<>|' should be escaped, but I guess that's for URIs. It would make pages more backwards compatible, but I'm unsure whether they are also referring to IRIs.
It seems quite easy to accidentally convert a Unicode value to a string representation, which can then cause a problem later if you need to output UTF-8, since it finds a Unicode value in an ASCII string. I changed some format statements, since "x{0}" % u"é" gives a str rather than a unicode value.
3) The Mac uses a case-insensitive file system, so I noticed that some tag files get overwritten by others if the case differs. Added a setting and a quick hack to make the URLWrapper lookups case-insensitive. The case of the displayed values isn't affected.
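The URL/SAVE_AS split from point 2 can be sketched like this (the function name and the `.html` suffix are mine, for illustration only):

```python
from urllib.parse import quote

def url_and_save_as(slug):
    """SAVE_AS keeps the plain UTF-8 filename; URL gets %-encoded."""
    save_as = '%s.html' % slug
    url = quote(save_as, safe='/')   # percent-encode everything but '/'
    return url, save_as

url, save_as = url_and_save_as('рожа')
print(save_as)   # рожа.html (written to disk as-is)
print(url)       # %D1%80%D0%BE%D0%B6%D0%B0.html (used in links)
```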
@ametaireau : Nope, sorry. Needed more features for multilingual sites so I started my own static generator: https://github.com/MeirKriheli/statirator
Using it to generate my own web site: http://meirkriheli.com/en/
Would somebody else in this thread care to work on this?
@neoascetic: What do you think? Would you like to work on this?
No, sorry. I have no free time.
I hit both of these problems on my Arch Linux system too.
This issue has been open for two years without significant activity, so I'm closing it. If anyone would like to actively work on it, pull requests are welcome. (^_^)
Sorry for my English. I have one question. I know the URL is converted to lowercase by the `slugify` function, but how do I keep the original category string for display?

Right now I get:

```
Category: [Happy Day]
<a href="http://myweb/category/happy-day.html">[happy day]</a>

Category: Happy Day
<a href="http://myweb/category/happy-day.html">Happy day</a>
```

How can I show this instead? Thank you.

```
<a href="http://myweb/category/happy-day.html">[Happy Day]</a>
```
The `Category` object has a `name` attribute, so you could use `{{ category.name }}` in your templates. The default `CATEGORY_URL` uses the `slug` attribute; that's why it slugifies it.
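For example, a link that displays the original name while using the slugified URL (standard Jinja2 template usage; `SITEURL` and `category.url` are Pelican's usual template variables):

```html
<a href="{{ SITEURL }}/{{ category.url }}">{{ category.name }}</a>
```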
Next time, please ask support questions on our IRC channel or open separate issues. Asking for unrelated support by commenting on old, closed and unrelated issues is really bad practice.
I ran into some errors. If I have non-ASCII characters in a category name or in any tag name (тесты, for example), the corresponding files don't get generated (in fact, they are written to a file named .html, because the slug ends up empty). I found that this is because of the `slugify` function: it removes some Unicode characters (Cyrillic, in my case). Since this is based on a Django function, Django has the bug too. I found the Unidecode package, which can help with handling these filenames by transliterating them before slugification:
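Something along these lines (a sketch with a tiny inline table standing in for Unidecode, which ships full transliteration tables for most scripts):

```python
import re

# Tiny stand-in for unidecode.unidecode(), covering only this example.
CYRILLIC = {'т': 't', 'е': 'e', 'с': 's', 'ы': 'y'}

def slugify_translit(value):
    """Transliterate to ASCII first, then apply Django-style slugification."""
    value = ''.join(CYRILLIC.get(ch, ch) for ch in value.lower())
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)

print(slugify_translit('тесты'))   # testy
```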
What are your thoughts on this?
I also get errors in the logging module while processing non-ASCII characters in `report.msg`, but that is another story.