bgarrels / textpattern

Automatically exported from code.google.com/p/textpattern

Problems with multi-language characters in article titles (with permalink options) #65

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create an article with multi-language characters in the "title".
2. Use a permalink mode that includes the title (e.g. /year/month/day/title).
3. Go to the Textpattern front page.
4. The articles are created successfully, but accessing an article
fails because the title is not encoded correctly.

What is the expected output? What do you see instead?
EXPECT: The correct URL is shown.
INSTEAD: A decode error?

What version of the product are you using? On what operating system?
Textpattern version: 4.2.0 (r3275)
CentOS5

Please provide any additional information below.
Fix whatever gets written into `textpattern.url_title`.
When I look at the data in MySQL, the values are stored as "-",
which I think is not too hard to fix.

The same problem affects `txp_section.section_name`, which makes it
easy to check.
Set any multi-language character as a Section name from the admin panel
(Presentation -> Sections); after saving, it comes out as "-".

I tested this using Japanese.

Regards,
nn--

Original issue reported on code.google.com by dev.nnnn@gmail.com on 28 Apr 2010 at 4:53

GoogleCodeExporter commented 9 years ago
I think (someone correct me if I'm wrong here) that the problem is there's no
"dumbdown" facility for Japanese in TXP.

There's a function in lib/txplib_misc.php called dumbDown() which converts
'foreign' (predominantly European/Middle-eastern) character sequences to their
(rough) ASCII equivalents so that a url-only title can be automatically
constructed. Also look in the file lib/i18n-ascii.txt for more.

Since there's no dumbdown for Japanese (among others), when you type an entirely
Japanese article title (e.g. 金魚) and save it, TXP can't automatically create a
URL title for the article. In the latest SVN version you'll see a warning that the
article contains an empty url-only title. In TXP 4.2.0 you will probably get an
article with an erroneous single dash (see Issue 36, now fixed).
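
To illustrate why an all-Japanese title ends up empty, here is a rough sketch of the transliterate-then-sanitize idea. It is written in Python purely for illustration (Textpattern itself is PHP). The tiny `ASCII_MAP` table and the functions `dumb_down` and `url_title` are hypothetical stand-ins, not the contents of lib/i18n-ascii.txt or the real dumbDown().

```python
import re

# Hypothetical, tiny transliteration table; the real lib/i18n-ascii.txt
# is far larger and is not reproduced here.
ASCII_MAP = {"é": "e", "ü": "u", "ß": "ss", "ø": "o"}


def dumb_down(text: str) -> str:
    """Replace characters that have a known ASCII equivalent."""
    return "".join(ASCII_MAP.get(ch, ch) for ch in text)


def url_title(title: str) -> str:
    """Build a URL-only title: transliterate, then keep ASCII letters/digits."""
    text = dumb_down(title).lower()
    # Anything that is still not an ASCII letter or digit becomes a separator.
    text = re.sub(r"[^a-z0-9]+", "-", text)
    # Trim separators left over at either end.
    return text.strip("-")


print(url_title("Café über Straße"))  # cafe-uber-strasse
print(repr(url_title("金魚")))         # ''
```

In this sketch no character of 金魚 has an entry in the table, so nothing survives sanitizing. If the final `strip("-")` were skipped, a lone "-" would remain.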

The upshot is that, for now, you'll have to type in a URL-only title manually
that conforms to RFC 1738 (http://www.faqs.org/rfcs/rfc1738.html).

If you have any ideas on how dumbDown() / i18n-ascii.txt can be made to convert
Japanese characters into ASCII then please send them over!

Original comment by stefdawson on 4 May 2010 at 1:55

GoogleCodeExporter commented 9 years ago
Thanks for the reply.

I've been testing things since your reply, which really helped me solve(?) the
problem.

Making a new article with a title that includes Japanese doesn't work, as
reported, but if I edit the article's "url-only title" under the advanced
options, everything works fine, even using only Japanese characters.
I'm not really sure, but urlencode() at line 2803 in publish/taghandlers.php is
probably what makes this work. So I just commented out line 722 in
lib/txplib_misc.php ($text = sanitizeForUrl($text);) to skip dumbDown() and all
the replacements, which seems to work fine right now.
If the titles are passed through urlencode() anyway, I don't see why we need
dumbDown() and the other steps in sanitizeForUrl(). But considering that some
(or many) people might prefer dumbDown(), adding an option to choose between
dumbDown() and plain urlencode() might work.
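
As a rough sketch of what such an option could look like (in Python for illustration only, since TXP is PHP): the `transliterate` flag is a hypothetical setting, not an existing TXP preference, and the ASCII branch is a simplified stand-in for sanitizeForUrl(). Falling back to percent-encoding when the ASCII branch yields nothing is an added assumption.

```python
import re
from urllib.parse import quote


def url_title(title: str, transliterate: bool = True) -> str:
    """Return a URL-safe title, either ASCII-only or percent-encoded.

    `transliterate` is a hypothetical option used for illustration.
    """
    if transliterate:
        # Simplified stand-in for the ASCII-only path: keep letters and
        # digits, collapse everything else into single hyphens.
        ascii_title = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        if ascii_title:
            return ascii_title
    # Option disabled, or nothing survived: percent-encode the UTF-8 bytes.
    return quote(title.strip(), safe="")


print(url_title("Hello World"))                       # hello-world
print(url_title("Hello World", transliterate=False))  # Hello%20World
print(url_title("金魚"))                               # %E9%87%91%E9%AD%9A
```

With this shape, sites that prefer readable ASCII slugs keep them, while titles in scripts without a transliteration still get a non-empty URL title.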

I have been testing a Japanese url-title on my website at the following
address, where you can see that it actually works:
http://nnnnn.me/textpattern/normal/

Also, related to this topic, I suggest using rawurlencode() instead of
urlencode() in TXP in order to follow RFC 1738 more closely.
http://www.php.net/manual/en/function.rawurlencode.php
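
The practical difference between the two styles is how a space is encoded. The sketch below uses Python's `quote_plus()` and `quote()` as analogues of PHP's urlencode() and rawurlencode(); it only illustrates the behavior and is not TXP code. The sample title is made up.

```python
from urllib.parse import quote, quote_plus

title = "金魚 fish"  # made-up sample title

# Form-style encoding: a space becomes "+", like PHP's urlencode().
form_encoded = quote_plus(title)

# Path-style encoding: a space becomes "%20", like PHP's rawurlencode().
path_encoded = quote(title, safe="")

print(form_encoded)  # %E9%87%91%E9%AD%9A+fish
print(path_encoded)  # %E9%87%91%E9%AD%9A%20fish
```

The non-ASCII characters are percent-encoded identically in both cases; only the treatment of the space differs.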

urlencode() can be found in the following files and lines:
publish.php:275,277,280,283,287,386
include/txp_discuss.php:378
include/txp_file.php:210,211
include/txp_image.php:200
include/txp_plugin.php:99,102,227
lib/txplib_html.php:102,103,114,127,130,164,1912,1972,1979,1986,2000
publish/taghandlers.php:2802,2803

thank you

Original comment by dev.nnnn@gmail.com on 7 May 2010 at 9:14

GoogleCodeExporter commented 9 years ago
Changeset #3344 introduces a fallback for languages which lack a suitable
transliteration.

The two instances of urlencode() in include/txp_file.php, plus the ones in
publish.php, have counterparts using urldecode() and are only used internally.
Thus, while this may not comply with RFC 1738, the encoding method is
insignificant as long as the two match.
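
A small round-trip check shows why a matching pair is enough for internal use. This is a Python analogue (`quote_plus`/`unquote_plus` standing in for PHP's urlencode()/urldecode()), and the sample value is made up.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

value = "my file+v2 金魚.txt"  # made-up sample value

# Matching pairs: what goes in comes back out unchanged.
plus_round_trip = unquote_plus(quote_plus(value))
raw_round_trip = unquote(quote(value, safe=""))
print(plus_round_trip == value, raw_round_trip == value)  # True True

# Mismatched pair: "+"-for-space is not undone by the plain decoder.
mismatched = unquote(quote_plus("a b"))
print(mismatched)  # a+b
```

Problems only appear when the encoder and decoder come from different styles, or when the encoded value is exposed somewhere that expects the other style.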

I haven't looked into the other instances you mentioned, so please open a
separate issue if you discover functional deficits stemming from our use of
urlencode().

Original comment by r.wetzlmayr on 7 May 2010 at 3:19