TryGhost / Ghost

Independent technology for modern publishing, memberships, subscriptions and newsletters.
https://ghost.org
MIT License

bug caused by unidecode's bug #1986

Closed xuduo35 closed 10 years ago

xuduo35 commented 10 years ago

I posted a new article with the title "第一篇". The article link became 'http://127.0.0.1:2368/Di%20[/?]%20Pian%20/', and a 404 error occurred. After some checking, I think it's caused by the unidecode module. Test code (in index.js):


var unidecode = require('unidecode');
console.log("unidecode(第一篇) = " + unidecode("第一篇") + "\n");

Result:


E:\node_js\Ghost>npm start

> ghost@0.4.0 start E:\node_js\Ghost
> node index

unidecode(第一篇) = Di [?] Pian

It seems unidecode cannot transliterate the Chinese character "一" correctly.

halfdan commented 10 years ago

Hi @xuduo35, this is indeed an issue with unidecode.

The character is U+4E00, which is undefined in the unidecode data files: https://github.com/FGRibreau/node-unidecode/blob/master/data/x4e.js
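As a rough illustration of why U+4E00 points at x4e.js, here is a minimal sketch. It assumes the data files are keyed by the high byte of the code point, with the low byte as the index into a 256-entry array, which is the layout the filename suggests. The helper name `locateInUnidecodeData` is hypothetical and not part of unidecode, and the sketch only covers characters in the Basic Multilingual Plane.

```javascript
// Hypothetical helper (not part of unidecode): split a character's code point
// into the data file that would hold it and its position within that file.
// Assumes one file per high byte, 256 entries per file.
function locateInUnidecodeData(ch) {
  const cp = ch.codePointAt(0);
  const high = cp >> 8;   // selects the file, e.g. 0x4e -> "x4e.js"
  const low = cp & 0xff;  // index within that file's 256 entries
  return {
    codePoint: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
    file: 'x' + high.toString(16).padStart(2, '0') + '.js',
    index: low
  };
}

console.log(locateInUnidecodeData('一'));
// prints { codePoint: 'U+4E00', file: 'x4e.js', index: 0 }
```

Under that assumption, 一 would be the very first entry of x4e.js.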

I looked up the translation of what you wrote on Google Translate and it transliterated the text to: Dì yī piān.

Would you agree that "yi" could serve as transliteration for the 一 character?

voronoipotato commented 10 years ago

一 is yi; you can double-check by using a Chinese keyboard and typing the pinyin.

ErisDS commented 10 years ago

一 is yi, can just about remember this from my Mandarin lessons

Surely we're going to run into this problem for all the un-transliterated characters in https://github.com/halfdan/node-unidecode/blob/d905ec9f27b597ffeb446ff2dfdc75200eeeccba/data/x4e.js?

Perhaps @xuduo35 or @wangsai, or one of our other Chinese-speaking contributors, could look at transliterating the missing characters? What's the easiest way to get a list of the characters which are currently transliterated as [?]?
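One possible way to build such a list is to run every code point in a range through the transliterator and collect the ones that come back as the [?] placeholder. This is only a sketch: `findUntransliterated` and `fakeUnidecode` are hypothetical names, and the stub stands in for the real module so the snippet runs on its own. In practice the function exported by `require('unidecode')` would presumably be passed in instead.

```javascript
// Hypothetical helper: list code points in [start, end] that the given
// transliterate function maps to the "[?]" placeholder.
function findUntransliterated(transliterate, start, end) {
  const missing = [];
  for (let cp = start; cp <= end; cp++) {
    const ch = String.fromCodePoint(cp);
    // trim() because the placeholder may come back with a trailing space
    if (transliterate(ch).trim() === '[?]') {
      missing.push({
        codePoint: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
        char: ch
      });
    }
  }
  return missing;
}

// Stand-in for unidecode (an assumption for this sketch): it only knows two
// characters and returns "[?] " for every other non-ASCII character.
function fakeUnidecode(str) {
  const table = { '第': 'Di ', '篇': 'Pian ' };
  return Array.from(str).map(function (c) {
    if (c in table) { return table[c]; }
    return c.codePointAt(0) < 128 ? c : '[?] ';
  }).join('');
}

// Scan the block covered by data/x4e.js (U+4E00 to U+4EFF)
const missing = findUntransliterated(fakeUnidecode, 0x4E00, 0x4EFF);
console.log(missing.length + ' characters without a transliteration');
```

With the real module, the resulting list could be handed to a native speaker to fill in where a reading exists.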

xuduo35 commented 10 years ago

I checked "第一篇" with the Perl module Text::Unidecode, and it also outputs 'Di [?] Pian'. The character "一" is a special case, since it can be pronounced. But there are other Chinese characters that even I don't know how to pronounce. I think it's okay for unidecode to use [?] for characters of this type, but this produces bad URLs. So I think we can just replace [?] with another character after decoding, like '-' or '_'. It's more important to get the URL right. @ErisDS, what's your opinion?

Just add one line to replace [?] with '-'.

diff --git a/core/server/models/base.js b/core/server/models/base.js
index e03a164..f03d9d9 100644
--- a/core/server/models/base.js
+++ b/core/server/models/base.js
@@ -226,6 +226,7 @@ ghostBookshelf.Model = ghostBookshelf.Model.extend({
         slug = slug.charAt(slug.length - 1) === '-' ? slug.substr(0, slug.lengt
         // Remove non ascii characters
         slug = unidecode(slug);
+        slug = slug.replace(/\[\?\]/g, "-");
         // Check the filtered slug doesn't match any of the reserved keywords
         slug = /^(ghost|ghost\-admin|admin|wp\-admin|wp\-login|dashboard|logout
             .test(slug) ? slug + '-post' : slug;
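
A standalone sketch of the same idea, for trying it outside Ghost: replace every [?] placeholder, then collapse the leftover spaces and dashes so the slug stays URL-safe. The function name `cleanSlug` is hypothetical, and this is not the code in core/server/models/base.js, only an illustration of the approach.

```javascript
// Hypothetical helper: turn already-transliterated text into a URL-safe slug
// fragment by dropping "[?]" placeholders and normalising separators.
function cleanSlug(transliterated) {
  return transliterated
    .replace(/\[\?\]/g, '-')   // every placeholder, not only the first
    .replace(/\s+/g, '-')      // whitespace runs become dashes
    .replace(/-+/g, '-')       // collapse repeated dashes
    .replace(/^-|-$/g, '');    // trim a leading or trailing dash
}

console.log(cleanSlug('Di [?] Pian '));
// prints Di-Pian
```

The global flag matters for titles with more than one untransliterated character, since a replace without it only touches the first match.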

xuduo35 commented 10 years ago

I mean, there are some Unicode characters which are not complete words. They are only used to construct other characters in our language (I think the same situation exists in Japanese and Korean). They cannot be pronounced, so there is no transliteration for them. Check for these with the command 'find node_modules/unidecode/data|xargs grep '[?]' '; there are too many. I don't think there is any way to fill them all in.