apostrophecms-legacy / apostrophe-pages

apostrophe-pages adds page rendering to the Apostrophe content management system. The apostrophe-pages module makes it easy to serve pages, fetching the requested page and making its content areas available to your page templates along with any other attributes of interest.
MIT License

UTF8 slugs? #6

Open etodanik opened 10 years ago

etodanik commented 10 years ago

Right now if the title is 100% non-latin characters, no slug is created. Also, no slug editing is permitted upon creation of a page.

I'm not sure this is the best approach, since the resulting slug is "none", and for each and every single page, an end-user would have to go back to "Page Settings" and add a slug.

I see two solutions:

1) Add a slug field that is auto-populated from the title and validated to be non-empty
2) Enable non-Latin characters in the automatically created slug

and a third:

3) Both, with some sort of setting allowing or disallowing UTF8 in slugs globally (I'm a fan of this one ;))

Internationally, it's more and more common to see slugs in non-Latin characters.

boutell commented 10 years ago

You're right. We would like to accept a pull request to fix apos.slugify so that it does something rational with non-latin characters. The trouble is that JavaScript regular expressions are just so terrible. \w doesn't upgrade smoothly to deal with non-latin character sets, for instance.

It looks like xregexp is the answer:

http://xregexp.com/plugins/#unicode

This would allow the use of unicode categories that offer a much broader concept of "letters" and "numbers," which should result in a reasonable slugifier for more languages.

Right now in search.js the slugify method does this:

    var r = "[^A-Za-z0-9" + RegExp.quote(options.allow) + "]";
    var regex = new RegExp(r, 'g');
    s = s.replace(regex, options.separator);

That clobbers all the characters that are not letters or digits and replaces them with the separator (-) and later we stomp consecutive separators.
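To make the failure mode concrete, here is a minimal sketch of that ASCII-only behavior. It is not the actual search.js code: `quoteRegExp` is a hypothetical stand-in for `RegExp.quote`, and the trimming of leading and trailing separators is an assumption about the later cleanup step.

```javascript
// Hypothetical stand-in for the RegExp.quote helper used in the snippet above:
// escapes characters that are special inside a regular expression.
function quoteRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\\-]/g, '\\$&');
}

// Sketch of the ASCII-only approach; option names mirror the snippet above.
function slugifyAscii(s, options = {}) {
  const allow = options.allow || '';
  const separator = options.separator || '-';
  const sep = quoteRegExp(separator);
  // Replace everything that is not an ASCII letter, digit, or allowed character.
  const unwanted = new RegExp('[^A-Za-z0-9' + quoteRegExp(allow) + ']', 'g');
  let slug = s.replace(unwanted, separator);
  // Collapse runs of separators, then trim them from both ends (assumed cleanup).
  slug = slug.replace(new RegExp('(?:' + sep + ')+', 'g'), separator);
  slug = slug.replace(new RegExp('^(?:' + sep + ')|(?:' + sep + ')$', 'g'), '');
  return slug.toLowerCase();
}

console.log(slugifyAscii('Hello, World!'));      // hello-world
console.log(slugifyAscii('Привет мир') === '');  // true: nothing survives
```

With a title made up entirely of non-Latin characters, every character is replaced by the separator, so nothing is left once the separators are collapsed and trimmed.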

This is a good approach; we just need to use xregexp rather than vanilla regular expressions, with Unicode categories in place of "A-Za-z0-9".

We also need to convert to lowercase in a more language-independent way (toLowerCase isn't very smart).

It might be best to exclude unwanted things specifically (like the "punctuation" unicode category) rather than allowing only a small number of things, since Chinese ideograms might be valid in slugs and they are not letters or numbers...
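A minimal sketch of the allow-list variant, under one stated substitution: it uses the native Unicode property escapes that modern JavaScript engines support with the `u` flag, rather than the xregexp plugin proposed above. The function name and the decision to keep combining marks (`\p{M}`) are assumptions made for illustration. Note that the Unicode standard classes CJK ideographs as "Letter, other" (`Lo`), so `\p{L}` does keep them.

```javascript
// Allow-list sketch: keep Unicode letters (\p{L}), combining marks (\p{M})
// and numbers (\p{N}); everything else becomes a separator.
function slugifyUnicode(s, separator = '-') {
  return s
    .normalize('NFC')
    // A run of unwanted characters collapses into a single separator.
    .replace(/[^\p{L}\p{M}\p{N}]+/gu, separator)
    // Drop empty pieces so no separator is left at either end.
    .split(separator).filter(Boolean).join(separator)
    .toLowerCase();
}

console.log(slugifyUnicode('Hello, World!'));  // hello-world
console.log(slugifyUnicode('Привет, мир!'));   // привет-мир
console.log(slugifyUnicode('你好 世界'));        // 你好-世界
```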

So you see it's a bit complicated.

(In Apostrophe 1.5, built in PHP, we got away with using the Unicode upper and lower case letter and number categories and no one complained about the Chinese thing. But that doesn't mean we shouldn't do better if we can.)

I hope this helps you down the road to patching apos.slugify. If not, it'll help me eventually. (:


Tom Boutell Lead Developer P'unk Avenue 215 755 1330 punkave.com window.punkave.com

boutell commented 10 years ago

Excuse me, I was wrong about this point: "toLowerCase" DOES understand Unicode. That's a relief. We only have to fix our removal of unwanted characters to be more Unicode-aware.
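A quick check of that correction, as a small sketch (the sample strings are arbitrary). `toLowerCase` applies the default Unicode case mappings; locale-specific rules, such as the Turkish dotless i, are the job of `toLocaleLowerCase` and are not shown here.

```javascript
// Uppercase samples in Cyrillic, accented Latin, and Greek.
const samples = ['ПРИВЕТ', 'ÀÉÎ', 'ΑΘΗΝΑ'];

// toLowerCase handles all three without any extra library.
const lowered = samples.map((s) => s.toLowerCase());

console.log(lowered.join(' ')); // привет àéî αθηνα
```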


etodanik commented 10 years ago

Well, how does this look: http://codepen.io/anon/pen/CEbqy

etodanik commented 10 years ago

Also, what's the reason for not displaying a slug upon creation of a page? Is it technical, a UX decision, or a bug?

boutell commented 10 years ago

Not bad! It lets dollar signs through, I'm not sure why. That might not be a safe character in every filesystem. It's definitely a pain when it shows up in Unix just because you have to escape carefully to touch it from the command line...

It would also let through some other garbage like nulls and tabs and newlines and carriage returns which don't seem to be in the Unicode "space" category. We should dump everything between \x00 and \x20.
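The pen's regular expression is not reproduced in this thread, so the following is only a plausible explanation, assuming it filtered on the punctuation and space categories. In Unicode, `$` is a currency symbol (category `Sc`, part of "Symbol"), not punctuation, and tab, newline, carriage return and null are "Other, Control" (`Cc`), not separators. The helper below is hypothetical and uses native property escapes to show where each character lands.

```javascript
// Report which of a few Unicode general categories a single character is in.
function categoryOf(ch) {
  const categories = {
    'Letter': /^\p{L}$/u,
    'Number': /^\p{N}$/u,
    'Punctuation': /^\p{P}$/u,
    'Symbol': /^\p{S}$/u,
    'Separator': /^\p{Z}$/u,
    'Other, Control': /^\p{Cc}$/u,
  };
  for (const [name, re] of Object.entries(categories)) {
    if (re.test(ch)) return name;
  }
  return 'not covered by this sketch';
}

console.log(categoryOf('$'));   // Symbol
console.log(categoryOf('!'));   // Punctuation
console.log(categoryOf(' '));   // Separator
console.log(categoryOf('\t'));  // Other, Control
console.log(categoryOf('\0'));  // Other, Control
```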

Our general philosophy is that Apostrophe should just do the right thing for you, including picking good slugs and making them unique enough (which it does automatically), and that power user features shouldn't be too "in your face" for a new user adding a page for the first time. But I don't know how strongly we really feel about slug editing for new pages in particular.


boutell commented 10 years ago

Actually, I think blocking the "Other, Control" category might be an easy way to dump all the scary space-like things, including everything from \x00 to \x20.
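A sketch of the exclusion approach with the control category blocked. This is not the code from the pen or the pull request: it uses native Unicode property escapes instead of xregexp, and the choice to also strip symbols (`\p{S}`, which covers the dollar sign) is an assumption. One detail: `Cc` covers U+0000 through U+001F and U+007F through U+009F, while U+0020 (space) belongs to the separator category, so both are listed.

```javascript
// Exclusion-list sketch: strip punctuation (\p{P}), symbols (\p{S}),
// separators such as spaces (\p{Z}) and control characters (\p{Cc});
// keep everything else, including non-Latin letters.
function slugifyByExclusion(s, separator = '-') {
  const unwanted = /[\p{P}\p{S}\p{Z}\p{Cc}]+/gu;
  return s
    // A run of unwanted characters collapses into a single separator.
    .replace(unwanted, separator)
    // Drop empty pieces so no separator is left at either end.
    .split(separator).filter(Boolean).join(separator)
    .toLowerCase();
}

console.log(slugifyByExclusion('Цена: $5\t(новая)\n')); // цена-5-новая
console.log(slugifyByExclusion('你好,世界'));             // 你好-世界
```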


etodanik commented 10 years ago

Alright, this looks better now: http://codepen.io/anon/pen/CEbqy

I'm gonna make a pull request from this.

etodanik commented 10 years ago

This implements it: https://github.com/punkave/apostrophe/pull/78