getgrav / grav

Modern, Crazy Fast, Ridiculously Easy and Amazingly Powerful Flat-File CMS powered by PHP, Markdown, Twig, and Symfony
https://getgrav.org
MIT License

Blog slugs with numbers are unstable #443

Closed mirkoschubert closed 8 years ago

mirkoschubert commented 8 years ago

If I use slug: this-is-a-title.html in an item.md (e.g. for a blog post), it works perfectly fine. But if I use slugs with numbers, e.g. slug: 10-this-is-a-title.html, it sometimes works, but sometimes the slug of the blog post in Grav is shown without the number or, even worse, the whole blog post gets a 404 error. I use Grav 1.0.0-RC4 without the Admin plugin and with the Blog Skeleton (Antimatter).

rhukster commented 8 years ago

So I just did a quick test of this. I added this in my page header:

slug: '2-new-slug-alias'

And I was able to reach the page via:

http://localhost/grav-demo-sampler/test/assets/2-new-slug-alias

Granted this is a bit deep in there, but the numerical slug alias did work.

Also no problem with:

http://localhost/grav-demo-sampler/test/assets/2-new-slug-alias.html

Can you check the content of your page file, by clicking on Expert mode or by looking directly at the .md file to see what the contents of the page header looks like?

rhukster commented 8 years ago

BTW, you must not include .html in the alias. My guess is that this is the root of your problems.

rhukster commented 8 years ago

The page extension is handled internally by Grav and defaults to .html, so you will reach the same page whether you access /some-page or /some-page.html.

mirkoschubert commented 8 years ago

I have it for test purposes in user/pages/02.blog/sunshine-in-the-hills/item.md. If I use slug: sunshine-in-the-hills.html it works:

http://mirkoschubert.local/blog/sunshine-in-the-hills.html

With slug: 1-sunshine-in-the-hills.html I get a 404 error. So .html works, but not with numbers. .html is mandatory for me, especially for search engines (best practice: ID-some-keywords.html) ;)

rhukster commented 8 years ago

The addition of .html is not valid in Grav aliases, routes, slugs, etc.

I'm very skeptical that your best practice structure is correct. Numeric IDs in URLs are not semantic, and could only hinder your search results. In fact, I know CMSs such as Joomla that have IDs in their URLs are actively moving away from this to allow 100% text-based URLs, as these are considered better for SEO.

Also in regards to your .html suffix. How can this enhance SEO? Maybe 10 years ago this was useful, but search engines are smart enough to look at headers to determine the format of the page requested. How many major sites use .html suffixes?

https://moz.com/blog/11-best-practices-for-urls

In fact if you read here (from 2006), they say keep it simple. And specifically use descriptives and not numerics - (#4).

rhukster commented 8 years ago

Oh I will throw this out too: Do a google/yahoo search for "el capitan apache php"

Notice anything interesting?

http://getgrav.org/blog/mac-os-x-apache-setup-multiple-php-versions

This page shows up in the top few search results. This is a pretty generic search query and not specific to our primary content at all, but the quality of content of this post plus Grav's SEO friendly routing and general features is ensuring this page shows up in the top few results.

Long story short, I wouldn't obsess over this ID-keywords.html page naming. You could easily set up a single simple regex route that maps your old URLs to a simpler URL routing scheme. No fuss no muss! ;)
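To make the single-regex idea concrete, here is a minimal Python sketch. It is not Grav configuration syntax, only an illustration of the mapping. The URL pattern and the name `legacy_to_route` are assumptions made up for this example.

```python
import re

# Hypothetical legacy pattern: <any prefix>/<numeric id>-<keywords>.html
LEGACY_URL = re.compile(r"^(?P<prefix>.*/)\d+-(?P<keywords>[^/]+)\.html$")


def legacy_to_route(path: str) -> str:
    """Map an old ID-keywords.html path to a plain route; leave other paths alone."""
    match = LEGACY_URL.match(path)
    if match is None:
        return path
    return match.group("prefix") + match.group("keywords")


print(legacy_to_route("/blog/1-sunshine-in-the-hills.html"))
# -> /blog/sunshine-in-the-hills
```

Paths that do not match the legacy pattern pass through unchanged, so one rule of this kind could sit in front of the normal routing.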

hwmaier commented 8 years ago

I totally agree with rhukster's statement here.

But just to clarify technically whether it works or not, try using the default route setting instead of the slug setting:

routes:
  default: 1-sunshine-in-the-hills.html

I am curious whether that works for you. I used the default route concept to port a legacy site across to Grav while maintaining its old-fashioned .html URLs, so we don't disturb existing page rankings.

mirkoschubert commented 8 years ago

@rhukster Here in Germany many major online magazines with 10 to 200 Million visits per month use this structure. Examples?

They rank exceptionally well both in Google organic search and Google News. As a journalist I worked for one of them and I learned it from them. My blog (currently on WordPress) has used these slugs for many years. And last, but not least, I spoke with an SEO professional about my blog a few days ago, and he pointed out that my URL structure is perfect :)

I want to leave WordPress (the security drives me nuts), so I found Grav. But those slugs are mandatory (despite some other things).

The .html part works when there's no ID, and the ID works when there's no .html. But they obviously don't go together.

@hwmaier

Do you mean in the item.md or in the settings.yaml? I've tested the first one, and no, it doesn't work. In fact, it is worse: with the slug setting, at least the ID or the .html works on its own. With the default route concept, neither works.

hwmaier commented 8 years ago

I don't think a CMS should mandate a particular URL style. There may be recommendations and best practices, which @rhukster outlined above, but a discussion around this can be highly opinionated. So I recommend leaving it out of the discussion and focusing on the technical capabilities of Grav and whether Grav gives you the freedom to implement your style or not.

Btw, the default route only works on page level of course as it is a page specific setting.

I changed the header of one of my pages to this (if you use the default route, don't use the slug):

---
routes:
  default: /123-contact.html
title: Contact Us
---

# Contact Us

and Grav beautifully resolves a request to http://grav.local/123-contact.html

So no issues here at all, your URL style works with Grav (tested with version RC4).

mirkoschubert commented 8 years ago

@hwmaier I don't know what I'm doing wrong... I have in user/pages/02.blog/sunshine-in-the-hills/item.md:

---
title: Sunshine in the Hills
routes:
    default: /1-sunshine-in-the-hills.html
date: 14:55 07/11/2014
author: Tasha Maxwell
taxonomy:
    category: nature
    tag: [journal, photography]
---

I flushed the cache and clicked on the post in the blog. The URL is perfect, but still: 404 :-|

hwmaier commented 8 years ago

That looks correct. Do you type the URL in your browser window or do you use a page link on another page? Try entering it directly into the address bar.

rhukster commented 8 years ago

@mirkoschubert your page header works for me if I remove the .html. With the .html, however, I get an out of memory error. I think this might be due to an infinite loop because these routes are not expected to have .html suffixes. However, without the .html in the default route config, I can still reach the page via .html.

mirkoschubert commented 8 years ago

@hwmaier I tried both and neither works ;) @rhukster Then try this:

---
title: Sunshine in the Hills
slug: sunshine-in-the-hills.html
---

It works perfectly for me. Doesn't it disprove that the .html is the only problem? ;)

rhukster commented 8 years ago

I think it could be possible to have a system setting to enable/disable the addition of the extension (eg .html) on all page urls.

This would resolve your problem because it would add .html to all the links.

And nope, i get an out of memory error with that as the slug also. Could be that my test site just has more plugins, more custom routes, more pages, etc. I have a lot of stuff in this test setup.

The fact that it works is a bit of an accident anyway. My solution above would be a better solution and would work with everything.

hwmaier commented 8 years ago

Strange this is. Now we have 3 different results:

hwmaier: routes.default with .html and number works
rhukster: routes.default with .html enters infinite loop
mirkoschubert: routes.default with numbers cannot resolve

Could this be a difference in Grav versions used?

@rhukster: An extension in routes.default should work, as only then can you generate links and references with the wanted extension. As mentioned before, I strongly lobby to keep the support for extensions in Grav (which btw currently works) in order to support site migrations.

hwmaier commented 8 years ago

@rhukster: I don't think we need a system setting for this; as I said, it currently works here. A page should be able to define its own default route, and I even have sites which have a mix of extensions: some .html, some .php and some with a trailing slash. Currently Grav copes with all of this if routes.default is used. So a system-wide setting would not work for this either.

rhukster commented 8 years ago

I think the problem I'm having is purely based on the size and complexity of my test site, and the fact that .html in the route/slug is failing and falling back into the page-not-found logic. This is incorrect but 'works' for you in your simpler setups. My out of memory stuff happens in the logic that runs after the page is not found and is looking for site-wide regex matches.

rhukster commented 8 years ago

No, like i said, it works by accident and is not reliable. It's also going to break in other places where routes are matched (probably multilang too).

rhukster commented 8 years ago

That said, could be site-wide with page override like we have used many times before. That way you can have:

Basically this covers all the bases with the minimal amount of fuss.

rhukster commented 8 years ago

However, before I look at that I'll check to see what's causing my memory issue, and maybe it's fixable. Relying only on per-page settings would mean you have to set a default route/slug for EVERY page individually, with no option to have it automatic for every page on your site (like Joomla/WordPress do).

mirkoschubert commented 8 years ago

@hwmaier: I use Grav 1.0.0 RC4 without the Admin Plugin and the Grav Blog Skeleton/ Antimatter with all their plugins and an own (inherited) theme with the changes.

@rhukster: You're right, without the .html the page works without a 404, even if I append the .html to the URL. So I could add .html to the page.url in the links in the blog template, so that the Google bot can follow them. But I would be glad to have a site-wide option to disable the automatic .html, so I can be more flexible :)

hwmaier commented 8 years ago

@mirkoschubert Did you test the URL by entering it into the browser address line? Maybe we are confusing two things here. I am talking about the ability to resolve URLs, not the ability to generate links, and maybe the issue is the latter, not the former.

hwmaier commented 8 years ago

@rhukster I have a legacy site where I have hardcoded every page with its default route. This is no issue for me as this is a special case, a legacy site so to speak. For that site I am completely bypassing the slug mechanism.

I actually even like this approach as I can keep URLs with the content and it does not depend on file/directory structure.

rhukster commented 8 years ago

Like i said, putting .html in the default route is working by accident :) We can do better.

mirkoschubert commented 8 years ago

@hwmaier I did both: I entered it in the browser address line and I followed the links on the blog summary - same result.

rhukster commented 8 years ago

I actually get a 404 like @mirkoschubert when i put .html in the slug or default route. So not sure how that even works for you @hwmaier. Doesn't work as it really shouldn't :)

hwmaier commented 8 years ago

Ah, you just reminded me that I had to set pages.types to [] in system.yaml for this to work.

pages:
   types: []

mirkoschubert commented 8 years ago

As I mentioned, I don't get a 404 if I use the slug with only .html (no ID), or if I use the default route with only the ID (no .html) ;)

rhukster commented 8 years ago

I don't think the digit ID part has anything to do with things. Works for me as expected if I use a numeric ID or not. There's no logic to discern that from a regular string.

mirkoschubert commented 8 years ago

@hwmaier Tested. Works for me as well :)

hwmaier commented 8 years ago

Guys, the pages.types is the difference. Once set to empty Grav honours the extension, otherwise it discards it for route matching.

rhukster commented 8 years ago

I know it works, but your approach could potentially break things. By removing the valid page types, you're basically just skipping the step where Grav strips the extension from the route. This route is used throughout Grav, and could potentially break plugins and multilang where it's expected to be a route and not include the file extension.

What i'm proposing would not change that 'workaround' you have, but would let you do this properly.
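A toy model of the behaviour discussed here may help. It is a Python sketch under assumptions, not Grav's actual code: the function name is made up, and the list of page types is an assumed stand-in for whatever pages.types holds in system.yaml.

```python
from typing import Sequence

# Assumed stand-in for the pages.types list in system.yaml.
ASSUMED_TYPES = ["txt", "xml", "html", "htm", "json", "rss", "atom"]


def route_for_matching(path: str, types: Sequence[str]) -> str:
    """Strip a trailing .<type> from the request path if <type> is a known page type."""
    for ext in types:
        suffix = "." + ext
        if path.endswith(suffix):
            return path[: -len(suffix)]
    return path


# With known types, the extension is dropped before route matching.
print(route_for_matching("/1-sunshine-in-the-hills.html", ASSUMED_TYPES))
# -> /1-sunshine-in-the-hills

# With an empty list (the workaround), the extension stays in the route.
print(route_for_matching("/1-sunshine-in-the-hills.html", []))
# -> /1-sunshine-in-the-hills.html
```

In this model, a default route of /1-sunshine-in-the-hills.html can only match the request when nothing is stripped, which is consistent with the differing results reported above.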

mirkoschubert commented 8 years ago

If it can be done properly, I would take this willingly :)

hwmaier commented 8 years ago

@rhukster Official support is most welcome, and it helps "site migrants" tremendously. I am happy to assist constructively and with coding if needed as I have done in the past.

rhukster commented 8 years ago

commit e96445abe343e914c85bb0226d9076846c171c2f should address this.

Basically there is now a setting in system.yaml:

pages:
  append_url_extension: '' 

By default it's an empty string, but if you provide an extension here, e.g. .html, it will append this to all page URLs. Or you can leave it empty and provide the same value in the page frontmatter, e.g.:

---
title: My Page
append_url_extension: '.html'
---

And it will append that for only this specific page. The .html portion of the URL is stripped as normal in Grav and routes are matched without it. So in this case, either set the slug, or the default route to /1-sunshine-in-the-hills or whatever. The URL for this page will be output as:

/1-sunshine-in-the-hills.html

You can also provide a custom extension such as:

append_url_extension: '.foo'

But you need to add foo to the list of valid page types in system.yaml, and also provide an appropriate template in your theme called: default.foo.twig or whatever the page template is.
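The site-wide setting with per-page override can be pictured roughly as follows. This is a Python sketch of the precedence as described in this comment, not the code from the commit. The function name `page_url` and the plain dicts standing in for the page header and the pages section of system.yaml are assumptions.

```python
def page_url(route: str, page_header: dict, system_pages: dict) -> str:
    """Build the outgoing URL: a page-level value wins, else the site-wide one."""
    extension = page_header.get(
        "append_url_extension",
        system_pages.get("append_url_extension", ""),
    )
    # The route itself never carries the extension; it is only appended on output.
    return route + extension


print(page_url(
    "/1-sunshine-in-the-hills",
    {"append_url_extension": ".html"},   # page frontmatter
    {"append_url_extension": ""},        # system.yaml default
))
# -> /1-sunshine-in-the-hills.html
```

Keeping the extension out of the route and adding it only when the URL is generated is what lets route matching stay unchanged.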

hwmaier commented 8 years ago

Thanks for adding this so quickly, this is a welcome addition.

1) append_url_extension is quite long. Could we make it a bit shorter, maybe just extension or url_extension for example? In particular when it needs to be added on page level.

2) Why would the template have the extension to be added? If I have a file default.md with append_url_extension: .foo, the template has to be called default.foo.twig, not default.twig? Can't we just have the Grav standard apply, where the template name is the markdown file name less the .md extension plus the .twig extension?

hwmaier commented 8 years ago

In addition to the page-level append_url_extension setting, the extension setting could also be implicitly set if a slug is specified with an extension. So instead of

slug: sunshine-in-the-hills
append_url_extension: .html

this could be expressed simpler as:

slug: sunshine-in-the-hills.html

The latter could also work for a directory name sunshine-in-the-hills.html/default.md

This should be easy to do, if a slug with an extension is detected it is removed and set as append_url_extension value.
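A minimal Python sketch of this proposal, to show what "detect and remove" could look like. It illustrates the suggestion only and is not something Grav does. The function name and the tuple of recognised extensions are assumptions.

```python
import posixpath


def split_slug(slug: str, known_exts=(".html", ".htm")) -> dict:
    """Split a trailing known extension off a slug and report it separately."""
    stem, ext = posixpath.splitext(slug)
    if ext in known_exts:
        return {"slug": stem, "append_url_extension": ext}
    # Unknown or missing extension: leave the slug untouched.
    return {"slug": slug}


print(split_slug("sunshine-in-the-hills.html"))
# -> {'slug': 'sunshine-in-the-hills', 'append_url_extension': '.html'}
```

Checking against a list of recognised extensions matters here: a slug such as v1.2-release contains a dot but no extension, and a naive split would cut it in the wrong place.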

rhukster commented 8 years ago

The approach I pursued was based on minimal impact to nearly stable Grav balanced with maximum flexibility and functionality to achieve the goals needed, combined with as much ease of use as possible...

I appreciate your ideas, but there are very good reasons why each of them has issues:

1) There is already an extension on the page; in most cases this is .md. So it needs to be clear that it's the URL we're talking about. Also I feel that it needs to be clear what this is doing, and appending to the URL is the key to making this obvious. So I feel append_url_extension is as short as possible while still maintaining readability.

2) Grav is a flexible system that lets you choose a specific Twig template based on the extension you request. For example, requesting /something.json will use the something.json.twig template. This is how Grav works and how it is able to easily support JSON, XML, RSS, Atom, and whatever output format you wish. This change doesn't affect that, it simply uses that. Your proposed change to simplify would break every theme, lots of plugins, and lose a lot of core functionality.

3) Like I said, slugs are not supposed to have file extensions on them. Everything related to routes needs to be a proper Grav route, i.e. start with / and not end with a file extension. Having some places where you can use file extensions (because Grav does magic with them) and some places where you can't will only lead to user confusion. Better to be consistent.
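The template lookup from point 2 can be sketched in a few lines of Python. This models only the naming rule stated above and is not Grav's implementation. The function name is made up, and the fallback to html when no extension is requested is an assumption based on the earlier remark that the page extension defaults to .html.

```python
import posixpath


def template_for_request(path: str, page_template: str) -> str:
    """Pick the Twig file name from the page template and the requested extension."""
    ext = posixpath.splitext(path)[1].lstrip(".") or "html"  # assumed default
    return f"{page_template}.{ext}.twig"


print(template_for_request("/something.json", "something"))
# -> something.json.twig
print(template_for_request("/1-sunshine-in-the-hills.foo", "default"))
# -> default.foo.twig
```

Under this rule, a custom extension such as .foo needs its own default.foo.twig, which is why the extension shows up in the template name.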

Lastly, I feel like this is a pretty niche thing. The majority of people using Grav would never use this. Those that are moving from other platforms could simply use .htaccess approaches, or even a Grav redirect. What we have left are those people that really want to maintain their URL structures 100%. Even then, most people will be satisfied with the system setting option to change it globally. What we are left with is a very tiny percentage of people who are going to want to change this per page. For those few, copying and pasting one line of extra page header is not too much to ask, right?

We'll just make sure to clearly document it.