facebook / docusaurus

Easy to maintain open source documentation websites.
https://docusaurus.io
MIT License
55.61k stars 8.34k forks source link

Fix trailing slash issues (depend on hosting provider) #3372

Closed slorber closed 3 years ago

slorber commented 4 years ago

Edit: as part of the analysis to solve this issue, I'm writing a guide to describe factually the behavior of various static site generators and static hosting providers: https://github.com/slorber/trailing-slash-guide

💥 Proposal

Fix trailing host slash issues: write /myDoc.html instead of /myDoc/index.html

We have a few issues that are related to docs trailing slashes: #2654, #2394, #2134, #3351, #3380

I suspect these issues are related to the Docusaurus output, as for a doc at /docs/myDoc we write /myDoc/index.html instead of /myDoc.html

Always creating a subfolder is probably a signal for the hosting providers (Netlify, Github Pages...) to add a trailing slash automatically to URLs, which can be annoying.

It can break relative links as now the link resolution is different due to the trailing slash, and we end up with broken links in production, not catched at build time (see https://github.com/facebook/docusaurus/issues/3380)

This behavior is provided by a dependency: https://github.com/markdalgleish/static-site-generator-webpack-plugin/blob/master/index.js#L165

We already run on a fork of this dependency. I guess we could as well include a way to write /myDoc.html directly if a route path does not end with a trailing slash.

As this is a risky change, I suggest making this optional. We can make it the default behavior (breaking change) later if it works fine for early adopters.

See also https://github.com/facebook/docusaurus/issues/2394

slorber commented 4 years ago

Seems related to https://github.com/markdalgleish/static-site-generator-webpack-plugin/pull/53

lunelson commented 3 years ago

Always creating a subfolder is probably a signal for the hosting providers (Netlify, Github Pages...) to add a trailing slash automatically to URLs, which can be annoying.

@slorber fwiw this is fixable on Netlify by changing your post-processing settings. They have a UI bug, where the master checkbox to disable post-processing does not disable the pretty URL setting, and that is what causes e.g. https://v2.docusaurus.io/docs/2.0.0-alpha.63/installation to 301 redirect to https://v2.docusaurus.io/docs/2.0.0-alpha.63/installation/. You have to just turn them all off individually.

image

slorber commented 3 years ago

Thanks @lunelson , didn't know about this bug, do you have a link to track this?

Also, I tested these options on my own website, but still have a trailing slash issue 😅 https://sebastienlorber.com/records-and-tuples-for-react

lunelson commented 3 years ago

@slorber If you changed the options and tested your site right away, it might not have worked: the changes seem to take a few minutes to propagate. For me the URL above now works without redirecting to a trailing slash.

And no I don't know of an issue that tracks it at Netlify but I've seen it come up in discussions here and on Gatsby about trailing slash problems.

lunelson commented 3 years ago

@slorber Having pointed that out however, I must say that in my company's websites (including 3 Docusaurus v2 sites and potentially soon more), we will probably turn this Netlify feature back on.

This is because we have had the strong recommendation from an SEO agency that—while neither format (trailing-slash or not) is "better"—there should be 301 redirects to make sure that only one of the two formats is getting crawled and indexed. And because Netlify treats these URL formats identically, and its redirects system therefore doesn't allow you to 301 from one to the other (it would create an infinite loop), our only alternative is to use this pretty URLs feature (which does create a 301) and to accept the decision which is made for us, namely that trailing-slash will be the final format.

This creates a challenge for me right now because currently Docusaurus generates all URL patterns in router/anchor links, sitemap and canonical link tags without trailing slashes 😬

mpsq commented 3 years ago

The sitemap can be generated with trailing slashes, that's what the option trailingSlash is for: https://v2.docusaurus.io/docs/using-plugins#docusaurusplugin-sitemap.

lunelson commented 3 years ago

@mpsq well, that's one down at least 😅

mpsq commented 3 years ago

Yes :) On your list, "URL patterns in router/anchor links" and "canonical link tags" are actually the same thing, they depend on how the permalink prop is generated. All you need is to enable/allow that prop to have a trailing slash.

EDIT: Maybe this could be a more general setting, rather than a granular one specific to the sitemap plugin.

lunelson commented 3 years ago

Maybe this could be a more general setting, rather than a granular one specific to the sitemap plugin.

YES. This would be ideal.

@mpsq is there any chance you could give me an idea of where I need to intervene in my custom theme, to modify that permalink property in a way that will affect the router, and generated anchor and link tags all together?

mpsq commented 3 years ago

You could do it in your theme by tweaking DocItem / Layout and replace occurrences of permalink with permalink + "/". This is super hacky though, a configuration property in core Docusaurus would be much nicer.

lunelson commented 3 years ago

a configuration property in core Docusaurus would be much nicer

Again: yes

slorber commented 3 years ago

Thanks, didn't notice but it seems my site does not have trailing slashes anymore :)

But anyway we should figure out a portable solution that works even on GithubPages where you don't have many hosting options to tweak... still believe that my initial proposal might be a better solution, need to find time to test this on multiple hosts.

For workarounds to override the permalink, you may be interested by this issue: https://github.com/facebook/docusaurus/issues/3501#issuecomment-702263920

Not sure but something like that could work to customize the permalink:

import React from 'react';
import OriginalLayout from '@theme-original/Layout';
import Head from '@docusaurus/Head';
import {useLocation} from '@docusaurus/router';

export default function Layout(props) {
  const location = useLocation();
  return (
    <>
      <OriginalLayout {...props} />
      <Head>
        <meta
          property="canonical_url"
          content={location.pathname + "/"}
        />
      </Head>

    </>
  );
}
lunelson commented 3 years ago

@slorber thanks, I'll take a look at that too. One thing about the router and this URL format issue that I haven't tested yet, is whether active link class matching will still work if the router paths are also forced to have trailing-slashes...

slorber commented 3 years ago

Notes:

We should probably add 2 new options:

We can't add a trailing / by default, not everybody is willing to have / at the end of the URLs (legacy/SEO reasons)

Another idea is to resolve all relative links at build time to absolute links, so that the presence of a trailing slash or not does not impact in any way the link target.

On Netlify, disabling the Pretty Url option prevent Netlify from adding the trailing slash, yet if user visits the page with a trailing slash, it is not removed client-side, and still potentially breaks relative links.

slorber commented 3 years ago

Until I figure how to integrate this properly in Docusaurus, here's a PR on the ReactNative website with some ideas to fix trailing slashes issues in userland: https://github.com/facebook/react-native-website/pull/2297

lunelson commented 3 years ago

I've been solving this issue mainly by conforming URLs to the desired format using a function within the Link component—since all internal links have to pass through here at some point; one complexity with this is that if you are using the baseUrl option you have to consider root-based URLs that are outside of the your sub-directory to be external, i.e. static anchors not router-links...

ukutaht commented 3 years ago

Having an option to generate /doc.html instead of /doc/index.html would be ideal for us

Morriz commented 3 years ago

Another idea is to resolve all relative links at build time to absolute links, so that the presence of a trailing slash or not does not impact in any way the link target.

This is probably the easiest, safest, and with least consequences. It is probably also the fastest! Any downsides here? It might involve replacing domain with localhost:$PORT for npm run serve but that's fine imo.

slorber commented 3 years ago

FYI this problem is related to one of our dependency (not very well maintained, we already use a fork of it) that hardcodes this /folder/index.html pattern.

https://github.com/markdalgleish/static-site-generator-webpack-plugin/blob/master/index.js#L165

I'm not 100% (as it requires checking the behavior of all the most popular static hosting providers), but maybe we could just have a post-build step that would copy /folder/index.html to /folder.html, so that both files exist in the end.

That's basically what I did on ReactNative website and it worked great (on Github Pages) without significant build time increase: https://github.com/facebook/react-native-website/pull/2297

Please try to add this site plugin and let me know if it works for your case.

I'll probably make it a separate npm package, as this tool may be useful for other SSGs too

MarkFarmiloe commented 3 years ago

In docusaurus/packages/docusaurus/src/client/normalizeLocation.ts any trailing /index.html is knocked off the path. If you knock off any trailing / as well with pathname = pathname.replace(/\/$/, ''); on line 23, this seems to fix this bug.

But I don't know whether this would break anything else. But it does seem to me that if you want to normalize, then allowing both /foo/bar/ and /foo/bar doesn't seem to be a very good job of normalizing to me.

colriot commented 3 years ago

Hi @slorber ! Are there any updates on this issue? We started to hit that on Litho website as well.

jknoxville commented 3 years ago

Just to provide more context, every time someone adds a relative link of the form [label](some-page) the resulting link is broken on the site (if you load the source page by URL - not by sidebar navigation) because the source page gets a trailing slash, and then the target is resolved to a level deeper than intended.

I'm aware that referencing file names (e.gx.mdx) works around this, but still a lot of "raw" uses slip past and go unnoticed, adding broken links.

As of time of writing, the first link on this page exhibits this behaviour: https://fblitho.com/docs/sections-tutorial#main_wrap

slorber commented 3 years ago

Just to make it clear: this issue is important for Docusaurus to solve and we'll work on it. It is just not as simple as you might think it is.

Doing a fix that works for a single site is not hard.

But doing a fix that works for ALL Docusaurus sites, using a large variety of hosting providers, all with distinct behaviors regarding how they serve static files, is complicated.

And a bad fix deployed in production can lead to unwanted side-effects that negatively affect the SEO of established documentation websites and products, so we must be sure the changes we make are safe.

Note: the Gatsby ecosystem has the exact same trailing slash problem (we have some code in common). See for example this recent blog post: https://jonsully.net/blog/trailing-slashes-and-gatsby/

For me, the source of all this is this function in an unmaintained dependency: https://github.com/markdalgleish/static-site-generator-webpack-plugin/blob/master/index.js#L165 Changing this logic is easy (fork the dep and patch it). Ensuring it does not produce unwanted side effects is time-consuming to check.


My plan is to provide a fix for trailing slash issues as an option:

Unfortunately, this won't prevent existing sites to not have issues unless they turn the option on.

If this option is successful for everyone, we could turn it on by default (the risky part)


Note: until I provide this option, the workaround I suggest here should work, and is not too far from what I plan to implement. So please try this workaround and tell us if it works or not.

agentofuser commented 3 years ago

Is there an easy workaround (i.e. that can be done on the website repo without changing docusaurus itself or dependencies) to force a trailing slash on all links and URLs?

thehappybug commented 3 years ago

Please try to add this site plugin and let me know if it works for your case.

@slorber Can you share some instructions for trying this out? So far, I've tried adding the sitePlugin.js to a src/plugins/sitePlugin.js directory and updating the Docusaurus config to add plugins: ['./src/plugins/sitePlugin'], to the root of the document but it did not effect any change. The build remains the same and no folder.html style files are created.

PS: I'm using Netlify and I'm trying to avoid the duplicate URLs issue. Currently, the site crawlers are seeing two URLs, one without slashes (present in the pages that are build by D2) and with slashes (which Netlify prefers and tries redirecting to when it sees a URL without a slash).

thehappybug commented 3 years ago

This is because we have had the strong recommendation from an SEO agency that—while neither format (trailing-slash or not) is "better"—there should be 301 redirects to make sure that only one of the two formats is getting crawled and indexed. And because Netlify treats these URL formats identically, and its redirects system therefore doesn't allow you to 301 from one to the other (it would create an infinite loop), our only alternative is to use this pretty URLs feature (which does create a 301) and to accept the decision which is made for us, namely that trailing-slash will be the final format.

I double-down on what @lunelson said. For users who are using Netlify, we need the final build to be using trailing-slashes as that decision is made for us, and turning on/off the "Pretty URLs" setting does not help.

slorber commented 3 years ago

Is there an easy workaround (i.e. that can be done on the website repo without changing docusaurus itself or dependencies) to force a trailing slash on all links and URLs?

@agentofuser not available now, but under consideration

slorber commented 3 years ago

@thehappybug are you even sure the plugin runs in the first place?

Have you tried adding logs so that you know which files are created? Note you can also run your own script with node after yarn build, it does not necessarily have to be a Docusaurus plugin if it complicated your life.

I double-down on what @lunelson said. For users who are using Netlify, we need the final build to be using trailing-slashes as that decision is made for us, and turning on/off the "Pretty URLs" setting does not help.

I claim another time that this is false: disabling pretty URLs works reliably on Netlify and there won't be any redirect. VERY IMPORTANT: the global "disable asset processing" checkbox is broken and does not really disable pretty URLs: you have to uncheck it.

Proof: as part of my study to solve this issue, I'm writing a trailing slash guide for the community. There's a live Netlify deployment with pretty URLs disable on which you can see for yourself there is no server-side redirect: https://github.com/slorber/trailing-slash-guide#netlify

slorber commented 3 years ago

@lunelson

This is because we have had the strong recommendation from an SEO agency that—while neither format (trailing-slash or not) is "better"—there should be 301 redirects to make sure that only one of the two formats is getting crawled and indexed. And because Netlify treats these URL formats identically, and its redirects system therefore doesn't allow you to 301 from one to the other (it would create an infinite loop), our only alternative is to use this pretty URLs feature (which does create a 301) and to accept the decision which is made for us, namely that trailing-slash will be the final format.

I'm not an SEO expert but I believe having 301 redirects is not mandatory if you have canonical URLs and you explain to the crawlers that the 2 URLs are the same page. Docusaurus sites have canonical URLs by default and crawlers shouldn't penalize your site for publishing duplicated content

slorber commented 3 years ago

I've opened a draft PR to propose a solution: https://github.com/facebook/docusaurus/pull/4908

thehappybug commented 3 years ago

Thank you for your responses, @slorber. The trailing slash guide was quite useful.

Have you tried adding logs so that you know which files are created? Note you can also run your own script with node after yarn build, it does not necessarily have to be a Docusaurus plugin if it complicated your life.

I managed to get your solution working. I had previously made a mistake in registering plugins.

For now, I'm not going forwards with deploying this solution because:

  1. With Pretty URLs disabled, Netlify is serving pages at both slug.html, slug/ and slug. Even though the canonical URL should prevent any impact on SEO (I am not an expert but this is what I gleaned from the comments), I would prefer that Netlify's routing did not serve the pages at multiple URLs. Using Pretty URLs results in a cleaner routing that sits better with the web developers in my team (🤷‍♂️).
  2. A previous Hugo-based version of our docs site used trailing slashes in the URLs. This worked well with Netlify's Pretty-URLs (we had assets optimization disabled but as you pointed out, the Pretty URLs feature still kicks in). When we migrated to D2, the sitemap suddenly had URLs without trailing slashes in them but the Pretty URLs feature still redirected users to the URLs with trailing slashes. This confused the Google crawler at least. You can see from the screenshot what I mean: image If you drill down into what happened, you see that each ignored page has this issue: image While I don't see our search performance going down as of now, my team is quite wary of continuing the site in this state.
  3. The solution you suggested copies the HTML pages outside the directories. With Pretty URLs, these pages would turn from slug.html to just slug. Since I'm stuck with having trailing slashes in a previous version of the site and Netlify having permanently redirected crawlers to the URLs with trailing slashes, the solution doesn't help me. I wait for an option in D2 that helps me control trailing slashes so that I can put trailing slashes consistently in all places, in all pages, and sitemap.

I hope this feedback helps. Let me know if I can help you test any future releases related to this issue.

slorber commented 3 years ago

Thanks for the feedback, I understand why you would want to keep the exact same URLs you had before, and I'll make that possible.

As soon as my PR is merged, you'd be able to use the @canary npm dist tag to give it a try (soon).

slorber commented 3 years ago

As part of https://github.com/facebook/docusaurus/pull/4908

I'm implementing a new {trailingSlash: boolean | undefined} config.

Refer to https://github.com/slorber/trailing-slash-guide for how the files will be served by your host.

I've created 3 deployments to test this feature:

Note: using true/false instead of undefined allows to use Netlify with pretty URLs on without having any annoying redirection, but it's not the case for the undefined behavior that will keep redirecting with pretty URLs on (default behavior): I disabled pretty URLs for undefined on purpose.

PLEASE: help me review those deployments and let me know if you see any unwanted side-effects: now is a better time to complain than after merging the PR :)

slorber commented 3 years ago

The PR has been merged, and you can try this right now with the @canary dist tag or wait for beta.1

https://www.npmjs.com/package/@docusaurus/core?activeTab=versions

2.0.0-beta.df8a900f9

Please let me know if anything does not work asap.

Also, in general, can you please let me know:

According to your feedbacks, we may update the doc and recommend a trailingSlash setting for each host