BeaconCMS / beacon

Open-source content management system (CMS) built with Phoenix LiveView. Faster render times to boost SEO performance, even for the most content-heavy pages.
https://beaconcms.org
MIT License
1.04k stars 101 forks source link

Generate `sitemap.xml` #169

Open bcardarella opened 1 year ago

bcardarella commented 1 year ago

Generating a site map should come from Beacon.

See https://github.com/BeaconCMS/beacon/issues/169#issuecomment-1535134825 for technical details.

AZholtkevych commented 1 year ago

Depends on BeaconCMS/beacon#95

AZholtkevych commented 1 year ago

Generating the file itself is trivial but we need to research how to serve it on multiple domains

leandrocp commented 1 year ago

@AZholtkevych we can remove the dependency of BeaconCMS/beacon#95. Here are the technical details:

Each site will have its own sitemap.xml which must be resolved dynamically by adding a route /sitemap.xml to https://github.com/BeaconCMS/beacon/blob/7790eb72769a026c0bdfc3167aab394a9b73ce91/lib/beacon/router.ex#L79

A request to that route should call Beacon.Lifecycle.generate_sitemap/1 which will provide a default implementation that should work for most scenarios, which is fetching all published pages and generating the following content:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/page/path</loc>
      <lastmod>2005-01-01</lastmod>
   </url>
</urlset> 

generate_sitemap/1 should receive site as argument and call Elixir.Beacon.Config.fetch!(site).endpoint.url() to fetch current site url to be used as prefix for page paths. <lastmod> is page.updated_at

Notes

Refs

camstuart commented 4 months ago

This would be really useful.

While creating content allow users to select “exclude from sitemap” (default = included) so content can be published but not seo indexed.

I think it would need to be handled as a nested route referenced from within a static sitemap.

When setting up a phoenix project, you use your own sitemap.xml, and inside it you point to a beacon-sitemap.xml route (a plug that you include in your router scope for beacon perhaps?).

This way, a phoenix app can use beacon at the root url, or any nested scope. The project maintainer then has fine grained control over the final sitemap, and beacon can still generate the route’s in the xml dynamically for cms url’s.

The content for the sitemap can live in an ETS table in a similar manner to content pages for fast delivery.

While multiple file types are allowed with sitemaps, stick with xml. It seems to be the accepted standard.

I would be interested in doing this PR as a learning exercise once the approach has been decided.

bcardarella commented 4 months ago

We have a sitemap gen on dockyard.com but I don't think it's made it's way back to Beacon yet. Should be on the spec list for post 0.1 though

leandrocp commented 4 months ago

While creating content allow users to select “exclude from sitemap” (default = included) so content can be published but not seo indexed.

That would be a new field in https://github.com/BeaconCMS/beacon/blob/main/lib/beacon/content/page.ex and we would give it a more generic name like "index seo" or something like that because in order to skip indexing we also need to add that path into the robots.txt file so that field can be used on both features that would be implemented in a similar way.

handled as a nested route referenced from within a static sitemap [...] phoenix project, you use your own sitemap.xml, and inside it you point to a beacon-sitemap.xml

We can solve that problem by using a sitemap index file where each beacon site has its own sitemap and your phoenix project may have its own site map as well (or more), ie:

sitemap index
  sitemap site_a
  sitemap site_b
  sitemap your_phoenix_project

This way we avoid hitting the size and length limits and also have more stability by isolating each one (one site failing to generate a sitemap should not impact the others).

a plug that you include in your router scope for beacon perhaps?

I think it will make more sense to implement a router macro for a couple of reasons:

  1. Keep all routing config in the same place (sites, admin, sitemap, and maybe others eventually)
  2. You'll have access to all sites defined in the router on the compilation phase to generate the proper GET route.

So it could look like:

defmodule MyAppWeb.Router do
  ...
  use Beacon.Router
  ...

  pipeline :browser do
    ...
  end

  scope "/blog" do
    pipe_through :browser
    beacon_site "/", site: :my_site
  end

  scope "/" do
    pipe_through :browser

    beacon_sitemap_index "sitemap_index.xml", additional: ["sitemap_phoenix_app.xml"]

    beacon_site "/", site: :my_site
  end
end

In this example it would generate:

The arguments and opts of that macro may change.

The content for the sitemap can live in an ETS table

That makes sense although caching it not strictly necessary for the first version so it's up to you if you wanna implement now or in a following PR. Either way two events are emitted when a page or pages are published, see an example how they are used. A genserver could react to those events to update that ETS table.

While multiple file types are allowed with sitemaps, stick with xml. It seems to be the accepted standard.

Yep and that's the only format that supports all extensions which we don't need to implement now but eventually we'll need to support.

I would be interested in doing this PR as a learning exercise once the approach has been decided.

That's awesome. Feel free to open a draft PR to get things started or ping me at anytime.

camstuart commented 4 months ago

We have a sitemap gen on dockyard.com but I don't think it's made it's way back to Beacon yet. Should be on the spec list for post 0.1 though

Sounds interesting, should I hold off for now until it's brought into beacon?

camstuart commented 4 months ago

While creating content allow users to select “exclude from sitemap” (default = included) so content can be published but not seo indexed.

That would be a new field in https://github.com/BeaconCMS/beacon/blob/main/lib/beacon/content/page.ex and we would give it a more generic name like "index seo" or something like that because in order to skip indexing we also need to add that path into the robots.txt file so that field can be used on both features that would be implemented in a similar way.

handled as a nested route referenced from within a static sitemap [...] phoenix project, you use your own sitemap.xml, and inside it you point to a beacon-sitemap.xml

We can solve that problem by using a sitemap index file where each beacon site has its own sitemap and your phoenix project may have its own site map as well (or more), ie:

sitemap index
  sitemap site_a
  sitemap site_b
  sitemap your_phoenix_project

This way we avoid hitting the size and length limits and also have more stability by isolating each one (one site failing to generate a sitemap should not impact the others).

a plug that you include in your router scope for beacon perhaps?

I think it will make more sense to implement a router macro for a couple of reasons:

  1. Keep all routing config in the same place (sites, admin, sitemap, and maybe others eventually)
  2. You'll have access to all sites defined in the router on the compilation phase to generate the proper GET route.

So it could look like:

defmodule MyAppWeb.Router do
  ...
  use Beacon.Router
  ...

  pipeline :browser do
    ...
  end

  scope "/blog" do
    pipe_through :browser
    beacon_site "/", site: :my_site
  end

  scope "/" do
    pipe_through :browser

    beacon_sitemap_index "sitemap_index.xml", additional: ["sitemap_phoenix_app.xml"]

    beacon_site "/", site: :my_site
  end
end

In this example it would generate:

  • mydomain.com/sitemap_index.xml (because beacon_sitemap_index is in the / scope)
  • mydomain.com/sitemap.xml and mydomain.com/blog/sitemap.xml as defined in each scope
  • mydomain.com/sitemap_phoenix_app.xml

The arguments and opts of that macro may change.

The content for the sitemap can live in an ETS table

That makes sense although caching it not strictly necessary for the first version so it's up to you if you wanna implement now or in a following PR. Either way two events are emitted when a page or pages are published, see an example how they are used. A genserver could react to those events to update that ETS table.

While multiple file types are allowed with sitemaps, stick with xml. It seems to be the accepted standard.

Yep and that's the only format that supports all extensions which we don't need to implement now but eventually we'll need to support.

I would be interested in doing this PR as a learning exercise once the approach has been decided.

That's awesome. Feel free to open a draft PR to get things started or ping me at anytime.

Thanks @leandrocp! thats a very detailed reply! I have not written a macro in Elixir (or any language for that matter) but it certainly sounds like this is the correct approach.

@bcardarella mentioned the sitemap in dockyard.com, so not sure if I should wait on this? let me know if it's worth proceeding now, or waiting?

I would indeed start with a draft PR that brings all the comments and design into a single location for you to review before cutting any code.

bcardarella commented 4 months ago

I don't think we have it on the roadmap yet, it's definitely something we should port over though

camstuart commented 4 months ago

I don't think we have it on the roadmap yet, it's definitely something we should port over though

Sounds like a good idea, if there is something we can re-use / make consistent then I am all for it. I was just reviewing CONTRIBUTING.md anyway, so I will start with a good first issue to get my hands dirty, and we will see what @leandrocp suggests as a way forward

leandrocp commented 4 months ago

We can port over some of the dockyard.com sitemap implementation but not everything because either part of the code is very business specific or works in a different way we want for beacon. But let me share what we can reuse while adapting to beacon (something closer to the actual implementation we're looking for):

Starting with the router:

  pipeline :xml do
    plug :accepts, ["xml"]
  end

  scope "/", DockYardWeb do
    pipe_through :xml
    get "/sitemap_index.xml", BeaconWeb.SitemapController, :index, as: :beacon_sitemap, assigns: %{sites: [...]} # <- that would be generated by `beacon_sitemap_index/2`
                                                                                                                 # and the list of `sites` can be fetched from the router
    get "/sitemap.xml", BeaconWeb.SitemapController, :show, as: :beacon_sitemap, assigns: %{site: opts[:site]} # <- auto-generated since there's a `beacon_site "/"` in the router
                                                                                                               # and site is the same as the one passed to `beacon_site` (needed to query pages)
  end

  scope "/" do
    pipe_through [:browser]
    beacon_site "/", site: :dockyard_com # <- this is what causes `get "/sitemap.xml"` to get generated.
                                         # If we had another site `beacon_site "/blog"` it would generate `get "/blog/sitemap.xml"` (one per site)
  end

We already use assigns to render assets if you wanna see an example.

The controller now needs to render those 2 actions: index and show, something like:

defmodule BeaconWeb.SitemapController do
  @moduledoc false

  import Plug.Conn

  def init(action) when action in [:index, :show], do: action

  def call(%{assigns: %{sites: sites}} = conn, :index) do
    # render embedded sitemap_index.xml.eex
  end

  def call(%{assigns: %{site: site}} = conn, :show) do
    # render embedded sitemap.xml.eex, eg:

    conn
    |> put_resp_content_type("text/xml")
    |> put_resp_header("cache-control", "public max-age=300") # <- may need to adjust caching
    |> render(:sitemap, pages: get_pages()) # <- call Beacon.Content.list_published_pages/2
  end

Note that Beacon.Content.list_published_pages/2 already caches results so we might not need another ETS table to cache which pages are marked to be in the sitemap.

On dockyard.com we use embedded templates which would look like:

defmodule BeaconWeb.SitemapXML do
  import Phoenix.Template, only: [embed_templates: 1]
  embed_templates "sitemap/*.xml"
end

Content and format of those xml files should follow Google's recommendation and the actual protocol definition, but in short the simplest sitemap.xml.eex would be:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <%= for page <- @pages do %><url>
        <loc><%= page.loc %></loc>
        <lastmod><%= page.lastmod %></lastmod>
    </url><% end %>
</urlset>

Notes:


@camstuart that's not the easiest issue to get started with but It's available to be taken if you want :)

bcardarella commented 4 months ago

@leandrocp I'm only suggesting porting if it makes sense, if you wanted to re-think sitemap generation that's OK too. We should go with whatever implementation is best

leandrocp commented 4 months ago

@bcardarella part of it makes sense and we can reuse some code 👍🏻