Open bcardarella opened 1 year ago
Depends on BeaconCMS/beacon#95
Generating the file itself is trivial but we need to research how to serve it on multiple domains
@AZholtkevych we can remove the dependency of BeaconCMS/beacon#95. Here are the technical details:
Each site will have its own sitemap.xml which must be resolved dynamically by adding a route /sitemap.xml
to https://github.com/BeaconCMS/beacon/blob/7790eb72769a026c0bdfc3167aab394a9b73ce91/lib/beacon/router.ex#L79
A request to that route should call Beacon.Lifecycle.generate_sitemap/1
which will provide a default implementation that should work for most scenarios, which is fetching all published pages and generating the following content:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/page/path</loc>
<lastmod>2005-01-01</lastmod>
</url>
</urlset>
generate_sitemap/1
should receive site
as argument and call Elixir.Beacon.Config.fetch!(site).endpoint.url()
to fetch current site url to be used as prefix for page paths. <lastmod>
is page.updated_at
<priority>
and <changefreq>
since it's ignored by Google. We'll revisit it later on https://github.com/BeaconCMS/beacon_live_admin/issues/39, see https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap?hl=en&visit_id=638188133416054996-3427441917&rd=1#additional-notes-about-xml-sitemapsThis would be really useful.
While creating content allow users to select “exclude from sitemap” (default = included) so content can be published but not seo indexed.
I think it would need to be handled as a nested route referenced from within a static sitemap.
When setting up a phoenix project, you use your own sitemap.xml, and inside it you point to a beacon-sitemap.xml route (a plug that you include in your router scope for beacon perhaps?).
This way, a phoenix app can use beacon at the root url, or any nested scope. The project maintainer then has fine grained control over the final sitemap, and beacon can still generate the route’s in the xml dynamically for cms url’s.
The content for the sitemap can live in an ETS table in a similar manner to content pages for fast delivery.
While multiple file types are allowed with sitemaps, stick with xml. It seems to be the accepted standard.
I would be interested in doing this PR as a learning exercise once the approach has been decided.
We have a sitemap gen on dockyard.com but I don't think it's made it's way back to Beacon yet. Should be on the spec list for post 0.1 though
While creating content allow users to select “exclude from sitemap” (default = included) so content can be published but not seo indexed.
That would be a new field in https://github.com/BeaconCMS/beacon/blob/main/lib/beacon/content/page.ex and we would give it a more generic name like "index seo" or something like that because in order to skip indexing we also need to add that path into the robots.txt file so that field can be used on both features that would be implemented in a similar way.
handled as a nested route referenced from within a static sitemap [...] phoenix project, you use your own sitemap.xml, and inside it you point to a beacon-sitemap.xml
We can solve that problem by using a sitemap index file where each beacon site has its own sitemap and your phoenix project may have its own site map as well (or more), ie:
sitemap index
sitemap site_a
sitemap site_b
sitemap your_phoenix_project
This way we avoid hitting the size and length limits and also have more stability by isolating each one (one site failing to generate a sitemap should not impact the others).
a plug that you include in your router scope for beacon perhaps?
I think it will make more sense to implement a router macro for a couple of reasons:
GET
route.So it could look like:
defmodule MyAppWeb.Router do
...
use Beacon.Router
...
pipeline :browser do
...
end
scope "/blog" do
pipe_through :browser
beacon_site "/", site: :my_site
end
scope "/" do
pipe_through :browser
beacon_sitemap_index "sitemap_index.xml", additional: ["sitemap_phoenix_app.xml"]
beacon_site "/", site: :my_site
end
end
In this example it would generate:
beacon_sitemap_index
is in the /
scope)The arguments and opts of that macro may change.
The content for the sitemap can live in an ETS table
That makes sense although caching it not strictly necessary for the first version so it's up to you if you wanna implement now or in a following PR. Either way two events are emitted when a page or pages are published, see an example how they are used. A genserver could react to those events to update that ETS table.
While multiple file types are allowed with sitemaps, stick with xml. It seems to be the accepted standard.
Yep and that's the only format that supports all extensions which we don't need to implement now but eventually we'll need to support.
I would be interested in doing this PR as a learning exercise once the approach has been decided.
That's awesome. Feel free to open a draft PR to get things started or ping me at anytime.
We have a sitemap gen on dockyard.com but I don't think it's made it's way back to Beacon yet. Should be on the spec list for post 0.1 though
Sounds interesting, should I hold off for now until it's brought into beacon?
While creating content allow users to select “exclude from sitemap” (default = included) so content can be published but not seo indexed.
That would be a new field in https://github.com/BeaconCMS/beacon/blob/main/lib/beacon/content/page.ex and we would give it a more generic name like "index seo" or something like that because in order to skip indexing we also need to add that path into the robots.txt file so that field can be used on both features that would be implemented in a similar way.
handled as a nested route referenced from within a static sitemap [...] phoenix project, you use your own sitemap.xml, and inside it you point to a beacon-sitemap.xml
We can solve that problem by using a sitemap index file where each beacon site has its own sitemap and your phoenix project may have its own site map as well (or more), ie:
sitemap index sitemap site_a sitemap site_b sitemap your_phoenix_project
This way we avoid hitting the size and length limits and also have more stability by isolating each one (one site failing to generate a sitemap should not impact the others).
a plug that you include in your router scope for beacon perhaps?
I think it will make more sense to implement a router macro for a couple of reasons:
- Keep all routing config in the same place (sites, admin, sitemap, and maybe others eventually)
- You'll have access to all sites defined in the router on the compilation phase to generate the proper
GET
route.So it could look like:
defmodule MyAppWeb.Router do ... use Beacon.Router ... pipeline :browser do ... end scope "/blog" do pipe_through :browser beacon_site "/", site: :my_site end scope "/" do pipe_through :browser beacon_sitemap_index "sitemap_index.xml", additional: ["sitemap_phoenix_app.xml"] beacon_site "/", site: :my_site end end
In this example it would generate:
- mydomain.com/sitemap_index.xml (because
beacon_sitemap_index
is in the/
scope)- mydomain.com/sitemap.xml and mydomain.com/blog/sitemap.xml as defined in each scope
- mydomain.com/sitemap_phoenix_app.xml
The arguments and opts of that macro may change.
The content for the sitemap can live in an ETS table
That makes sense although caching it not strictly necessary for the first version so it's up to you if you wanna implement now or in a following PR. Either way two events are emitted when a page or pages are published, see an example how they are used. A genserver could react to those events to update that ETS table.
While multiple file types are allowed with sitemaps, stick with xml. It seems to be the accepted standard.
Yep and that's the only format that supports all extensions which we don't need to implement now but eventually we'll need to support.
I would be interested in doing this PR as a learning exercise once the approach has been decided.
That's awesome. Feel free to open a draft PR to get things started or ping me at anytime.
Thanks @leandrocp! thats a very detailed reply! I have not written a macro in Elixir (or any language for that matter) but it certainly sounds like this is the correct approach.
@bcardarella mentioned the sitemap in dockyard.com, so not sure if I should wait on this? let me know if it's worth proceeding now, or waiting?
I would indeed start with a draft PR that brings all the comments and design into a single location for you to review before cutting any code.
I don't think we have it on the roadmap yet, it's definitely something we should port over though
I don't think we have it on the roadmap yet, it's definitely something we should port over though
Sounds like a good idea, if there is something we can re-use / make consistent then I am all for it.
I was just reviewing CONTRIBUTING.md
anyway, so I will start with a good first issue
to get my hands dirty, and we will see what @leandrocp suggests as a way forward
We can port over some of the dockyard.com sitemap implementation but not everything because either part of the code is very business specific or works in a different way we want for beacon. But let me share what we can reuse while adapting to beacon (something closer to the actual implementation we're looking for):
Starting with the router:
pipeline :xml do
plug :accepts, ["xml"]
end
scope "/", DockYardWeb do
pipe_through :xml
get "/sitemap_index.xml", BeaconWeb.SitemapController, :index, as: :beacon_sitemap, assigns: %{sites: [...]} # <- that would be generated by `beacon_sitemap_index/2`
# and the list of `sites` can be fetched from the router
get "/sitemap.xml", BeaconWeb.SitemapController, :show, as: :beacon_sitemap, assigns: %{site: opts[:site]} # <- auto-generated since there's a `beacon_site "/"` in the router
# and site is the same as the one passed to `beacon_site` (needed to query pages)
end
scope "/" do
pipe_through [:browser]
beacon_site "/", site: :dockyard_com # <- this is what causes `get "/sitemap.xml"` to get generated.
# If we had another site `beacon_site "/blog"` it would generate `get "/blog/sitemap.xml"` (one per site)
end
We already use assigns to render assets if you wanna see an example.
The controller now needs to render those 2 actions: index
and show
, something like:
defmodule BeaconWeb.SitemapController do
@moduledoc false
import Plug.Conn
def init(action) when action in [:index, :show], do: action
def call(%{assigns: %{sites: sites}} = conn, :index) do
# render embedded sitemap_index.xml.eex
end
def call(%{assigns: %{site: site}} = conn, :show) do
# render embedded sitemap.xml.eex, eg:
conn
|> put_resp_content_type("text/xml")
|> put_resp_header("cache-control", "public max-age=300") # <- may need to adjust caching
|> render(:sitemap, pages: get_pages()) # <- call Beacon.Content.list_published_pages/2
end
Note that Beacon.Content.list_published_pages/2
already caches results so we might not need another ETS table to cache which pages are marked to be in the sitemap.
On dockyard.com we use embedded templates which would look like:
defmodule BeaconWeb.SitemapXML do
import Phoenix.Template, only: [embed_templates: 1]
embed_templates "sitemap/*.xml"
end
Content and format of those xml files should follow Google's recommendation and the actual protocol definition, but in short the simplest sitemap.xml.eex
would be:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<%= for page <- @pages do %><url>
<loc><%= page.loc %></loc>
<lastmod><%= page.lastmod %></lastmod>
</url><% end %>
</urlset>
Notes:
Beacon.Config.fetch!(site).endpoint
to call #{endpoint.url()}/#{page.path}
in order to generate page.loc
@camstuart that's not the easiest issue to get started with but It's available to be taken if you want :)
@leandrocp I'm only suggesting porting if it makes sense, if you wanted to re-think sitemap generation that's OK too. We should go with whatever implementation is best
@bcardarella part of it makes sense and we can reuse some code 👍🏻
Generating a site map should come from Beacon.
See https://github.com/BeaconCMS/beacon/issues/169#issuecomment-1535134825 for technical details.