ProjectEvergreen / greenwood

Greenwood is your full-stack workbench for the web, focused on supporting modern web standards and development to help you create your next project.
https://www.greenwoodjs.io
MIT License

Sitemap Generation #1232

Open thescientist13 opened 1 month ago

thescientist13 commented 1 month ago

Summary

As called out in our Slack channel, Greenwood should definitely have some support for sitemaps. A sitemap is an XML file that tells search engines about the content and pages within a site, which matters most for larger sites and / or sites where links between pages are not consistent. https://developers.google.com/search/docs/crawling-indexing/sitemaps/overview

A sitemap tells search engines which pages and files you think are important in your site, and also provides valuable information about these files. For example, when the page was last updated and any alternate language versions of the page.

Here is a basic example https://developers.google.com/search/docs/crawling-indexing/sitemaps/build-sitemap

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/foo.html</loc>
    <lastmod>2022-06-04</lastmod>
  </url>
</urlset>

Details

I think the approach used in Next.js is probably good enough for Greenwood, supporting either of these options:

  1. ✅ Static File, e.g. sitemap.xml - will be copied automatically to the output
  2. Dynamic File, e.g. sitemap.xml.js - will be provided a copy of the greenwood graph and be expected to return valid XML

    export async function sitemap(compilation) {
      // join the entries, otherwise the array gets interpolated with commas
      const urls = compilation.graph.map((page) => {
        return `
          <url>
            <loc>http://www.example.com${page.route}</loc>
          </url>
        `;
      }).join('\n');

      // the XML declaration has to be the very first thing in the file
      return `<?xml version="1.0" encoding="UTF-8"?>
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          ${urls}
        </urlset>
      `;
    }
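One thing a dynamic sitemap would need to watch for is that routes can contain characters that are not valid inside XML as-is (e.g. `&` in a query string), and that `<loc>` values have to be absolute URLs. A minimal sketch of how that could be handled, assuming each page has a `route` string as in the example above. `escapeXml`, `buildSitemap` and the `origin` argument are hypothetical names for illustration, not part of Greenwood's API:

```javascript
// Hypothetical helpers for illustration, not part of Greenwood's API.

// Escape the five characters that XML treats as special.
function escapeXml(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// pages: array of objects with a `route` string, like the graph entries above.
// origin: the base URL of the site (an assumption, it has to come from somewhere).
function buildSitemap(pages, origin) {
  const urls = pages.map((page) => {
    // resolve the route against the origin to get an absolute URL
    const loc = new URL(page.route, origin).href;

    return `  <url>\n    <loc>${escapeXml(loc)}</loc>\n  </url>`;
  }).join('\n');

  // the XML declaration is the first line, with nothing before it
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    urls,
    '</urlset>'
  ].join('\n');
}

const xml = buildSitemap([{ route: '/' }, { route: '/about/' }], 'https://www.example.com');
console.log(xml);
```

Building the document from an array of lines also avoids any leading whitespace in front of the XML declaration, which a template literal that starts with a newline would introduce.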

Might want to wait until after #955 is merged since we might want to piggy back off any solutions there re: extending the ability for pages to be more than just markdown (.md) or JavaScript (.js).

thescientist13 commented 1 month ago

For now a couple ways to implement this manually could be to:

  1. Create / maintain a src/sitemap.xml and use a copy plugin to put into the output directory
  2. After the greenwood build step, read the contents of graph.json in the output directory and generate the file

jstockdi commented 1 month ago

For 2, would it be a copy plugin? i.e., the plugin would generate a temporary file, then pass

     {
        from: tempPath,
        to: new URL(`sitemap.xml`, outputDir)
     }

thescientist13 commented 1 month ago

@jstockdi Greenwood should automatically generate a graph.json file for you, that will be available in the output directory after running greenwood build (it's technically there too during development in the .greenwood/ tmp folder)

So after running greenwood build, a simple Node script should suffice

// sitemap-gen.js
import fs from 'fs';
import graph from './public/graph.json' with { type: 'json' };

const urls = graph.map((page) => {
  return `
    <url>
      <loc>http://www.example.com${page.route}</loc>
    </url>
  `
}).join('\n');

// the XML declaration has to be the very first thing in the file
fs.writeFileSync('./public/sitemap.xml', `<?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    ${urls}
  </urlset>
`);

# after running Greenwood build, or add to your npm scripts...
$ node sitemap-gen.js

edit: sorry, I think you were referencing option 1, in which case yes, a copy plugin would do the trick, e.g.

function myCopySitemapPlugin() {
  return {
    type: 'copy',
    name: 'plugin-copy-sitemap',
    provider: (compilation) => {
      const filename = 'sitemap.xml';
      const { userWorkspace, outputDir } = compilation.context;

      return [{
        // backticks are needed here so that ${filename} gets interpolated
        from: new URL(`./${filename}`, userWorkspace),
        to: new URL(`./${filename}`, outputDir)
      }];
    }
  };
}

Otherwise, to generate dynamically for now, the above script sample should also work. 🎯

jstockdi commented 1 month ago

Actually, I was thinking of using a copy plugin...

Read the graph, write a dynamic file to scratch, then copy to final.

import fs from 'fs';

const greenwoodPluginSitemap = [{
    type: 'copy',
    name: 'plugin-copy-sitemap',
    provider: async (compilation) => {
      const filename = 'sitemap.xml';
      const { outputDir, scratchDir } = compilation.context;

      // read the pages from the graph on the compilation
      const urls = compilation.graph.map((page) => {
        return `
          <url>
            <loc>http://www.example.com${page.route}</loc>
          </url>
        `
      }).join('\n');

      // write the generated file to scratch first
      // (the XML declaration has to be the very first thing in the file)
      const sitemapFromUrl = new URL(`./${filename}`, scratchDir);
      fs.writeFileSync(sitemapFromUrl, `<?xml version="1.0" encoding="UTF-8"?>
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          ${urls}
        </urlset>
      `);

      // then hand it to the copy step to move it into the output directory
      return [{
        from: sitemapFromUrl,
        to: new URL(`./${filename}`, outputDir)
      }];
    }
  }];

thescientist13 commented 1 month ago

So for the two different options here from a contributing perspective, here are my initial thoughts

Static Sitemap

For a static sitemap in the root workspace folder, e.g. src/sitemap.xml it should just be as simple as following one of the existing "copy" based features / plugins, like our robots.txt plugin https://github.com/ProjectEvergreen/greenwood/blob/master/packages/cli/src/plugins/copy/plugin-copy-robots.js

Dynamic Sitemap

As for supporting a dynamic flavor of this, e.g. src/sitemap.xml.js I'm not sure I have an idea on the best way to instrument this off the top of my head, mainly for handling development vs production workflows which are slightly different.

For development, we could make a resource plugin that has a serve lifecycle, which checks if the dynamic flavor exists in shouldServe, and then the serve function would be something like this?

async function shouldServe(url) {
  return url.pathname.endsWith('sitemap.xml.js');
}

async function serve(url) {
  // await import(...) already resolves to the module, so no .then() is needed
  const { generateSitemap } = await import(url);
  const sitemap = await generateSitemap(this.compilation);

  return new Response(sitemap, { headers: { 'Content-Type': 'text/xml' } });
}

For production, we could probably just run that similar logic in serve (except just outputting a file instead of returning a Response object) in the bundle command.

Testing

Greenwood tests are basically black box tests. You can create an exact version of any Greenwood project + config, run the CLI, and just validate the output; in either case here, that means checking that a sitemap.xml file is generated in the output folder. https://github.com/ProjectEvergreen/greenwood/tree/master/packages/cli/test/cases

We would probably want one test case for each of static and dynamic sitemaps
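
For the assertions themselves, a test case could go one step past checking that the file exists and compare the `<loc>` entries in the generated sitemap.xml against the expected routes. A rough sketch, assuming the file contents have already been read from the output folder into a string. `getSitemapLocs` is a hypothetical helper and the sample XML is made up for illustration, neither is an existing Greenwood test utility:

```javascript
// Hypothetical helper for a test case, not an existing Greenwood utility.
// Pulls the text of every <loc> element out of a sitemap string.
function getSitemapLocs(xml) {
  return [...xml.matchAll(/<loc>([^<]*)<\/loc>/g)].map((match) => match[1].trim());
}

// made up sample standing in for the contents of the generated sitemap.xml
const sample = [
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
  '  <url><loc>http://www.example.com/</loc></url>',
  '  <url><loc>http://www.example.com/about/</loc></url>',
  '</urlset>'
].join('\n');

console.log(getSitemapLocs(sample));
```

A regular expression is enough for a flat file like this; a test that needs to validate the whole document structure would probably be better off with a real XML parser.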

Documentation

I think for now the best place to document these would probably be in the Styles and Assets page