MbinOrg / mbin

Mbin: a federated content aggregator, voting, discussion and microblogging platform (By the community, for the community)
https://joinmbin.org
GNU Affero General Public License v3.0

Add sitemap.xml #524

Open kreynen opened 4 months ago

kreynen commented 4 months ago

Is your feature request related to a problem? Please describe.

When searching for something like https://www.google.com/search?q=drupal+reservation+systems, users will often find links to Reddit ranked relatively high in the results.


Google isn't using https://www.reddit.com/sitemap.xml to find new Reddit posts. Google is treating Reddit differently than the rest of the semantic web... and will continue to do that with deals like https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/.

For a new community/mbin instance to compete with an existing reddit community, it has to be discoverable outside of ActivityPub clients.

Describe the solution you'd like

Adding a sitemap.xml that lists the magazines and collections on an instance is one way to improve how quickly Google and other search engines find and index content. My recommendation is to provide this as an option magazines can opt into. The root level sitemap.xml of the instance would be a sitemap xml index of the local magazines that choose to generate a sitemap.xml.

The Magazine level sitemap.xml would include the details of threads posted.

Ignoring the fact that https://kbin.social/m/drupal is hosted on kbin.social for the moment... if https://kbin.social/m/drupal was the only magazine that opted in, the root level sitemap.xml file at https://kbin.social/sitemap.xml would look like...

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://kbin.social/m/drupal/sitemap.xml</loc>
  </sitemap>
</sitemapindex>
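As a rough illustration only (not Mbin code), the index above could be produced from a list of opted-in magazine names like this. The function name `build_sitemap_index` and its parameters are hypothetical; the `/m/<name>/sitemap.xml` path follows the example in this proposal.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'


def build_sitemap_index(base_url, magazine_names):
    """Return a sitemap index with one <sitemap> entry per opted-in magazine.

    Hypothetical helper: `magazine_names` is assumed to already be filtered
    down to the local magazines that opted in.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in magazine_names:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url}/m/{name}/sitemap.xml"
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return XML_DECLARATION + ET.tostring(root, encoding="unicode")


print(build_sitemap_index("https://kbin.social", ["drupal"]))
```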

The magazine level sitemap.xml at https://kbin.social/m/drupal/sitemap.xml would look like...

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://kbin.social/m/drupal/t/814091/Following-Kbin-communities-from-Mastodon-is-as-easy-as-searching</loc>
    <lastmod>2024-02-04</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://kbin.social/m/drupal/t/860608/Ways-to-Optimize-Carousel-Sliders-in-Drupal-for-Faster-Page</loc>
    <lastmod>2024-02-26</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.1</priority>
  </url>
  <url>
    <loc>https://kbin.social/m/drupal/t/855307/The-Essential-Drupal-Commerce-Modules-for-building-Online-Stores</loc>
    <lastmod>2024-02-24</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
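A minimal sketch of generating a magazine-level file like the one above, assuming the thread data is already available in memory. `ThreadEntry` and `build_magazine_sitemap` are hypothetical names used for illustration, not part of Mbin.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'


@dataclass
class ThreadEntry:
    """Hypothetical container for the per-thread fields shown above."""
    url: str
    last_modified: date
    changefreq: str
    priority: float


def build_magazine_sitemap(entries):
    """Return a <urlset> document with one <url> element per thread."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = entry.url
        ET.SubElement(url, "lastmod").text = entry.last_modified.isoformat()
        ET.SubElement(url, "changefreq").text = entry.changefreq
        ET.SubElement(url, "priority").text = f"{entry.priority:.1f}"
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return XML_DECLARATION + ET.tostring(root, encoding="unicode")


example = ThreadEntry(
    url="https://kbin.social/m/drupal/t/814091/Following-Kbin-communities-from-Mastodon-is-as-easy-as-searching",
    last_modified=date(2024, 2, 4),
    changefreq="weekly",
    priority=0.8,
)
print(build_magazine_sitemap([example]))
```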

The priority for each magazine could be calculated from pinned status and votes. Changefreq would be based on replies and voting in that thread.
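One possible way to turn those signals into values, sketched under stated assumptions: priority is assigned per thread URL as in the example sitemap, and every threshold and weight below is made up for illustration. The function names and parameters are hypothetical. The only fixed constraints come from the sitemap protocol itself: priority must lie between 0.0 and 1.0, and changefreq must be one of its predefined values.

```python
def thread_priority(pinned, score, max_score):
    """Map a thread's votes to a sitemap priority.

    Assumed heuristic: pinned threads get 1.0; other threads are scaled
    into 0.1..0.9 relative to the highest-scoring thread in the magazine.
    """
    if pinned:
        return 1.0
    if max_score <= 0:
        return 0.5  # the protocol's default priority
    ratio = max(0, min(score, max_score)) / max_score
    return round(0.1 + 0.8 * ratio, 1)


def thread_changefreq(activity_last_week):
    """Pick a changefreq from the replies plus votes seen in the past week.

    The thresholds are arbitrary placeholders, not tuned values.
    """
    if activity_last_week >= 7:
        return "daily"
    if activity_last_week >= 1:
        return "weekly"
    return "monthly"


print(thread_priority(pinned=False, score=5, max_score=10))
print(thread_changefreq(activity_last_week=3))
```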

Describe alternatives you've considered

My interest in this request is for a very specific use case, but when I started looking into this I found someone else had already opened the feature request in https://codeberg.org/Kbin/kbin-core/issues/1305. I started looking into some of the options for generating sitemap.xml files with modern PHP/Symfony, but never got a response from the KBin community on which direction would align with the project's architecture... so now I'm asking the same questions here.

https://keeplearning.dev/generate-sitemap-in-symfony-6-6068c37225 gives a good, high-level overview of bundle vs. custom controller approaches. I know nothing about these bundles or the Mbin project's preferred approach to a feature like this, but I'm willing to volunteer a few cycles to move this forward if someone more familiar with the project is willing to point me in the right direction.

While I think I could get all the information I need to generate the sitemap.xml from instances that have the API enabled like https://kbin.melroy.org/api/magazines?p=1&perPage=48&sort=hot&federation=local&hide_adult=hide and https://kbin.melroy.org/api/magazine/25/entries?sort=hot&time=%E2%88%9E&p=1&perPage=25&usePreferredLangs=false and generate the files with a service outside the MBin codebase, that's a really inefficient way to generate those files on a low traffic instance.
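For reference, the two API requests above can be rebuilt page by page with a small helper. This is only a sketch of the external-service approach: the function names are hypothetical, the query parameters are copied from the URLs above, and the response format is not shown in this thread, so nothing is fetched or parsed here.

```python
from urllib.parse import urlencode


def magazines_url(instance, page=1, per_page=48):
    """Build the URL that lists local magazines on an instance."""
    query = urlencode({
        "p": page,
        "perPage": per_page,
        "sort": "hot",
        "federation": "local",
        "hide_adult": "hide",
    })
    return f"{instance}/api/magazines?{query}"


def magazine_entries_url(instance, magazine_id, page=1, per_page=25):
    """Build the URL that lists entries (threads) of one magazine."""
    query = urlencode({
        "sort": "hot",
        "time": "\u221e",  # the infinity sign, percent-encoded as %E2%88%9E
        "p": page,
        "perPage": per_page,
        "usePreferredLangs": "false",
    })
    return f"{instance}/api/magazine/{magazine_id}/entries?{query}"


print(magazines_url("https://kbin.melroy.org"))
print(magazine_entries_url("https://kbin.melroy.org", 25))
```

An external generator would have to walk every page of both endpoints on each run, which is the inefficiency described above.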

Additional context

If someone points me in the right direction, I'm happy to take a stab at this.

BentiGorlich commented 4 months ago

Can you please fill out the template for a feature request and edit yours accordingly? And add the information from the original proposal?

As for your request, I think we need to have useful privacy options before we talk about an XML file that just contains pointers to everything from an instance. Additionally, I am skeptical whether this is a good thing in the first place. Either way, I think that comments should not be present in the sitemap at all (not in the proposal, just wanted to say it).

kreynen commented 4 months ago

I updated the formatting. I'm curious about why you are skeptical about using an open standard for defining content location, priority and the frequency that the content is updated? The lack of a sitemap.xml does not determine whether the content is indexed or not.

If you search https://www.google.com/search?q=kreynen+drupal and scroll down into the results, you will eventually find Kbin, Reddit and Mastodon posts. If it's public, Google will index it. This feature would give instance owners the option of influencing how often Google is indexing specific content from the instance.

BentiGorlich commented 4 months ago

I think my hesitation comes from not really knowing a lot about it and making it a lot easier for everybody to find things they are not supposed to find. So I don't have a good reason for blocking it, cause security by obscurity is not security... Just 2 hints: Lemmy has a sitemap, though not a very extensive one, Mastodon does not

kreynen commented 4 months ago

As I'm sure you are aware, it's not a great idea to rely on obscurity for security. You can't even rely on bots to respect a robots.txt. If something is available without authentication to HTTP requests, assume it will eventually show up in a Google search.

Google has a special relationship with large projects. If you scan a Drupal or WordPress site with https://pagespeed.web.dev/, you will get Drupal or WordPress specific suggestions to improve the page performance... which reduces Google's cost to index the content.

Adding a Sitemap.xml has come up in Mastodon too. https://github.com/mastodon/mastodon/issues/11959 points to a Python project that can generate a sitemap.xml for a Mastodon instance that uses a similar approach to what I was describing doing with the KBin/MBin API.

I'm going to share more about why we want this feature in Matrix.

asdfzdfj commented 4 months ago

my 2c braindump on this:

my rationale here is that you could set up an instance where only a handful of magazines get a sitemap index, or an instance that's meant/intended to be seen and indexed but excludes some magazines from this indexing, like those about meta discussion/reports about the instance itself or a general lobby magazine, if they are interested in exposing this at all. otherwise it shouldn't be generating anything if the instance doesn't want to be easily seen/indexed by search engines and other bots/tools.
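A hedged sketch of how such a policy could be modelled: an instance-wide switch, a configurable default, and an optional per-magazine override. All names here (`Magazine`, `sitemap_override`, `magazines_in_sitemap`) are hypothetical and do not reflect Mbin's actual settings or entities.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Magazine:
    name: str
    # None means "follow the instance default"; True/False is an explicit choice.
    sitemap_override: Optional[bool] = None


def magazines_in_sitemap(magazines, sitemap_enabled, include_by_default):
    """Return the names of the magazines that should appear in the sitemap index."""
    if not sitemap_enabled:
        return []  # the instance does not want to be indexed: generate nothing
    selected = []
    for magazine in magazines:
        included = (
            include_by_default
            if magazine.sitemap_override is None
            else magazine.sitemap_override
        )
        if included:
            selected.append(magazine.name)
    return selected


magazines = [Magazine("drupal", True), Magazine("meta", False), Magazine("lobby")]
# Allowlist mode: nothing is included unless a magazine opts in.
print(magazines_in_sitemap(magazines, sitemap_enabled=True, include_by_default=False))
# Blocklist mode: everything is included unless a magazine opts out.
print(magazines_in_sitemap(magazines, sitemap_enabled=True, include_by_default=True))
```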

also, ideally the code should be authored by the contributor (i.e. you, if you want to submit patches), but depending on how sitemap generation is done and how expensive it could be perhaps enlisting help from an external bundle might be a decent choice (note that I'm mostly going off presta/sitemap-bundle for now since the other one appears to be archived, but I'm quite interested in dumped sitemaps functionality that could be periodically updated, if live sitemap generation by custom controller could become an expensive operation)

in any case, feel free to make a fork copy and do some experiments in the meantime, and maybe make a PR/propose the patch if/when you feel like you've got something?

BentiGorlich commented 4 months ago

I agree with

Additionally I would add

I think we should definitely leverage the scheduler component for this. We didn't yet build the framework to just use it, but I wanted to include it anyways

asdfzdfj commented 4 months ago
  • give magazines options (though I am for opt out in this case, but a configurable global default is good as well)

at first I also thought of this mode for magazine, but I decided on configurable defaults for easy allowlist/blocklist mode of operation when sitemap generation is active

  • include users who opt in

that'd be good too, but I didn't mention this since adding an option for (local) users is easy, but I have no idea how to best enforce these for remote users posting to the local magazine

melroy89 commented 4 months ago

Adding a Sitemap.xml has come up in Mastodon too. mastodon/mastodon#11959 points to a Python project that can generate a sitemap.xml for a Mastodon instance that uses a similar approach to what I was describing doing with the KBin/MBin API.

My 2c are: