Open kreynen opened 9 months ago
Can you please fill out the template for a feature request and edit yours accordingly? And add the information from the original proposal?
As per your request, I think we need to have useful privacy options before we talk about an xml file that just contains pointers to everything from an instance. Additionally I am skeptical whether this is a good thing in the first place. Either way, I think that comments should not be present in the sitemap at all (not in the proposal, just wanted to say it).
I updated the formatting. I'm curious about why you are skeptical about using an open standard for defining content location, priority and the frequency that the content is updated? The lack of a sitemap.xml does not determine whether the content is indexed or not.
If you search https://www.google.com/search?q=kreynen+drupal and scroll down into the results, you will eventually find Kbin, Reddit and Mastodon posts. If it's public, Google will index it. This feature would give instance owners the option of influencing how often Google is indexing specific content from the instance.
I think my hesitation comes from not really knowing a lot about it and making it a lot easier for everybody to find things they are not supposed to find. So I don't have a good reason for blocking it, cause security by obscurity is not security... Just 2 hints: Lemmy has a sitemap, though not a very extensive one, Mastodon does not
As I'm sure you are aware, it's not a great idea to rely on obscurity for security. You can't even rely on bots to respect a robots.txt. If something is available without authentication to HTTP requests, assume it will eventually show up in a Google search.
Google has a special relationship with large projects. If you scan a Drupal or WordPress site with https://pagespeed.web.dev/, you will get Drupal or WordPress specific suggestions to improve the page performance... which reduces Google's cost to index the content.
Adding a Sitemap.xml has come up in Mastodon too. https://github.com/mastodon/mastodon/issues/11959 points to a Python project that can generate a sitemap.xml for a Mastodon instance that uses a similar approach to what I was describing doing with the KBin/MBin API.
I'm going to share more about why we want this feature in Matrix.
my 2c braindump on this:
random should also be excluded from sitemap generation, despite the instance defaults
my rationale here is that you could set up an instance where only a handful of magazines would be getting sitemap indexes, or an instance where it's meant/intended to be seen and indexed, but then maybe exclude some magazines from this indexing, like those about meta discussion/reports about the instance itself or a general lobby magazine, if they are interested in wanting this to be exposed at all. otherwise it shouldn't be generating anything if the instance doesn't want to be easily seen/indexed by search engines and other bots/tools.
also, ideally the code should be authored by the contributor (i.e. you, if you want to submit patches), but depending on how sitemap generation is done and how expensive it could be, perhaps enlisting help from an external bundle might be a decent choice (note that I'm mostly going off presta/sitemap-bundle for now since the other one appears to be archived, but I'm quite interested in dumped sitemaps functionality that could be periodically updated, if live sitemap generation by custom controller could become an expensive operation)
in any case, feel free to make a fork copy and do some experiments in the meantime, and maybe make a PR/propose the patch if/when you feel like you've got something?
I agree with
Additionally I would add
I think we should definitely leverage the scheduler component for this. We didn't yet build the framework to just use it, but I wanted to include it anyways
- give magazines options (though I am for opt out in this case, but a configurable global default is good as well)
at first I also thought of this mode for magazine, but I decided on configurable defaults for easy allowlist/blocklist mode of operation when sitemap generation is active
- include users who opt in
that'd be good too, but I didn't mention this since adding option for (local) is easy, but I have no idea on how to best enforce these for remote users posting to the local magazine
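The allowlist/blocklist idea discussed above (a configurable instance default, per-magazine overrides, and some magazines excluded regardless) could be sketched roughly like this. This is only an illustration in Python, not Mbin code: the `Magazine` class, the `sitemap_override` field, `ALWAYS_EXCLUDED` and `in_sitemap` are all made-up names, and the real implementation would live in the PHP codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Magazine:
    name: str
    # None means "follow the instance default"; True/False is an explicit
    # per-magazine choice (hypothetical field, not an Mbin column).
    sitemap_override: Optional[bool] = None


# Hypothetical set of magazines that are never listed, whatever the defaults say.
ALWAYS_EXCLUDED = {"random"}


def in_sitemap(magazine: Magazine, sitemap_enabled: bool, default_include: bool) -> bool:
    """Decide whether a magazine is listed when sitemap generation runs."""
    if not sitemap_enabled:
        return False  # the instance does not want to be indexed: list nothing
    if magazine.name in ALWAYS_EXCLUDED:
        return False
    if magazine.sitemap_override is not None:
        return magazine.sitemap_override
    return default_include


magazines = [
    Magazine("drupal"),
    Magazine("meta", sitemap_override=False),
    Magazine("random"),
]

# Blocklist mode: include by default, let individual magazines opt out.
listed = [m.name for m in magazines if in_sitemap(m, True, True)]
```

Flipping `default_include` to `False` gives the allowlist mode, where only magazines with an explicit opt-in are listed.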
Adding a Sitemap.xml has come up in Mastodon too. mastodon/mastodon#11959 points to a Python project that can generate a sitemap.xml for a Mastodon instance that uses a similar approach to what I was describing doing with the KBin/MBin API.
My 2c are:
All of the above already mentioned by @asdfzdfj & @BentiGorlich
Try to add the (generated) sitemap on the root path (/sitemap.xml). And point to other sitemaps from there if needed.
Do NOT use our APIs for creating a sitemap.xml. Like you said, it's not the most efficient way. If you want to generate a sitemap, use PHP just like the rest of the project and you can leverage internal methods to retrieve only data you really need. You can also write dedicated queries/DTO to retrieve data from the database.
Cache the sitemap.xml internally for a certain period of time, so that if I call the sitemap.xml 10 times in a row it has no impact. Do not regenerate the sitemap from scratch every time. Otherwise this will most likely cause too much load and waste server resources.
Limit the max. results on the (sub) sitemaps.xml. Eg. limit in the amount of records retrieved (eg. a hard DB limit) and/or in time (eg. not more than several months/years back?). This will improve performance and also makes it more relevant for search engines.
We are not fully SEO optimized. Meaning sitemap.xml is a good start (I also generated them for my site, like my blog), but most likely there are other next steps to improve getting indexed by search engines like Google. Just saying, this will be out of scope for now of course.
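To make the caching and limiting points above a bit more concrete, here is a minimal sketch. It is written in Python purely for illustration (the actual feature would be PHP/Symfony), and the constants `MAX_URLS`, `MAX_AGE` and `CACHE_TTL_SECONDS`, the `select_entries` helper and the `CachedSitemap` class are assumptions, not existing Mbin names or agreed values.

```python
import time
from datetime import datetime, timedelta, timezone

MAX_URLS = 1000                # assumed hard cap per (sub) sitemap
MAX_AGE = timedelta(days=365)  # assumed cutoff: ignore older content
CACHE_TTL_SECONDS = 3600       # assumed cache lifetime


def select_entries(entries, now):
    """Keep only recent entries, newest first, capped at MAX_URLS."""
    cutoff = now - MAX_AGE
    recent = [e for e in entries if e["updated"] >= cutoff]
    recent.sort(key=lambda e: e["updated"], reverse=True)
    return recent[:MAX_URLS]


class CachedSitemap:
    """Rebuild the sitemap body at most once per TTL window."""

    def __init__(self, build, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._build = build    # callable returning the sitemap body
        self._ttl = ttl
        self._clock = clock    # injectable so the expiry can be tested
        self._body = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._body is None or now >= self._expires_at:
            self._body = self._build()
            self._expires_at = now + self._ttl
        return self._body


# Usage: ten requests in a row trigger a single build.
builds = []
cache = CachedSitemap(lambda: builds.append(1) or "<urlset/>")
bodies = [cache.get() for _ in range(10)]
```

In the real project the time and row limits would more likely be applied in the database query itself, and the cache could be whatever the framework already provides; the sketch only shows the behaviour being asked for.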
Is your feature request related to a problem? Please describe.
When searching for something like https://www.google.com/search?q=drupal+reservation+systems, users will often find links to Reddit ranked relatively high in the results.
Google isn't using https://www.reddit.com/sitemap.xml to find new Reddit posts. Google is treating Reddit differently than the rest of the semantic web... and will continue to do that with deals like https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/.
For a new community/mbin instance to compete with an existing reddit community, it has to be discoverable outside of ActivityPub clients.
Describe the solution you'd like
Adding a sitemap.xml that lists the magazines and collections on an instance is one way to improve how quickly Google and other search engines find and index content. My recommendation is to provide this as an option magazines can opt into. The root level sitemap.xml of the instance would be a sitemap xml index of the local magazines that choose to generate a sitemap.xml.
The Magazine level sitemap.xml would include the details of threads posted.
Ignoring the fact that https://kbin.social/m/drupal is hosted on kbin.social for the moment... if https://kbin.social/m/drupal was the only magazine that opted in, the root level sitemap.xml file at https://kbin.social/sitemap.xml would look like...
The magazine level sitemap.xml at https://kbin.social/m/drupal/sitemap.xml would look like...
The priority for each magazine could be calculated using pinned and votes. Changefreq would be based on replies and voting in that thread.
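As a rough illustration of the two levels described above, the following Python sketch builds a sitemap index and a magazine-level urlset with the standard library. Only the sitemaps.org namespace and element names come from the sitemap protocol; the thread URL pattern (`/m/<magazine>/t/<id>`), the dictionary keys, and the `priority` and `changefreq` formulas and thresholds are placeholders I made up to show the idea, not a proposal for the final values.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'


def priority(pinned, votes, max_votes):
    """Hypothetical formula: pinned threads get 1.0, others scale with votes."""
    if pinned:
        return "1.0"
    share = votes / max_votes if max_votes > 0 else 0.0
    share = max(0.0, min(1.0, share))  # clamp, votes may be negative
    return f"{0.1 + 0.8 * share:.1f}"


def changefreq(recent_activity):
    """Hypothetical thresholds on replies plus votes over the last week."""
    if recent_activity >= 50:
        return "hourly"
    if recent_activity >= 5:
        return "daily"
    if recent_activity >= 1:
        return "weekly"
    return "monthly"


def sitemap_index(base_url, magazine_names):
    """Root-level index pointing at one sitemap per opted-in magazine."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in magazine_names:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/m/{name}/sitemap.xml"
    return XML_HEADER + ET.tostring(root, encoding="unicode")


def magazine_sitemap(base_url, magazine, threads):
    """Magazine-level urlset with one <url> per thread."""
    max_votes = max((t["votes"] for t in threads), default=0)
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for t in threads:
        url = ET.SubElement(root, "url")
        # Assumed thread URL pattern; the real route may differ.
        ET.SubElement(url, "loc").text = f"{base_url}/m/{magazine}/t/{t['id']}"
        ET.SubElement(url, "lastmod").text = t["lastmod"]
        ET.SubElement(url, "changefreq").text = changefreq(t["recent_activity"])
        ET.SubElement(url, "priority").text = priority(t["pinned"], t["votes"], max_votes)
    return XML_HEADER + ET.tostring(root, encoding="unicode")


index_xml = sitemap_index("https://kbin.social", ["drupal"])
threads = [
    {"id": 1, "lastmod": "2024-03-01", "votes": 10, "pinned": True, "recent_activity": 60},
    {"id": 2, "lastmod": "2024-02-20", "votes": 5, "pinned": False, "recent_activity": 0},
]
magazine_xml = magazine_sitemap("https://kbin.social", "drupal", threads)
```

The thread data here is invented sample input; in practice it would come from dedicated queries inside the application rather than from the API.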
Describe alternatives you've considered
My interest in this request is for a very specific use case, but when I started looking into this I found someone else had already opened the feature request in https://codeberg.org/Kbin/kbin-core/issues/1305. I started looking into some of the options for generating sitemap.xml files with modern PHP/Symfony, but never got a response from the KBin community on which direction would align with the project's architecture... so now I'm asking the same questions here.
https://keeplearning.dev/generate-sitemap-in-symfony-6-6068c37225 gives a good, high-level overview of bundle vs. custom controller approaches. I know nothing about these bundles or the Mbin project's preferred approach to a feature like this, but I'm willing to volunteer a few cycles to move this forward if someone more familiar with the project is willing to point me in the right direction.
While I think I could get all the information I need to generate the sitemap.xml from instances that have the API enabled like https://kbin.melroy.org/api/magazines?p=1&perPage=48&sort=hot&federation=local&hide_adult=hide and https://kbin.melroy.org/api/magazine/25/entries?sort=hot&time=%E2%88%9E&p=1&perPage=25&usePreferredLangs=false and generate the files with a service outside the MBin codebase, that's a really inefficient way to generate those files on a low traffic instance.
Additional context
If someone points me in the right direction, I'm happy to take a stab at this.