contao / core

Contao 3 → see contao/contao for Contao 4

Storage location of XML Sitemap should be freely determinable #8728

Open madmaharaja opened 7 years ago

madmaharaja commented 7 years ago

Currently, all XML sitemaps in Contao are automatically stored in the /share folder. For multi-site installations that use subfolders rather than different (sub)domains, this becomes a problem when submitting a sitemap to Google Webmaster Tools.

If you have the following structure:

www.example.com (main page)
www.example.com/location1/ (local site with its own root page)
www.example.com/location2/
etc.

All sitemaps will be stored in /share:

www.example.com/share/main_sitemap.xml
www.example.com/share/location1_sitemap.xml
etc.

The problem

If you have individual Google properties for each site and want to submit a sitemap for each of them to Google Webmaster Tools, the form looks like this screenshot:

screenshot-webmaster-tools

Currently the only workaround I could think of is to set a redirect in .htaccess:

RedirectPermanent /location1/location1_sitemap.xml /share/location1_sitemap.xml

However, I don't find this very convenient to do for every new location we add online.
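
To make the per-location workaround less tedious, the redirect lines could be generated rather than typed by hand. The following is only a minimal sketch: the function name `redirect_lines` and the alias list are hypothetical, the file naming follows the `<alias>_sitemap.xml` pattern from the example above, and the source path is written with a leading slash because Apache's Redirect directive expects a URL path that starts with one.

```python
def redirect_lines(aliases, share_dir="/share"):
    """Build one RedirectPermanent line per root page alias."""
    lines = []
    for alias in aliases:
        # Where the search console expects the file: inside the site's own subfolder.
        source = f"/{alias}/{alias}_sitemap.xml"
        # Where Contao actually writes it.
        target = f"{share_dir}/{alias}_sitemap.xml"
        lines.append(f"RedirectPermanent {source} {target}")
    return lines


if __name__ == "__main__":
    # Hypothetical aliases; replace them with the real root page aliases.
    print("\n".join(redirect_lines(["location1", "location2"])))
```

The output still has to be pasted into .htaccess for every new location, so this only reduces typing; it does not remove the maintenance step.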

My suggestion

Give the option to select an alternative folder where the sitemap can be stored (e.g. see illustration):

screenshot-xml-speicherort

Or another alternative: Override storing a sitemap in the standard /share folder by typing in folder + name of sitemap into the respective field:

sitemap-loesungsansatz2

fritzmg commented 7 years ago

> Currently the only workaround I could think of is to set a redirect in .htaccess:
>
> RedirectPermanent /location1/location1_sitemap.xml /share/location1_sitemap.xml
>
> However, I don't find this very convenient to do for every new location we add online.

I think that's sufficient. In a multi-domain installation you might want to add redirects for sitemap.xml anyway, if you want search engines to be able to find the correct sitemap.xml for each domain automatically, e.g.

http://www.example1.org/sitemap.xml
http://www.example2.org/sitemap.xml
http://www.example3.org/sitemap.xml

etc.

Also, why do you use these virtual subfolders instead of (sub)domains? How does that even work within Contao? I don't think that's a supported use case in general?

madmaharaja commented 7 years ago

Yeah, that's how I do it (redirects); it's just not very user-friendly and it "blows up" the .htaccess over time.

I use virtual subfolders instead of subdomains for SEO reasons. Our main domain has been around for quite a while and has earned significant trust, backlinks and "ranking power" for certain keywords. Google used to treat subdomains pretty much as individual domains (so the subdomains hardly inherit any domain authority that the main domain has earned), while URLs with subfolders under the same domain benefitted much more from the authority of the main domain. We're ranked well for keywords x, y and z -- this way the new sites pretty quickly rank well for the keywords "location1 + x, y, z". Nowadays Google says it doesn't really make a difference anymore -- however, this is the system we started out with so that's why we're still operating that way. :-)

In Contao I simply set "example.com" as domain in every root page -- the "subfolder" is determined by the alias of the root page.

frontendschlampe commented 7 years ago

Check https://github.com/hofff/contao-robots-txt-editor in combination with https://github.com/hofff/contao-htaccess

There are some more problems with the sitemap:

  1. It's great to have an entry in robots.txt with the direct link to the sitemap
  2. You need a robots.txt for each domain
  3. You need redirects in .htaccess to access the various robots.txt files

We solve these three steps with the two extensions.

leofeyer commented 7 years ago

@madmaharaja Did you check the two extensions above?

KaiserCh commented 7 years ago

Has anybody considered that placing the sitemap.xml in a subdirectory of the webroot it is used for violates the standard? https://www.sitemaps.org/protocol.html#location

A solution like the one used in Contao always requires either a symlink or a redirect. Maybe the cross-submit rule also applies to subfolders, so using a modified robots.txt would work, too. But anyway, relying on either extensions or on administrators actively working around things doesn't feel right.
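
The location rule referenced above can be illustrated with a small check. This is a sketch of how the location section of the sitemaps.org protocol reads (a sitemap may only list URLs on the same scheme and host that start with the sitemap's own directory), not an official validator; the function names are hypothetical, and the cross-submit exception via robots.txt is deliberately left out.

```python
import posixpath
from urllib.parse import urlsplit


def sitemap_scope(sitemap_url):
    """Return (scheme, host, directory) that a sitemap at this URL may cover."""
    parts = urlsplit(sitemap_url)
    directory = posixpath.dirname(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    return parts.scheme, parts.netloc, directory


def url_in_scope(sitemap_url, page_url):
    """True if page_url lies at or below the sitemap's own directory."""
    scheme, host, directory = sitemap_scope(sitemap_url)
    page = urlsplit(page_url)
    path = page.path or "/"  # treat a bare host as the root path
    return (page.scheme, page.netloc) == (scheme, host) and path.startswith(directory)


if __name__ == "__main__":
    page = "http://www.example.com/location1/about.html"
    # Sitemap stored in /share: the page is outside its directory.
    print(url_in_scope("http://www.example.com/share/location1_sitemap.xml", page))
    # Sitemap stored in the webroot: every page on the host is in scope.
    print(url_in_scope("http://www.example.com/sitemap.xml", page))
```

Under this reading, a sitemap in /share cannot cover pages under /location1/ without a redirect, a symlink, or cross-submission.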

fritzmg commented 7 years ago

So actually, /sitemap.xml should be a route that returns the appropriate sitemap depending on the domain.
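
As a rough illustration of that idea, a handler for /sitemap.xml could look up the root page by the request host and render only that site's URLs. This is a language-neutral sketch in Python, not Contao code: `RootPage`, `find_root_page`, `render_sitemap` and `handle_sitemap_request` are hypothetical names, and the matching is by host only.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass(frozen=True)
class RootPage:
    """Hypothetical stand-in for a website root page."""
    id: int
    domain: str
    urls: tuple


def find_root_page(root_pages, host):
    """Pick the root page whose domain matches the request host (port ignored)."""
    hostname = host.split(":", 1)[0].lower()
    for page in root_pages:
        if page.domain.lower() == hostname:
            return page
    return None


def render_sitemap(urls):
    """Render a minimal urlset document for the given absolute URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}"
        "</urlset>\n"
    )


def handle_sitemap_request(root_pages, host):
    """Return (status, body) for a request to /sitemap.xml on the given host."""
    page = find_root_page(root_pages, host)
    if page is None:
        return 404, ""
    return 200, render_sitemap(page.urls)
```

For example, with root pages for www.example1.org and www.example2.org, a request with the host www.example2.org would return only the second site's URLs, and an unknown host would get a 404.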

Toflar commented 7 years ago

> So actually, /sitemap.xml should be a route that returns the appropriate sitemap depending on the domain.

Yes. That's something we should have by default. It makes no sense to enable a sitemap by checkbox etc. We just have to make sure the correct one is output. That would be a superb feature ;)

fritzmg commented 7 years ago

Indeed :). Also - couldn't the (appropriate) sitemap simply be generated on the fly within that route instead of going through the trouble of generating the XML files in the cron whenever there was a change? On large sites this can cause memory overflow problems and (as discussed in https://github.com/contao/check/issues/134) its generation blocks the response in Contao 4 (if you do not use php-fpm).

Toflar commented 7 years ago

Yeah, it can be generated on demand, but obviously not every time it is requested. So I'd still cache it somewhere in /cache/contao/sitemaps or so (with sitemap_<root_page_id>.xml maybe?) and just delete it when needed (pages updated etc. = same routine as we already have). Would you work on something like this?
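
The generate-on-demand, cache, and delete-on-change idea could be sketched roughly as follows. This is not Contao code: the class `SitemapCache` and the `generate` callback are hypothetical, and the cache directory is a temporary folder here rather than the /cache/contao/sitemaps path suggested above.

```python
import tempfile
from pathlib import Path


class SitemapCache:
    """File cache with one sitemap_<root_page_id>.xml file per root page."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, root_page_id):
        return self.cache_dir / f"sitemap_{root_page_id}.xml"

    def get(self, root_page_id, generate):
        """Return the cached XML, calling generate(root_page_id) only on a miss."""
        path = self._path(root_page_id)
        if not path.exists():
            path.write_text(generate(root_page_id), encoding="utf-8")
        return path.read_text(encoding="utf-8")

    def invalidate(self, root_page_id):
        """Delete the cached file, e.g. after a page was updated."""
        self._path(root_page_id).unlink(missing_ok=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = SitemapCache(tmp)
        xml = cache.get(1, lambda pid: f"<urlset><!-- root page {pid} --></urlset>")
        print(xml)
        cache.invalidate(1)  # the next get() regenerates the file
```

The expensive generation then runs once per change instead of once per request or once per cron run.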

fritzmg commented 7 years ago

I would like to - unfortunately we are overbooked currently ...

Toflar commented 7 years ago

@leofeyer can you move that to contao/core-bundle please? Because it's not going to change for Contao 3.5 anyway but would be a super nice addition to any future Contao 4 version.

leofeyer commented 7 years ago

There is no need to move the ticket. Do you want me to assign it to you?

Toflar commented 7 years ago

Talked to @frontendschlampe about it, maybe they'll be working on a PR :)

frontendschlampe commented 7 years ago

I've talked to @Toflar via Mumble, because we're currently updating our hofff/contao-robots-txt-editor and hofff/contao-htaccess. If you want, we will make a PR for this:

For a website with various languages under the same domain, there will be a sitemap for every language (maybe we add the language to the sitemap name) and one robots.txt with the absolute paths to every sitemap. I hope I described it correctly. :-)

/cc @cliffparnitzky
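
For illustration, a robots.txt that lists one sitemap per language might be assembled like this. It is only a sketch of the idea described above, not the behaviour of the hofff extensions; the function name `robots_txt` and the file names in the usage example are hypothetical. The `Sitemap:` lines require fully qualified URLs, which is why relative paths are rejected.

```python
def robots_txt(sitemap_urls, user_agent="*", disallow=()):
    """Build robots.txt content with one Sitemap line per absolute sitemap URL."""
    for url in sitemap_urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"Sitemap URLs must be absolute: {url!r}")
    lines = [f"User-agent: {user_agent}"]
    if disallow:
        lines.extend(f"Disallow: {path}" for path in disallow)
    else:
        lines.append("Disallow:")  # an empty value allows everything
    lines.append("")
    lines.extend(f"Sitemap: {url}" for url in sitemap_urls)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical per-language sitemap names.
    print(robots_txt([
        "http://www.example.com/share/sitemap_de.xml",
        "http://www.example.com/share/sitemap_en.xml",
    ]))
```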

leofeyer commented 7 years ago

Very good, except the "create a robots.txt file" part. We have discussed this several times and decided not to mess with user-generated files.

KaiserCh commented 7 years ago

Should the URL limit per sitemap be considered? A sitemap may not contain more than 50,000 URLs. Are use cases like a huge news portal, shop (e.g. Isotope), music catalogue, ... with more than 50k "objects" relevant?

Toflar commented 7 years ago

If it is a route, we do not mess with it at all :) It's sort of a fallback. If you upload a robots.txt, Apache (or whatever server) will serve it, and otherwise rewrite to app.php and thus Contao 😄 It's a wonderful concept because you get a sane default without doing anything at all and if you want to, you can :)
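
The fallback behaviour described here (a real file wins, otherwise the application answers) can be sketched as follows. In practice the web server's rewrite rules do this, not application code; the function `resolve` and the `app` callback are hypothetical and only model the decision.

```python
import tempfile
from pathlib import Path


def resolve(request_path, document_root, app):
    """Serve an uploaded file if it exists, otherwise fall back to the app route."""
    root = Path(document_root).resolve()
    candidate = (root / request_path.lstrip("/")).resolve()
    # Only serve real files that live inside the document root.
    if root in candidate.parents and candidate.is_file():
        return candidate.read_text(encoding="utf-8")
    return app(request_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as docroot:
        generated = lambda path: f"generated by the application for {path}"
        print(resolve("/robots.txt", docroot, generated))  # no file yet: app answers
        Path(docroot, "robots.txt").write_text("User-agent: *\n", encoding="utf-8")
        print(resolve("/robots.txt", docroot, generated))  # uploaded file wins
```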

frontendschlampe commented 7 years ago

> Should the URL limit per sitemap be considered? A sitemap may not contain more than 50,000 URLs. Are use cases like a huge news portal, shop (e.g. Isotope), music catalogue, ... with more than 50k "objects" relevant?

Yes ... we will. Should we use the full 50,000 URL limit or a lower one? Maybe 20,000?

Toflar commented 7 years ago

Google recommends splitting them up (did not check how exactly) and I'm sure there's some recommendation on the threshold somewhere :)

frontendschlampe commented 7 years ago

I will check!

ghost commented 7 years ago

The sitemap is split into several files, and a sitemap index file is built that references them.
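
A minimal sketch of that approach: split the URL list into chunks of at most 50,000 entries and list the resulting files in a sitemap index. The function names and the file naming in the usage note are hypothetical; the limit and the `sitemapindex` element come from the sitemaps.org protocol.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # protocol limit per sitemap file


def split_urls(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split the URLs into consecutive chunks of at most `limit` entries."""
    urls = list(urls)
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


def render_index(sitemap_urls):
    """Render a sitemap index that points at the individual sitemap files."""
    entries = "".join(
        f"  <sitemap><loc>{escape(url)}</loc></sitemap>\n" for url in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n'
        f"{entries}"
        "</sitemapindex>\n"
    )
```

With 120,000 URLs, `split_urls` yields three chunks (50,000, 50,000 and 20,000 entries), and `render_index` would then list three files, for example sitemap_1.xml to sitemap_3.xml. A lower limit such as 20,000 can be passed as `limit`.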


aschempp commented 7 years ago

> So I'd still cache it somewhere in /cache/contao/sitemaps or so (with sitemap_<root_page_id>.xml maybe?) and just delete it when needed (pages updated etc. = same routine as we already have).

Please use the existing cache! The response simply needs appropriate cache headers, and everything's taken care of 😉 . No need to store the files anywhere. I've used debril/rss-atom-bundle to create something like this, though I'm not sure they support sitemap XMLs. But the principles are exactly the same.
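
To illustrate the cache-header approach, the sitemap response could carry `Cache-Control` and `ETag` headers so that an HTTP cache in front of the application stores it. This is a framework-neutral sketch, not the debril/rss-atom-bundle or Contao API: the function `sitemap_response` is hypothetical and the one-hour lifetime is an arbitrary assumption.

```python
import hashlib


def sitemap_response(xml, request_headers=None, max_age=3600):
    """Return (status, headers, body) with headers that let HTTP caches store the sitemap."""
    body = xml.encode("utf-8")
    # The ETag is derived from the content, so it changes whenever the sitemap changes.
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "Content-Type": "application/xml; charset=UTF-8",
        # max-age applies to all caches, s-maxage to shared caches such as a reverse proxy.
        "Cache-Control": f"public, max-age={max_age}, s-maxage={max_age}",
        "ETag": etag,
    }
    # A client that already holds this version gets an empty 304 response.
    if (request_headers or {}).get("If-None-Match") == etag:
        return 304, headers, b""
    return 200, headers, body
```

With headers like these, no sitemap file has to be written to disk; the cache layer holds the response until it expires or is purged.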