Open dpancic opened 4 years ago
In GitLab by @stefanprobst on Jun 23, 2020, 09:29
fixed by d7a19a95
In GitLab by @stefanprobst on Jun 23, 2020, 09:29
closed
In GitLab by @KlausIllmayer on May 20, 2021, 15:45
We found out yesterday that the development instance is visible in Google Search results. According to Google, disallowing search engines in robots.txt does not prevent the site URL from appearing in search results (it only prevents the description from being shown). Instead, it would be necessary to remove the robots.txt Disallow rules and use noindex,
either as an HTTP header or as a meta tag (see https://developers.google.com/search/docs/advanced/crawling/block-indexing). I'm not sure whether other search engines follow the same rules.
Asking @vronk @laureD19 @stefanprobst whether we should apply this Google rule or leave it as it is. I opt for leaving it as it is.
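For reference, the noindex mechanism Google describes in the linked page could look like this (a sketch; where exactly the tag is set depends on the frontend setup):

```html
<!-- Per-page meta tag in the document <head>.
     Note: crawling must be ALLOWED (no robots.txt Disallow)
     for Google to see this tag at all. -->
<meta name="robots" content="noindex">
```

The equivalent HTTP response header is `X-Robots-Tag: noindex`, which also works for non-HTML resources such as PDFs.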
In GitLab by @stefanprobst on May 20, 2021, 15:51
do you have a screenshot or example search?
In GitLab by @vronk on May 20, 2021, 15:52
Hm, interesting. But if the actual content is not indexed, just the site url, then ok.
Speaking of which, the robots.txt on our production server at https://marketplace.sshopencloud.eu/robots.txt has the same Disallow rule – I expect this will stay the same while we're in Beta, right? (We just shouldn't forget to change that rule in the final. 😊)
In GitLab by @KlausIllmayer on May 20, 2021, 15:53
Not so easy without exposing the URL to the public via GitLab ;) If I search Google for "sshoc marketplace", it shows me the development version on the second page (this could differ depending on your search history, so try a private window).
In GitLab by @stefanprobst on May 20, 2021, 15:55
for me this is the 15th match (in a private window):
In GitLab by @stefanprobst on May 20, 2021, 15:58
Also FYI: we do define a sitemap in robots.txt which always points to the prod instance, but it shouldn't be indexed because of the Disallow rule.
I don't think it's a huge deal to have the URL in the results, especially since it will rank lower as soon as the final release is live with proper canonical URLs.
@vronk yes, the plan was to remove the disallow rule on final release.
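Putting the comments above together, the Beta-phase robots.txt presumably looks roughly like this (a sketch; the exact sitemap path is an assumption):

```
# Beta: block crawling for all bots. Note: this hides page descriptions,
# but does not keep the bare URL out of Google results (hence the noindex discussion).
User-agent: *
Disallow: /

# Sitemap reference, always pointing to the prod instance
Sitemap: https://marketplace.sshopencloud.eu/sitemap.xml
```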
We wanted to have a robots.txt for the final release (see the previous comment), but it seems it does not exist: https://marketplace.sshopencloud.eu/robots.txt returns a 404. @stefanprobst can we add a robots.txt to the production instance?
Had a talk with Stefan: it is not so much about the robots.txt itself; the main motivation is a better SEO (search engine optimization) result, and we thought a robots.txt might help with that. But it seems we need to dig deeper into this. The current state is very disappointing. Searching Google for all results from the Marketplace (enter site:marketplace.sshopencloud.eu
in the search bar) returns only 112 results. Static pages seem to be indexed, but dynamic content (= items) is either missing or quite old (from 2020). It is unclear why we are in this state. Interestingly, it is much the same for DuckDuckGo and Bing (though Bing has somewhat more results: 636).
@laureD19 Stefan proposed handing over the current connection to Google's webmaster analysis tool (Search Console) to a DARIAH account. Do you know if there is such an account? We may also need to look a little deeper into SEO to understand why the Marketplace is covered so badly.
Section 3.2 of https://arxiv.org/ftp/arxiv/papers/1706/1706.05089.pdf might be useful. Google's indexing strategy, however, often remains a black box.
— Dieter Van Uytvanck, Technical Director, CLARIN ERIC (www.clarin.eu)
Thanks for the pointer! Indeed we observe something similar, and yes, we may need to invest in creating a sitemap.
Move the registration of SSHOMP in the Google Search Console to dariah-eric.eu (managed by Arnaud and Matej); it is currently registered by Stefan.
In the next step: generate a dynamic sitemap and feed it to Google.
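The dynamic sitemap step could be sketched as follows. This is a minimal illustration, not the project's actual implementation: the item routes are hypothetical, and a real version would page through the Marketplace API to collect item URLs.

```python
# Sketch: build a sitemap.xml for Marketplace items.
# BASE_URL matches the production instance mentioned in this thread;
# the item paths below are hypothetical examples.
from xml.etree import ElementTree as ET

BASE_URL = "https://marketplace.sshopencloud.eu"


def build_sitemap(paths):
    """Return a sitemap.xml string for the given URL paths."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for path in paths:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = f"{BASE_URL}{path}"
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)


if __name__ == "__main__":
    # Hypothetical item routes, for illustration only
    print(build_sitemap(["/tool-or-service/abc123", "/dataset/xyz789"]))
```

The generated file would then be served at the path referenced by the `Sitemap:` line in robots.txt and submitted via the Search Console.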
In GitLab by @KlausIllmayer on Jun 22, 2020, 18:02
As this is the alpha release, we don't want the SSHOC MP to show up in search engines. Therefore a robots.txt that disallows every bot should be put into the production frontend. Blocking will likely be lifted with the Beta release.
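A minimal robots.txt that disallows every (well-behaved) bot, as requested here, would look like this:

```
# Alpha deployment: block all crawlers
User-agent: *
Disallow: /
```

Note that this only stops compliant crawlers from fetching pages; as discussed later in this thread, it does not by itself keep already-known URLs out of search results.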