cessda / cessda.cdc.versions

Issue track and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Submit sitemap for Google data search #89

Closed: cessda-bitbucket-importer closed this issue 5 years ago

cessda-bitbucket-importer commented 5 years ago

Original report on BitBucket by John Shepherdson (GitHub: john-shepherdson).


“Please ensure that you have submitted your sitemap to the search console to have our Googlebot begin crawling your pages for markup. I have included documentation on how to do this as reference: https://support.google.com/webmasters/answer/183668?hl=en. Please note that it may take some time before your data will appear on the feature.

In addition, as the feature has officially launched, it would be best that any further questions or concerns that you have are posted on our webmaster forums where you will have a wealth of information and resources to answer any potential questions you may have. I have included a link to the forum below for your reference: https://support.google.com/webmasters/community”

Structured data testing tool:

https://search.google.com/structured-data/testing-tool

cessda-bitbucket-importer commented 5 years ago

Original comment by Ashley Fox.


Regarding structured data, this should be working and will be picked up once Google indexes the site.

As for indexing, this is unfortunately not going to be straightforward. The CESSDA Data Catalogue is not a website that can be easily indexed; it is a search engine. You are asking one search engine (Google) to index another (the Data Catalogue).

The UI does not link to every possible indexed study in its markup, nor would you want it to with many thousands of records. This means those studies cannot be picked up by Google's crawler; Google will only index our landing page and the handful of records that load by default with an empty search.

If you would like Google to index every single study, we will need to handle this outside the UI by generating multiple XML sitemaps that are linked from a robots.txt file in the website root directory. I say multiple because we will presumably have thousands of records, and there is a limit of 50,000 URLs and 50 MB per sitemap file (see the limits and splitting-sitemaps sections in Google's documentation).

This would work; however, the sitemaps are static files, so you would need to automate regenerating them whenever records are added, modified or removed from the Elasticsearch indices. You would also want to make sure that deployment scripts don't erase the sitemap files when new builds are released to production.
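To make the approach above concrete, here is a minimal sketch of such a generator, assuming a Python script that uses the official elasticsearch client. The Elasticsearch endpoint, index name and record URL pattern are assumptions for illustration only and would need to match the real CDC setup.

```python
# Sketch only: ES_HOST, INDEX and DETAIL_URL are assumptions, not the real CDC values.
from elasticsearch import Elasticsearch, helpers

ES_HOST = "http://localhost:9200"            # assumed Elasticsearch endpoint
INDEX = "cmmstudy_en"                        # assumed index name (one per language)
BASE_URL = "https://datacatalogue.cessda.eu"
DETAIL_URL = BASE_URL + "/detail/{id}"       # hypothetical record URL pattern
MAX_URLS = 50_000                            # sitemap protocol limit per file


def write_sitemap(path, urls):
    """Write one <urlset> sitemap file containing the given URLs."""
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write(f"  <url><loc>{url}</loc></url>\n")
        f.write("</urlset>\n")


def main():
    es = Elasticsearch(ES_HOST)
    # Stream every study out of the index without loading everything into memory;
    # only the _id of each hit is used to build the record URL.
    hits = helpers.scan(es, index=INDEX, query={"query": {"match_all": {}}})

    sitemap_files, batch = [], []
    for hit in hits:
        batch.append(DETAIL_URL.format(id=hit["_id"]))
        if len(batch) == MAX_URLS:           # split before exceeding the 50,000-URL limit
            name = f"sitemap_{len(sitemap_files) + 1}.xml"
            write_sitemap(name, batch)
            sitemap_files.append(name)
            batch = []
    if batch:
        name = f"sitemap_{len(sitemap_files) + 1}.xml"
        write_sitemap(name, batch)
        sitemap_files.append(name)

    # Index file that ties the individual sitemaps together.
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in sitemap_files:
            f.write(f"  <sitemap><loc>{BASE_URL}/{name}</loc></sitemap>\n")
        f.write("</sitemapindex>\n")


if __name__ == "__main__":
    main()
```

A robots.txt file in the web root could then carry a single `Sitemap: https://datacatalogue.cessda.eu/sitemap_index.xml` line, and a run of this script would have to be triggered whenever the indices change, as noted above.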

@jws_mo I am happy to help if needed, but this falls outside the scope of the UI and would involve backend changes to Elasticsearch to get it to generate the required sitemaps. We'll need input from @doraVentures as well as those in charge of deployment/automation (for example, how do we get the sitemap files into the website root directory?).

cessda-bitbucket-importer commented 5 years ago

Original comment by Ashley Fox.


Awaiting feedback

cessda-bitbucket-importer commented 5 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Why is nothing as straightforward as it first appears? Based on feedback from Moses, I'll decide whether this is worth progressing in this phase of development.

cessda-bitbucket-importer commented 5 years ago

Original comment by Ashley Fox.


That always seems to be the case in my experience! In the meantime, I will make some tweaks to the anchor elements to make them more crawler-friendly. We might get fairly good index coverage (at least for English) just by allowing crawlers to navigate the result pagination links.

cessda-bitbucket-importer commented 5 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).


Hi both, @jws_mo and @AshleyFox,

Nothing more to add to Ashley's excellent points. I can only emphasise:

What we have is a search engine with dynamically generated content. You can obviously work around this, as Ashley alluded to, by building a service, but I would be cautious here:

  1. CDC is not a website and probably should not be treated as one. CDC is a curator of Records that are individually searchable and crawlable elsewhere.
  2. The Service Providers (SPs) already publish these Records as publicly available web pages, which I would imagine are already crawlable by Google and numerous other search engines for free.
  3. Here is Google's sitemap, with no landing pages for its searchable contents (Records).

cessda-bitbucket-importer commented 5 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Matthew has created a script that updates the sitemap pages of links for each of the languages (https://datacatalogue.cessda.eu/cmmstudy_XX_index.xml, where 'XX' is the two-letter ISO language code). It is run daily by a Jenkins job (https://jenkins.cessda.eu/view/DataCat/job/cessda.cdc.sitemapgenerator/). See also https://datacatalogue.cessda.eu/sitemap_index.xml

Note that the Jenkins job/script runs against all three instances of CDC, but the Google crawler can only see the production instance, due to the simple authentication challenge in front of the dev and staging sites.
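For reference, a minimal sketch of how the top-level sitemap_index.xml could enumerate those per-language files; the language list below is only an example, and the actual generator behind the Jenkins job may be structured quite differently.

```python
# Sketch only: the language codes are illustrative, not the full set CDC publishes.
BASE_URL = "https://datacatalogue.cessda.eu"
LANGUAGES = ["en", "de", "fi", "fr"]  # example ISO 639-1 codes


def write_sitemap_index(path="sitemap_index.xml"):
    """Write the top-level index that points at each per-language sitemap file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for code in LANGUAGES:
            f.write(f"  <sitemap><loc>{BASE_URL}/cmmstudy_{code}_index.xml</loc></sitemap>\n")
        f.write("</sitemapindex>\n")


if __name__ == "__main__":
    write_sitemap_index()
```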