ckan / ckanext-dcat

CKAN ♥ DCAT

Datasets not found on Google Dataset Search #199

Open maxclac opened 2 years ago

maxclac commented 2 years ago

Hi,

I am running CKAN 2.9.2 on Ubuntu 20 and have installed the DCAT plugin. I followed the instructions in the README (activating the structured_data and dcat plugins) so that my datasets would be discovered by Google Dataset Search, but this has not happened so far.

What could I be missing?

Best regards

metaodi commented 2 years ago

Did you verify that the structured data is generated in the frontend (i.e. view the page source and check for a JSON-LD block)? Maybe you have customized your frontend?

Then you could check if the schema validator indicates any errors for your domain (test with the URL of a dataset).
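For anyone following along, the "view source" check can also be scripted. Below is a minimal sketch using only the Python standard library. It assumes the structured data sits in a `<script type="application/ld+json">` element and that dataset nodes carry an `@type` ending in `Dataset`, possibly inside an `@graph` list. The helper names `extract_jsonld` and `dataset_nodes` are made up for illustration and are not part of ckanext-dcat.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.raw_blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.raw_blocks.append("".join(self._buffer))


def extract_jsonld(html):
    """Return every JSON-LD block in the page, parsed into Python objects.

    Raises json.JSONDecodeError if a block is not valid JSON.
    """
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [json.loads(raw) for raw in parser.raw_blocks]


def dataset_nodes(blocks):
    """Return the nodes whose @type ends in "Dataset" (e.g. "schema:Dataset")."""
    found = []
    for block in blocks:
        # A block may be a single node, a list of nodes, or wrap them in @graph.
        nodes = block.get("@graph", [block]) if isinstance(block, dict) else block
        for node in nodes:
            if not isinstance(node, dict):
                continue
            types = node.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if any(t.split(":")[-1].split("/")[-1] == "Dataset" for t in types):
                found.append(node)
    return found
```

Feed it the HTML of one dataset page (saved from the browser or fetched with `urllib.request`). An empty result from `dataset_nodes(extract_jsonld(html))` would suggest the template is not emitting the block at all.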

maxclac commented 2 years ago

Hi @metaodi, and thank you for your answer. The validator does not indicate any errors, and my URLs seem to be correct.

anuveyatsu commented 2 years ago

We also had some issues with indexing datasets by Google Dataset Search. Only a few datasets get indexed.

sagargg commented 2 years ago

Maybe Google Dataset Search requires a standard JSON-LD structure for indexing: https://developers.google.com/search/docs/advanced/structured-data/dataset#example

metaodi commented 2 years ago

@sagargg this is exactly what this extension provides. But it's hard to tell what went wrong with no further details.

maxclac commented 2 years ago

Thank you @metaodi for your answer.

The JSON-LD is correctly formed. As I have no prior experience with letting crawlers access a website, I was not aware that I needed to take care of a robots.txt file and a sitemap. I realized it is important to read the Google Search guidelines before using the extension. Are there CKAN-specific instructions for setting up a robots.txt and a sitemap?

metaodi commented 2 years ago

No, there is nothing CKAN-specific. We use this extension on the open data catalogue of the City of Zurich, and it works for us.

See the Google Dataset Search help page for specific instructions: https://datasetsearch.research.google.com/help

Hope this helps.
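In case it is useful, here is a rough sketch of how a sitemap for dataset pages could be built by hand. It assumes dataset pages live under `/dataset/<name>` and that the names come from CKAN's `package_list` API action. The function names and `example.org` are placeholders, and this is not something ckanext-dcat provides.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def fetch_dataset_names(site_url):
    """Ask the CKAN action API for the names of all public datasets."""
    api_url = site_url.rstrip("/") + "/api/3/action/package_list"
    with urllib.request.urlopen(api_url) as response:
        return json.load(response)["result"]


def build_sitemap(site_url, dataset_names):
    """Return sitemap XML with one <url> entry per dataset page."""
    base = site_url.rstrip("/")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for name in dataset_names:
        url = ET.SubElement(urlset, "url")
        # Assumes the default CKAN route for dataset pages.
        ET.SubElement(url, "loc").text = f"{base}/dataset/{name}"
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Placeholder dataset names; swap in fetch_dataset_names("https://...").
    print(build_sitemap("https://example.org", ["air-quality", "bike-counts"]))
```

The resulting file would then be served from the site and referenced with a `Sitemap:` line in robots.txt or submitted through Google Search Console.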

maxclac commented 2 years ago

Thanks! Is a robots.txt really needed? I thought that, when none is given, Google would just crawl everything.

metaodi commented 2 years ago

No, it's not necessary. But since I don't know your setup, it could be that an existing robots.txt is blocking the Google crawler.

Just something to keep in mind.
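A quick way to test this is Python's built-in `urllib.robotparser`. The sketch below is only an approximation: Python's parser may not match Google's own matching rules in every edge case (wildcards, for example). The `googlebot_allowed` helper and the sample rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def googlebot_allowed(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that hides dataset pages from every crawler.
blocking = "User-agent: *\nDisallow: /dataset/\n"
print(googlebot_allowed(blocking, "https://example.org/dataset/my-data"))  # False

# An empty robots.txt places no restrictions on crawlers.
print(googlebot_allowed("", "https://example.org/dataset/my-data"))  # True
```

To check a live site, paste in the text served at `/robots.txt` on your domain, or use `RobotFileParser.set_url()` and `read()` to fetch it directly.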

maxclac commented 2 years ago

I see. I am not aware of any pre-existing robots.txt in my CKAN instance. Maybe if I explicitly add one, the indexing will work.