page can not be crawled due to robots.txt

pvgenuchten commented 5 years ago

We have set up Google crawling on demo.pygeoapi.io to research crawler behaviour on pygeoapi. The first results are available, but they puzzle me a bit.

Google generally crawls pygeoapi pages correctly. One can indeed find demo pygeoapi results at, for example, https://www.google.com/search?q=site%3Ademo.pygeoapi.io. However, there are no results yet at https://toolbox.google.com/datasetsearch/search?query=site%3Ademo.pygeoapi.io

A weird thing: running a 'live test' (a feature of Google Search Console) on https://demo.pygeoapi.io/master/collections/lakes returns this error: "url not available to google, blocked by robots.txt"

However, https://demo.pygeoapi.io/master/collections/lakes?f=html runs fine in the 'live test'. This makes me wonder: does the 'live test' crawler send a proper Accept header?
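One way to check the Accept header hypothesis outside Search Console is to request the same URL with different `Accept` values and compare the status and content type that come back. A minimal sketch using only the Python standard library; the helper names `build_probe` and `probe` and the sample `Accept` values are illustrative assumptions, not what Google's crawler actually sends:

```python
import urllib.error
import urllib.request

# Illustrative Accept values (assumptions); the header the live-test
# crawler really sends is not known from this thread.
ACCEPT_VARIANTS = {
    "browser-like": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "wildcard": "*/*",
    "json": "application/json",
}


def build_probe(url, accept):
    """Build a GET request carrying the given Accept header (nothing is sent yet)."""
    return urllib.request.Request(url, headers={"Accept": accept}, method="GET")


def probe(url, accept, timeout=10):
    """Send the request and return (HTTP status, Content-Type header)."""
    try:
        with urllib.request.urlopen(build_probe(url, accept), timeout=timeout) as resp:
            return resp.status, resp.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report them the same way.
        return err.code, err.headers.get("Content-Type")
```

Calling `probe("https://demo.pygeoapi.io/master/collections/lakes", accept)` for each value in `ACCEPT_VARIANTS` would show whether the response differs by header. It would not reveal how Search Console itself decides that a URL is blocked.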

Another thing to improve is that https://demo.pygeoapi.io/robots.txt does not return a proper robots.txt file, but instead a custom file-not-found page (with HTTP status 200!).
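The robots.txt symptom (an HTML not-found page served with status 200) can be detected with a small heuristic check on the response. This is only a sketch: `check_robots_response` is a hypothetical helper, and the rules it applies (plain-text content type, at least one known directive) are assumptions for diagnosis, not a full robots.txt validator:

```python
def check_robots_response(status, content_type, body):
    """Return a list of problems with a robots.txt response (empty list: looks fine)."""
    problems = []
    if status != 200:
        problems.append(f"unexpected HTTP status {status}")

    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type != "text/plain":
        problems.append(f"content type is {media_type or 'missing'}, expected text/plain")

    # Heuristic: a real robots.txt normally has at least one known directive.
    # An intentionally empty file would also be flagged by this rule.
    directives = ("user-agent", "allow", "disallow", "sitemap", "crawl-delay")
    has_directive = False
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, _ = line.partition(":")
        if sep and key.strip().lower() in directives:
            has_directive = True
            break
    if not has_directive:
        problems.append("no robots.txt directives found")
    return problems
```

Fed the status, `Content-Type`, and body of the page described above, this would report the HTML content type and the missing directives even though the status is 200.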

Let me know if you have any ideas.

justb4 commented 5 years ago

This issue (and any code changes) really belongs in the pygeoapi demo site repo: https://github.com/geopython/demo.pygeoapi.io . The demo site is currently a Flask app, mainly for templating.

tomkralidis commented 4 weeks ago

@pvgenuchten is this still an issue?

geopython / demo.pygeoapi.io

page can not be crawled due to robots.txt #4