rosado opened this issue 3 months ago
Some additional things to think about:
Initial thought: could we disallow anything after `/organisations/local-authority:AAA/`? That'd mean the organisation finder and an organisation's dashboard are searchable, but anything 'downstream' of that isn't crawled, indexed or searchable.
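A sketch of what that could look like in robots.txt, assuming crawlers that support RFC 9309 wildcard matching (not every crawler does), with the `AAA` authority code standing in for any organisation:

```text
User-agent: *
# The organisation finder (/organisations/) and each organisation's
# dashboard (/organisations/local-authority:AAA) stay crawlable.
# Anything 'downstream' of an organisation page is blocked:
# the * wildcard matches any organisation identifier, and the trailing
# slash means only sub-pages beneath it are excluded.
Disallow: /organisations/*/
```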
Thanks for these thoughts, @DilwoarH! Some follow-up comments:
- If our objective with this project is to make data more available, then no pages that include data should be excluded in robots.txt.
Ideally we want users to find and get the data on the main site or API, e.g. https://www.planning.data.gov.uk/dataset/conservation-area
- If crawlers are causing load problems then we should fix the scaling issue
- We should exclude pages that are transactional, like the check service, as these rely on user input.
- The start page of the check service should not be excluded
Agree on these points.
We should block certain web crawlers from putting extra load on the database. We need to make a list of URLs that should be excluded from crawling and add them to the robots.txt file.
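As a starting point, the agreed points above could be expressed like this. The check-service paths are hypothetical placeholders and would need confirming against the real URLs before shipping; the `$` end-of-path anchor and `Allow` directive are defined in RFC 9309 and honoured by major crawlers, but are not guaranteed for all:

```text
User-agent: *
# Keep the check service's start page crawlable
# (hypothetical path; confirm the real URL).
Allow: /check$
# Exclude the transactional pages downstream of it,
# which rely on user input.
Disallow: /check/
# Everything else, including all data pages such as
# https://www.planning.data.gov.uk/dataset/conservation-area,
# remains crawlable by default.
```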