digital-land / submit


Add robots.txt #318

Open rosado opened 3 months ago

rosado commented 3 months ago

We should block certain web crawlers from putting extra load on the database. We need to make a list of URLs that should be excluded from crawling and add them to a robots.txt file.
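A minimal sketch of what the file could look like, using placeholder paths since the actual list of excluded URLs is still to be drawn up:

```
# robots.txt (sketch only) - paths below are placeholders, not the agreed list
User-agent: *
Disallow: /example-expensive-path/
Disallow: /another-path-to-exclude/
```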

DilwoarH commented 2 months ago

Some additional things to think about:

stevenjmesser commented 2 months ago

Initial thought: could we disallow anything after /organisations/local-authority:AAA/? That would mean the organisation finder and an organisation's dashboard are searchable, but anything 'downstream' of that isn't crawled, indexed or searchable.
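A sketch of that rule, assuming crawlers honour the `*` wildcard and the `$` end-of-URL anchor (extensions to the original robots.txt standard that Google, Bing and most major crawlers support):

```
User-agent: *
# Block everything downstream of an organisation's dashboard...
Disallow: /organisations/*/
# ...but keep the dashboard page itself crawlable: "$" anchors the end of
# the URL, and the longer, more specific Allow rule wins for that exact URL.
Allow: /organisations/*/$
```

Crawlers that ignore the wildcard extensions would treat these as literal paths, so the effect would need verifying against the crawlers actually causing the load.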

stevenjmesser commented 2 months ago

Thanks for these thoughts, @DilwoarH! Some follow-up comments:

  • If our objective with this project is to make data more available, then pages that include data should not be excluded in robots.txt.

Ideally we want users to find and get the data on the main site or API, e.g. https://www.planning.data.gov.uk/dataset/conservation-area

  • If crawlers are causing load problems, then we should fix the scaling issue.
  • We should exclude pages that are transactional, like the check service, as these rely on user input.
  • The start page of the check service should not be excluded (see the sketch below).

Agree on these points.
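On those last two points, a hedged sketch of how the check service could be handled, assuming for illustration that it lives under /check/ with its start page at /check/ itself (the real URL structure may differ):

```
User-agent: *
# Block the transactional steps of the check service (hypothetical path)...
Disallow: /check/
# ...while keeping its start page indexable; "$" matches the URL end, so
# only /check/ itself is allowed and everything beneath it stays blocked.
Allow: /check/$
```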