pameyer closed this issue 7 years ago.
Using the previously commented-out list of recommended exclusions; unsure if that list should be updated or not.
Judging by updates to robots.txt on IQSS production, the list of recommended exclusions does need to be updated.
Slightly more detail on testing:
attempted to stress-test a development Dataverse install by exhaustive crawling with `wget -r "$DATAVERSE_HOST"`; wget respected robots.txt and did not download any files, reflecting that (polite) search engine crawlers would not download or index any dataverse or dataset pages. The updated robots.txt resolves this, with recursive wget downloading more than the index page.
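For context on why a polite crawler stops at a locked-down robots.txt: it checks each URL against the rules before fetching. This can be sketched with Python's standard-library `urllib.robotparser` (the robots.txt content below is a hypothetical "disallow everything" default matching the behavior discussed in this issue, not necessarily the exact file under test):

```python
from urllib import robotparser

# Hypothetical locked-down robots.txt (assumption: matches the
# "locked down by default" policy discussed in this issue).
LOCKED_DOWN = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(LOCKED_DOWN.splitlines())

# A polite crawler (recursive wget, search engine bots, etc.) checks
# every URL before fetching it; here nothing is allowed for any agent.
for path in ("/", "/dataverse.xhtml", "/dataset.xhtml"):
    print(path, rp.can_fetch("SomeBot", path))  # all False
```

This mirrors why `wget -r` downloaded nothing beyond robots.txt itself against the old configuration.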
Have discussed with Pete. By default we want robots.txt to be locked down, since we were getting crawlers on our test and dev environments. However, it may make sense to provide an example of a more open robots.txt somewhere in the guides, similar to what we have in production now:
```
User-agent: Googlebot
Allow: /$
Allow: /dataverse.xhtml
Allow: /dataset.xhtml
Disallow: /
Crawl-delay: 20

User-agent: *
Disallow: /
```
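These production-style rules can be sanity-checked with Python's stdlib `urllib.robotparser`. One caveat: the stdlib parser does not implement the `$` end-of-URL wildcard (a Google extension), so it treats `Allow: /$` literally and would report the homepage `/` as disallowed for Googlebot, even though Googlebot itself would treat it as allowed.

```python
from urllib import robotparser

# The production-style rules quoted above.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /$
Allow: /dataverse.xhtml
Allow: /dataset.xhtml
Disallow: /
Crawl-delay: 20

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot may crawl dataverse and dataset pages...
print(rp.can_fetch("Googlebot", "/dataset.xhtml"))    # True
print(rp.can_fetch("Googlebot", "/dataverse.xhtml"))  # True
# ...but nothing else, and other crawlers get nothing at all.
print(rp.can_fetch("Googlebot", "/loginpage.xhtml"))  # False
print(rp.can_fetch("SomeOtherBot", "/dataset.xhtml")) # False
print(rp.crawl_delay("Googlebot"))                    # 20
```

This matches the stated policy: let Google crawl the landing pages while keeping other bots out.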
Our current production policy is to let Google crawl our site but keep others out since we were getting bombarded with crawlers and not all behaving well.
Pull request #3667 was closed with "Closing pull request since what we want is locked down by default but perhaps provide a production-ready example in the guides." Perhaps that's what the definition of done is for this issue.
@pdurbin - your definition of done for this issue matches my understanding.
I'm taking a look at pull request #3804. @pameyer one quick thing you could do is fix the "connects to" syntax so that the pull request is associated with the issue in https://waffle.io/IQSS/dataverse . Please note the difference between that pull request and other issues in code review.
Probably our pull request template is confusing. I noted this in #3729.
@pameyer can you please review the changes I just made to pull request #3804? If you're happy, I'm happy and we can move this issue to QA. Please feel free to go ahead and make further edits! At standup @sekmiller also said he'd take a look. Thanks!
@pdurbin I'm happy - your changes made it better.
I'll give @sekmiller a chance to take a look; but if he doesn't see anything I think it's ready for QA.
@pameyer sounds good. I'm taking you off this issue. Thanks for taking a look. Let's give @sekmiller some time to give some feedback as well.
Possibly related to #2274; the current robots.txt advises bots not to index the entire site (at least, judging by `wget -r` default behavior and https://en.wikipedia.org/wiki/Robots_exclusion_standard).