IQSS / dataverse

Open source research data repository software
http://dataverse.org

provide a production-ready example of robots.txt in the guides #3666

Closed: pameyer closed this issue 7 years ago

pameyer commented 7 years ago

Possibly related to #2274. The current robots.txt tells bots not to index any part of the site (at least, judging by the default behavior of wget -r and https://en.wikipedia.org/wiki/Robots_exclusion_standard).

pameyer commented 7 years ago

Using the previously commented-out list of recommended exclusions; unsure if that list should be updated or not.

pameyer commented 7 years ago

Judging by updates to robots.txt on IQSS production, the list of recommended exclusions does need to be updated.

pameyer commented 7 years ago

Slightly more detail on testing:

I attempted to stress-test a development Dataverse install by exhaustive crawling with: wget -r "$DATAVERSE_HOST". wget respected robots.txt and did not download any files, which indicates that (polite) search engine crawlers would not download or index any dataverse or dataset pages either. The updated robots.txt resolves this: a recursive wget now downloads more than the index page.
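The effect described above can be checked without crawling anything. The following is a minimal sketch using Python's standard urllib.robotparser. It assumes the locked-down file has the conventional "disallow everything" form (the issue does not quote the shipped file), and the host name is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a locked-down robots.txt of the conventional form;
# the file actually shipped with Dataverse is not quoted in this issue.
LOCKED_DOWN = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical host, for illustration only.
HOST = "https://dataverse.example.edu"

for path in ("/dataverse.xhtml", "/dataset.xhtml"):
    # Both checks report False: a polite crawler skips these pages.
    print(path, allowed(LOCKED_DOWN, "Wget", HOST + path))
```

Any crawler that honors the file sees the same answer, which is consistent with recursive wget stopping early.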

kcondon commented 7 years ago

I have discussed this with Pete. By default we want robots.txt to be locked down, since we were getting crawlers on our test and dev environments. However, it may make sense to provide an example of an opened-up robots.txt somewhere in the guides, similar to what we have in production now:

User-agent: Googlebot
Allow: /$
Allow: /dataverse.xhtml
Allow: /dataset.xhtml
Disallow: /
Crawl-delay: 20

User-agent: *
Disallow: /

Our current production policy is to let Google crawl our site but keep other crawlers out, since we were being bombarded with crawlers and not all of them were behaving well.
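As a rough way to sanity-check a policy like this before deploying it, the sketch below feeds the example rules to Python's standard urllib.robotparser. The user agent "SomeOtherBot" and the /api/ path are hypothetical, chosen only for illustration. Note that the standard-library parser does plain prefix matching and does not implement the `$` end-of-URL anchor, so the `Allow: /$` rule for the home page is not exercised here; Google's own crawler does understand it.

```python
from urllib.robotparser import RobotFileParser

# The example policy quoted in this thread.
PRODUCTION_EXAMPLE = """\
User-agent: Googlebot
Allow: /$
Allow: /dataverse.xhtml
Allow: /dataset.xhtml
Disallow: /
Crawl-delay: 20

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(PRODUCTION_EXAMPLE.splitlines())

# (user agent, path) pairs; "SomeOtherBot" and the /api/ path are
# hypothetical values used only to probe the rules.
checks = [
    ("Googlebot", "/dataverse.xhtml"),
    ("Googlebot", "/dataset.xhtml"),
    ("Googlebot", "/api/example"),
    ("SomeOtherBot", "/dataset.xhtml"),
]

for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:12} {path:20} {verdict}")

# Crawl-delay is read per user-agent group (Python 3.6+).
print("Googlebot crawl delay:", parser.crawl_delay("Googlebot"))
```

With these rules, Googlebot is allowed on the dataverse and dataset pages and blocked elsewhere, and every other user agent is blocked everywhere.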

pdurbin commented 7 years ago

Pull request #3667 was closed with "Closing pull request since what we want is locked down by default but perhaps provide a production-ready example in the guides." Perhaps that's what the definition of done is for this issue.

pameyer commented 7 years ago

@pdurbin - your definition of done for this issue matches my understanding.

pdurbin commented 7 years ago

I'm taking a look at pull request #3804. @pameyer one quick thing you could do is fix the "connects to" syntax so that the pull request is associated with the issue in https://waffle.io/IQSS/dataverse. Please note the difference between that pull request and the other issues in code review:

[screenshot: 2017-04-28 at 10:29:42 am]

Probably our pull request template is confusing. I noted this in #3729.

pdurbin commented 7 years ago

@pameyer can you please review the changes I just made to pull request #3804? If you're happy, I'm happy and we can move this issue to QA. Please feel free to go ahead and make further edits! At standup @sekmiller also said he'd take a look. Thanks!

pameyer commented 7 years ago

@pdurbin I'm happy - your changes made it better.

I'll give @sekmiller a chance to take a look, but if he doesn't see anything, I think it's ready for QA.

pdurbin commented 7 years ago

@pameyer sounds good. I'm taking you off this issue. Thanks for taking a look. Let's give @sekmiller some time to give some feedback as well.