benjaminestes / robots

Package robots implements robots.txt file parsing and matching based on Google's specification.
https://godoc.org/github.com/benjaminestes/robots
MIT License

There should be a way to account for the status code of a request for a robots.txt file #1

Closed benjaminestes closed 5 years ago

benjaminestes commented 5 years ago

The documentation currently promises that the client only needs to ensure they are using the robots.txt file whose scope includes the URL of interest. However, the status code of the response to a request for a robots.txt file also affects the implied rules.

Having the client make sure to use the correct robots.txt file is the right choice. For the status code, though, I believe this library should do the heavy lifting. The choice is between adding another function and amending From to take an additional status code argument.

In practice, a robots.txt file is always tied to a request for that file. Therefore, it seems reasonable to amend the From function to also take a status code. When a status code is not available (in testing, for example), the client can simulate the desired behavior by supplying one of their own.

I don't see an upside to having From and e.g. FromStatus.
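To make "implied rules" concrete, here is a minimal sketch of the mapping as I read Google's specification: a 4xx response is treated as if no robots.txt exists (everything allowed), and a 5xx response as a full disallow. The names policy and policyFor are hypothetical and not part of this package. The default branch assumes the caller has already followed any redirects.

```go
package main

import "fmt"

// policy is a hypothetical name for the rule set implied by a status code.
type policy int

const (
	parseBody   policy = iota // use the rules found in the response body
	allowAll                  // 4xx: behave as if no robots.txt exists
	disallowAll               // 5xx: treat every path as disallowed
)

// policyFor maps an HTTP status code to the policy it implies.
// Assumption: redirects were already followed by the caller, so any
// code outside 4xx and 5xx falls through to parsing the body.
func policyFor(status int) policy {
	switch {
	case status >= 500 && status < 600:
		return disallowAll
	case status >= 400 && status < 500:
		return allowAll
	default:
		return parseBody
	}
}

func main() {
	for _, status := range []int{200, 404, 503} {
		fmt.Println(status, policyFor(status))
	}
}
```

If From took the status code directly, a test could pass 200 to get the parsed rules without making a real request.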

benjaminestes commented 5 years ago

A solution to this problem suggests a different representation of a robots.txt file. Currently a Robots object holds a bunch of state, and (*Robots) methods test for URL availability under that state. However, the interest in this data lies in its behavior under a certain application, i.e. given a user agent and path, can the agent crawl the path?

What if we choose a procedural representation for Robots? Presumably you still wind up with an internal object that can hold the same state Robots does now. But when you call parse, the return type is func(name, rawurl string) bool, and the behavior is analogous to (*Robots) Test now.

Actually the API would be simplified:

func Locate(rawurl string) string
func From(response int, in io.Reader) func(name, rawurl string) bool

Possibly with a named type for func(name, rawurl string) bool, which would leave room for some extra documentation.

benjaminestes commented 5 years ago

OK, how about adding a "default to allow" boolean field to Robots? The field is set from the status code. Then, when no group matches, that default is used instead of the current implicit default of "true". This only changes behavior when the response code is 5xx.
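Something along these lines, as a sketch only. The struct layout, the group type, newRobots, and the exact-match agent lookup are all simplified stand-ins, not the package's real internals. The point is just where the default enters.

```go
package main

import (
	"fmt"
	"strings"
)

// group is a simplified stand-in for one user-agent group.
type group struct {
	agent      string
	disallowed []string
}

// Robots holds parsed groups plus the fallback for unmatched agents.
type Robots struct {
	groups       []group
	defaultAllow bool // hypothetical field name
}

// newRobots is a hypothetical constructor: the default is "allow"
// unless the robots.txt request returned a 5xx status.
func newRobots(status int, groups []group) *Robots {
	serverError := status >= 500 && status < 600
	return &Robots{groups: groups, defaultAllow: !serverError}
}

// Test reports whether the named agent may crawl path.
func (r *Robots) Test(name, path string) bool {
	for _, g := range r.groups {
		if g.agent != name {
			continue
		}
		for _, p := range g.disallowed {
			if strings.HasPrefix(path, p) {
				return false
			}
		}
		return true
	}
	// No matching group: use the default instead of an implicit true.
	return r.defaultAllow
}

func main() {
	groups := []group{{agent: "examplebot", disallowed: []string{"/private/"}}}
	ok := newRobots(200, groups)
	fmt.Println(ok.Test("examplebot", "/private/a"), ok.Test("otherbot", "/private/a"))

	// On a 5xx there is no body to parse, so no group can match.
	down := newRobots(503, nil)
	fmt.Println(down.Test("examplebot", "/"))
}
```

With 2xx and 4xx responses the default stays true, so existing behavior is unchanged there.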

benjaminestes commented 5 years ago

Closed by 053d199d1a