Pylons / webob

WSGI request and response objects
https://webob.org/

Review Accept-Language header handling #315

Closed whiteroses closed 7 years ago

whiteroses commented 7 years ago

As discussed in issue #256, Accept.best_match() implements its own algorithm that does not match any of the matching schemes suggested in RFC 7231. We agreed that there is more than one reasonable meaning of 'best match', so the plan is to implement the three matching schemes suggested in RFC 7231 as methods, and to deprecate .best_match(). This will resolve issue #256.

Given there is more than one meaning of a 'best match', anything that uses self._match() no longer makes sense:

.__contains__()

According to its docstring, .__contains__() "returns True if the given object is listed in the accepted types", which is quite ambiguous. Looking at the code, it returns True if the offer matches anything in self._parsed_nonzero. But since it calls self._match(), and what counts as a match is now going to depend on the matching scheme, using __contains__ for this no longer makes sense.

Using __contains__/in to test for acceptable language tags doesn't, to my mind, fit the idea of in at all, given that according to RFC 7231 a qvalue of 0 means "not acceptable": a language range with q=0 would be in the header, but not only would it not be a match, it might disqualify matches of other language ranges (the RFCs do not make this clear at all, but it seems to me the only way to interpret "not acceptable"). I think it would make more sense for __contains__ to return True if the argument is one of the language ranges in the header (whatever its qvalue, including 0), but that is not the current behaviour as documented in the docs. Keeping it as is would not make sense either, as what counts as a match will depend on the matching scheme.
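To make the two readings concrete, here is a rough sketch using a deliberately simplified parser (the helper name and parsing are hypothetical, not WebOb's API):

```python
def parse_accept_language(header):
    """Simplified sketch: return (language_range, qvalue) pairs,
    keeping q=0 entries rather than dropping them."""
    pairs = []
    for element in header.split(','):
        element = element.strip()
        if not element:
            continue
        if ';q=' in element:
            range_, q = element.split(';q=')
            pairs.append((range_.strip(), float(q)))
        else:
            pairs.append((element, 1.0))
    return pairs

header = 'en-gb, fr;q=0'

# Reading 1 (what I propose): is this range listed in the header at all,
# regardless of qvalue?
ranges = [r for r, q in parse_accept_language(header)]
assert 'fr' in ranges           # listed, even though q=0

# Reading 2 (the documented behaviour): is this offer "acceptable"?
# A q=0 range is listed but explicitly *not* acceptable.
acceptable = [r for r, q in parse_accept_language(header) if q > 0]
assert 'fr' not in acceptable
```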

.quality()

From the docstring of Accept.quality(), .quality() is intended to "return the quality of the given offer. Returns None if there is no match."

Reading .quality(), it seems "the quality of the given offer" refers to the highest qvalue from all the language ranges that match. Given we've agreed that what is a match depends on the matching scheme, .quality() would also no longer make sense.

.quality() is not mentioned in the docs at all, so there may be less of an issue with deprecating it. (Edited 2017/06/12: sorry, my mistake: it is mentioned in the docs.)
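As I read it, the current behaviour could be modelled roughly like this (a simplified sketch, not the actual implementation; the match-predicate parameter is hypothetical, and is exactly where the ambiguity lives):

```python
def quality(offer, parsed, match):
    """Sketch of .quality(): the highest qvalue among all nonzero
    ranges that match the offer, or None if nothing matches.
    `match` stands in for whatever matching scheme is chosen."""
    qvalues = [q for range_, q in parsed if q > 0 and match(offer, range_)]
    return max(qvalues) if qvalues else None

def loose_match(offer, range_):
    """Roughly what ._match() does today: match both more and less
    specific tags (no RFC 4647 scheme behaves this way)."""
    offer, range_ = offer.lower(), range_.lower()
    return (offer == range_
            or offer.startswith(range_ + '-')
            or range_.startswith(offer + '-'))

assert quality('foo-bar', [('foo', 0.5)], loose_match) == 0.5
```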



Other issues:

Expose .parsed and .parsed_nonzero

As discussed in issue #256.

Bugs to do with handling of language ranges with q=0 or more than two subtags

Those will be fixed when the new matching schemes are implemented, and best_match() deprecated.

__iter__

__iter__ only yields from self._parsed_nonzero. This is as documented:

If we just want to know everything the client prefers, in the order it is preferred:

>>> list(req.accept)
['application/xhtml+xml', 'text/html']

There is an argument that like __contains__, only using the language ranges from _parsed_nonzero doesn't make sense, given that language ranges with q=0 are still important parts of the header — so a list() of the header shouldn't just drop them. (WebOb's current Accept-Language handling generally doesn't take into account the importance of q=0.) But I feel less strongly about this with __iter__ than with in/__contains__; I can see using list() to retrieve only those language ranges that the client accepts, and given that it's documented, I understand the importance of backward compatibility. But then if we changed in/__contains__ to include ranges with q=0, but kept __iter__ the same — might that be confusing?

Ideally, if backward compatibility weren't a problem, I would propose using named methods for these functionalities instead, where their meaning would be more obvious, and there would be much less room for ambiguity. It may be good to add the methods anyway as clearer alternatives, even if we keep the current meaning of __iter__ (__contains__ would have to change regardless, because it currently relies on one single meaning of matching).
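To spell out the two __iter__ behaviours under discussion, using a pre-parsed list of (range, qvalue) pairs rather than WebOb's internals:

```python
# Hypothetical pre-parsed header: 'en-gb, en;q=0.8, fr;q=0'
parsed = [('en-gb', 1.0), ('en', 0.8), ('fr', 0.0)]

# Current __iter__ semantics: only the ranges the client accepts.
accepted = [r for r, q in parsed if q > 0]
assert accepted == ['en-gb', 'en']

# Alternative: everything in the header, q=0 included, since a q=0
# range still carries meaning ("not acceptable").
everything = [r for r, q in parsed]
assert everything == ['en-gb', 'en', 'fr']
```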

(Need a little more time to finish reviewing the regex for Accept-Language, .parse() and __str__ — will update here when done!)

digitalresistor commented 7 years ago

Accept.best_match() does currently implement the matching scheme that is used in Apache (see https://github.com/Pylons/webob/issues/239), so we would want to make sure we continue to support that, simply because it is the only other implementation of the RFC standards.

Other than that, I am a big fan of changing things for the various Accept-* headers so that we can better support those other use cases.

Dropping things that are undocumented is a-okay with me, but we should be aware that WebOb has been around a long time, and not everything people are using may be documented.

whiteroses commented 7 years ago

When I looked at the Apache documentation on content negotiation at http://httpd.apache.org/docs/current/content-negotiation.html a while back, I remember thinking that it didn't seem suitable for WebOb, as the algorithm (if I remember correctly) included Transparent Content Negotiation (which WebOb doesn't support) and involved all four headers (whereas WebOb handles each header individually). But I will go through and work on understanding your work in #239 in June, and we can discuss it further then.

Backward compatibility-wise, happy to keep anything around that you think we should keep around, and I'll try to (where possible) leave the deprecating/removing of any code that people might rely on towards the end, to give us plenty of time to decide the best thing to do.

whiteroses commented 7 years ago

@bertjwregeer, @mmerickel, @stevepiercy:

This is what I have for the regex and .parse() for Accept-Language so far:

https://gist.github.com/whiteroses/e2134e01e4a30f0b25a0e48ad51e0fc2

1) In RFC 7230, section 7:

> For compatibility with legacy list rules, a recipient MUST parse and ignore a reasonable number of empty list elements: enough to handle common mistakes by senders that merge values, but not so much that they could be used as a denial-of-service mechanism.

Should we limit the number of empty list elements? Both the ABNF and the regex currently use "*", i.e. zero or more.
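For concreteness, here is a self-contained sketch of separators that tolerate empty list elements, around a deliberately simplified language-range element (this is not the gist's actual regex):

```python
import re

# Simplified element: language range plus optional qvalue.
element = r'[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*(?:\s*;\s*q=\d(?:\.\d{0,3})?)?'

header_re = re.compile(
    r'^\s*(?:,\s*)*'                           # leading empty elements, unbounded ("*")
    + element +
    r'(?:\s*,(?:\s*' + element + r')?)*\s*$'   # further elements; empties allowed
)

assert header_re.match('da, en-gb;q=0.8, en;q=0.7')
assert header_re.match(',, da,, en-gb;q=0.8,')   # empty list elements tolerated
assert not header_re.match('da;;q=0.8')          # but not arbitrary garbage
```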

2) It looks to me like we have to check the validity of the header with one regex, then get the groups with another regex. Do you see a better/more efficient way?

3) In test_acceptparse.py, test_init_accept_accept_language(), are 0.80000000000000004 and 0.69999999999999996 used instead of 0.8 and 0.7 to test that they are floats?

digitalresistor commented 7 years ago
  1. Should we limit the number of empty list elements? Both the ABNF and the regex currently use "*", i.e. zero or more.

     I don't think that is necessary. Zero or more is fine; I don't think it will be that big of a denial-of-service mechanism.

  2. I don't at the moment. One thing we do need to be concerned about is that regexes may lead to high CPU usage if an attacker sends a very long list, for example. We want to make sure that we don't have too many lookbehinds/lookarounds in our regex, or, if we have them, to be careful that they are limited; the same goes for greedy vs. non-greedy quantifiers. See for example https://eyalsch.wordpress.com/2009/05/21/regex/. This is something I noticed in the original implementation too.
  3. The numbers 0.8 and 0.7 cannot be represented exactly as floats due to IEEE 754. This is shown here, where the float()/min()/max() calls round the value to the nearest representable floating point number: https://github.com/Pylons/webob/blob/master/src/webob/acceptparse.py#L50 It may be worth looking at alternate ways of representing the quality value rather than as a floating point number, but for now I am not as worried about that.
digitalresistor commented 7 years ago

For number 3, here is the authoritative Python documentation: https://docs.python.org/3/tutorial/floatingpoint.html
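The rounding is easy to see from the standard library, since Decimal exposes the exact binary value a float literal is stored as:

```python
from decimal import Decimal

# The exact IEEE 754 double stored for the literal 0.8:
print(Decimal(0.8))
# 0.8000000000000000444089209850062616169452667236328125

# Both literals round to the same nearest double, so they compare equal:
assert 0.8 == 0.80000000000000004
assert float('0.8') == 0.8
```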

whiteroses commented 7 years ago

From the Java Explorer blog:

2) If I understand correctly, *? is faster than * when the match is short (the average case), but slower when the match is long (worst case, as in the case of an attacker). But actually checking with timeit:

With the regex in the gist:

    >>> timeit.timeit(stmt="accept_language_compiled_re.match('segment' + '-segment' * 1000 + ';q=0.123' + ', segment-segment;q=0.123' * 1000)", setup="from webob.acceptparse import accept_language_compiled_re", number=1000)
    6.565146894994541

If we switched the `*`s to `*?` wherever possible:

    >>> timeit.timeit(stmt="accept_language_compiled_re.match('segment' + '-segment' * 1000 + ';q=0.123' + ', segment-segment;q=0.123' * 1000)", setup="from webob.acceptparse import accept_language_compiled_re", number=1000)
    6.578283832001034

There doesn't seem to be a difference. Which should we use?
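For anyone wanting to reproduce the comparison, here is a self-contained version against a simplified stand-in pattern (hypothetical, not the gist's regex); timings vary by machine, so no conclusion is baked in:

```python
import re
import timeit

text = 'segment' + '-segment' * 1000 + ';q=0.123'
greedy = re.compile(r'[a-z-]*;q=0\.\d{1,3}')
lazy = re.compile(r'[a-z-]*?;q=0\.\d{1,3}')

# Both quantifiers match the same span; only the search order differs.
assert greedy.match(text).group() == lazy.match(text).group()

# Compare on your own machine; results vary:
for name, pattern in (('greedy', greedy), ('lazy', lazy)):
    t = timeit.timeit(lambda: pattern.match(text), number=1000)
    print(name, round(t, 4))
```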

3) Python's re module doesn't have possessive quantifiers, right?

5) I didn't think of this! Have changed it, will remember from now on.

Do you think we should add some sanity-check upper limit to the number of repetitions instead of all the "*"s?

Some of the points in the Java Explorer blog are easy to miss — if you see any way I can improve the regex, please let me know and I'll fix!


On floats (question 3), the values we're comparing to in the test are also floats, so from what I understand it shouldn't make any difference? We're not doing any calculations on the floats.

RFC 2616 specifically says that the qvalues are 'short "floating point" numbers', but this has been removed in RFC 7231. I can't imagine why anyone would need qvalues to be that precise though.

digitalresistor commented 7 years ago
  1. I'm okay keeping things as is, as long as we are careful not to cause issues.
  2. Python doesn't have that indeed.
  3. ? I am not sure what this is answering.

On floats (question 3), the values we're comparing to in the test are also floats, so from what I understand it shouldn't make any difference? We're not doing any calculations on the floats.

I agree with you: since 0.8 == 0.80000000000000004 is True in Python (both literals round to the same floating point value), I am not sure why someone went through the trouble of using the full precision there.

RFC 2616 specifically says that the qvalues are 'short "floating point" numbers', but this has been removed in RFC 7231. I can't imagine why anyone would need qvalues to be that precise though.

In the test the q value is 0.8; however, because Python's floating point is IEEE 754, the actual number Python uses is 0.80000000000000004. The test is verifying that the return value from the function matches that floating point number, since it turned the text 0.8 into the float 0.8.

Do we have any tests where the q value has more than two decimal places? If so, those won't match or work anymore with your new regex.
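For reference, RFC 7231's qvalue grammar allows at most three decimal places; a direct transcription as a sketch:

```python
import re

# RFC 7231 section 5.3.1:
#   qvalue = ( "0" [ "." 0*3DIGIT ] ) / ( "1" [ "." 0*3("0") ] )
qvalue_re = re.compile(r'^(?:0(?:\.\d{0,3})?|1(?:\.0{0,3})?)$')

assert qvalue_re.match('0.8')
assert qvalue_re.match('0.123')
assert qvalue_re.match('1.000')
assert not qvalue_re.match('0.1234')   # four decimal places: invalid
assert not qvalue_re.match('1.5')      # qvalues above 1 are invalid
```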

whiteroses commented 7 years ago

Sorry, 4. should have been 5.: "5) Use alternation wisely" in the Java Explorer blog. Just letting you know I got it and fixed it!

By "I can't imagine why anyone would need qvalues to be that precise though", I was replying to

It may be worth looking at alternate ways of presenting the quality value rather than as a floating point number

I don't see why anyone would need qvalues that are more precise than floats, so even though they no longer specify floats in the RFC, I think floats should be fine.

My new regex matches qvalues of up to three decimal places. There is one test for the Accept class that uses a qvalue with four decimal places (test_best_match_with_complex_q in test_acceptparse.py), so I'll need to update that when we get to the Accept class.

digitalresistor commented 7 years ago

Updating that to match the standard is perfect :-)

mmerickel commented 7 years ago

With regard to __contains__ and quality, I think the conclusion that these do not make sense is not correct. They are in fact not matching algorithms and are not attempting to perform any complex lookup; rather, they look at one single offer and determine whether it falls into the list of provided options in the header string. The lookup algorithms deal with a list of offers and determine which one is best for a given header string. Similarly, __iter__ is a convenience that wouldn't be used for any complex content negotiation: it is not following any particular RFC, but rather just doing what a human might try to do when reading the original string.

As far as _parsed_nonzero I see no real need for that one to be a public API. parsed is of course useful though.

whiteroses commented 7 years ago

With _parsed_nonzero, I'd arrived at the same conclusion, that it's easy enough for anyone who wants it to obtain it from parsed, so didn't expose _parsed_nonzero in this PR — sorry, should have updated here.


With __contains__ and quality: they both use ._match(), and if we look at AcceptLanguage._match(), we see the matching that was identified in https://github.com/Pylons/webob/issues/256 as the source of the reported issue (see this comment; you may recall me talking a lot about how it matches both more and less specific language tags, when no matching scheme in RFC 4647 does).

To use an example: say the Accept-Language header is foo, and the offer we have is foo-bar. Should __contains__ return True? Should .quality('foo-bar') return the qvalue for foo? With the current implementation, they do, because ._match() matches both more and less specific language tags:

>>> 'foo' in Request.blank('/', accept_language='foo-bar').accept_language
True
>>> 'foo-bar' in Request.blank('/', accept_language='foo').accept_language
True
>>> Request.blank('/', accept_language='foo').accept_language.quality('foo-bar')
1.0
>>> Request.blank('/', accept_language='foo-bar').accept_language.quality('foo')
1.0

But that is not behaviour specified in either RFC 7231 or 4647. When you say "...rather they are looking at one single offer and determining whether it falls into the list of provided options in the header string", there is no one sense of "whether it falls into": we can match more specific, or we can match less specific. We still need to choose one meaning of what is a match for something like __contains__.

An important point from RFC 4647:

Filtering seems to imply that there is a semantic relationship between language tags that share the same prefix. While this is often the case, it is not always true: the language tags that match a specific language range do not necessarily represent mutually intelligible languages.

That is why whether we match more or less specific language tags is important, and depends on the needs of the application.

There is another option for __contains__, where we do a plain simple string match. But that is not what it is currently doing, given that it uses ._match(). And it is not what is documented for __contains__: the reference page is where the Accept* headers handling is currently documented (the API docs are minimal); there, in (__contains__) is documented as a "test for acceptability".

It is worth mentioning at this point that my understanding of RFC 7231 has changed, as mentioned in the comment here (would really appreciate if you could review it, to check that we agree on the interpretation). If the interpretation there is correct, then only basic language ranges are expected in the Accept-Language header. This means that the Extended Filtering matching scheme in RFC 4647 no longer applies; which in turn means we are left with two matching schemes: Basic Filtering, which matches equally or more specific language tags (offers), and returns a set; and Lookup, which matches equally or less specific language tags, and returns one single best match.
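To make the asymmetry concrete, here is a rough self-contained sketch of the two remaining schemes (simplified: case handling is minimal, and Lookup's single-character-subtag step is omitted):

```python
def basic_filtering(language_ranges, tags):
    """RFC 4647 Basic Filtering (simplified): a tag matches a range when
    it equals the range, or the range is a prefix of the tag followed by
    '-'. Returns every matching tag, so it matches *more* specific tags."""
    matches = []
    for tag in tags:
        for range_ in language_ranges:
            r, t = range_.lower(), tag.lower()
            if r == '*' or t == r or t.startswith(r + '-'):
                matches.append(tag)
                break
    return matches

def lookup(language_ranges, tags, default=None):
    """RFC 4647 Lookup (simplified): truncate each range from the end
    until it exactly matches an available tag, so it falls back to
    *less* specific tags and returns one single best match."""
    available = {t.lower(): t for t in tags}
    for range_ in language_ranges:
        r = range_.lower()
        while r:
            if r in available:
                return available[r]
            r = '-'.join(r.split('-')[:-1])   # drop the last subtag
    return default

# The asymmetry from the foo / foo-bar example above:
assert basic_filtering(['foo'], ['foo-bar']) == ['foo-bar']   # more specific
assert basic_filtering(['foo-bar'], ['foo']) == []
assert lookup(['foo-bar'], ['foo']) == 'foo'                  # less specific
```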

So I think we have three options:

1. Keep __contains__ and quality() as they are, and continue to support the current meaning of a match alongside the new matching schemes, for backward compatibility reasons. Document them as being specifically for supporting legacy matching.
2. Simplify __contains__ and .quality() to do a simple string match. That makes the most sense to me, but breaks backward compatibility.
3. Use Basic Filtering for __contains__, and document that it is using Basic Filtering. Give .quality() a parameter for specifying the matching scheme, perhaps defaulting to Basic Filtering. (A quality() like that may not be simple to implement.) That still breaks backward compatibility, but to a lesser degree than option 2.

As mentioned in the first comment, I think __iter__ is ok as is, but it may be confusing if its behaviour is not consistent with what __contains__ does, where e.g. something appears in the list() but is not in the header.

whiteroses commented 7 years ago

On WebOb's current matching and backward compatibility: I can't imagine it being ok to break things like in/__contains__ or change the algorithm of how they work such that they return a different result, so I think we'll have to keep most of the methods and special methods that use the old matching algorithm around wherever possible, and document that they use the old algorithm. Maybe with a warning that they may be deprecated in a year or two?

mmerickel commented 7 years ago

Yes I would start with that assumption and isolate your implementation to the specific lookup / matching methods.

digitalresistor commented 7 years ago

Once the new implementation is in place, it is going to be easier to start looking at how to replace the current in/__contains__. I wouldn't be opposed to replacing them in the future if it means we can do away with the current implementation.