whiteroses closed this issue 7 years ago
`Accept.best_match()` currently implements the matching scheme that is used in Apache (see https://github.com/Pylons/webob/issues/239), so we would want to make sure we continue to support that, simply because it is the only other implementation of the RFC standards.
Other than that, I am a big fan of changing things for the various Accept-*
headers so that we can better support those other use cases.
Dropping things that are undocumented is a-okay with me, but we should be aware that WebOb has been around a long time, and what people are using may not all be documented.
When I looked at the Apache documentation on content negotiation at http://httpd.apache.org/docs/current/content-negotiation.html a while back, I remember thinking that it didn't seem suitable for WebOb, as the algorithm (if I remember correctly) included Transparent Content Negotiation (which WebOb doesn't support) and involved all four headers (whereas WebOb handles each header individually). But I will go through and work on understanding your work in #239 in June, and we can discuss it further then.
Backward compatibility-wise, I'm happy to keep anything around that you think we should keep, and I'll try (where possible) to leave the deprecating/removing of any code that people might rely on towards the end, to give us plenty of time to decide the best thing to do.
@bertjwregeer, @mmerickel, @stevepiercy:
This is what I have for the regex and `.parse()` for Accept-Language so far:
https://gist.github.com/whiteroses/e2134e01e4a30f0b25a0e48ad51e0fc2
1) In RFC 7230, section 7:
> For compatibility with legacy list rules, a recipient MUST parse and ignore a reasonable number of empty list elements: enough to handle common mistakes by senders that merge values, but not so much that they could be used as a denial-of-service mechanism.
Should we limit the number of empty list elements? Both the ABNF and the regex currently use `*`, i.e. 0 or more.
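If we did want a cap, a bounded quantifier on the separator is one way to do it. A minimal sketch, assuming a cap of roughly ten consecutive commas (these patterns are hypothetical illustrations, not the gist's actual regex):

```python
import re

# Hypothetical element pattern for a comma-separated header list.
element = r"[a-zA-Z-]+"

# "(?:,[ \t]*){1,10}" allows the separating comma plus up to 9 stray empty
# elements (e.g. "a,,,b"), instead of the unbounded "(?:,[ \t]*)+",
# following RFC 7230 section 7's advice about denial of service.
list_re = re.compile(
    r"^[ \t]*" + element + r"(?:(?:,[ \t]*){1,10}" + element + r")*[ \t]*$"
)

print(bool(list_re.match("en, fr,, de")))            # a few empties tolerated
print(bool(list_re.match("en" + ",," * 50 + "fr")))  # too many: rejected
```

The bound also keeps backtracking cheap on malicious input, since the engine never has more than ten separator positions to reconsider per element.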
2) It looks to me like we have to check the validity of the header with one regex, then get the groups with another regex. Do you see a better/more efficient way?
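For what it's worth, the two-pass approach can be sketched like this; the `LANG_RANGE`/`QVALUE` patterns are deliberately simplified illustrations (not the gist's actual regexes), and `parse` is a hypothetical helper:

```python
import re

# Simplified, hypothetical patterns for illustration only.
LANG_RANGE = r"(?:\*|[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*)"
QVALUE = r"(?:0(?:\.[0-9]{0,3})?|1(?:\.0{0,3})?)"
ITEM = LANG_RANGE + r"(?:[ \t]*;[ \t]*[qQ]=" + QVALUE + r")?"

# Pass 1: an anchored regex checks that the entire header is well formed.
header_re = re.compile(r"^[ \t]*" + ITEM + r"(?:[ \t]*,[ \t]*" + ITEM + r")*[ \t]*$")

# Pass 2: a capturing regex extracts (range, qvalue) pairs element by element.
item_re = re.compile(
    r"(" + LANG_RANGE + r")(?:[ \t]*;[ \t]*[qQ]=(" + QVALUE + r"))?"
)

def parse(header):
    if not header_re.match(header):
        raise ValueError("invalid Accept-Language header")
    return [(m.group(1), float(m.group(2) or 1.0))
            for m in item_re.finditer(header)]

print(parse("da, en-gb;q=0.8, en;q=0.7"))
# [('da', 1.0), ('en-gb', 0.8), ('en', 0.7)]
```

Validating first means the extraction pass never sees malformed input, so its groups can stay simple; the cost is scanning the string twice.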
In test_acceptparse.py, `test_init_accept_accept_language()`, are 0.80000000000000004 and 0.69999999999999996 used instead of 0.8 and 0.7 to test that they are floats?
> Should we limit the number of empty list elements? Both the ABNF and the regex currently use `*`, i.e. 0 or more.

I don't think that is necessary. 0 or more is fine; I don't think it could be exploited as much of a denial-of-service mechanism.
For number 3, here is the authoritative Python documentation: https://docs.python.org/3/tutorial/floatingpoint.html
From the Java Explorer blog:
2) If I understand correctly, `*?` is faster than `*` when the match is short (the average case), but slower when the match is long (the worst case, as with an attacker). But actually checking with `timeit`:
With the regex in the gist:
>>> timeit.timeit(stmt="accept_language_compiled_re.match('segment' + '-segment' * 1000 + ';q=0.123' + ', segment-segment;q=0.123' * 1000)", setup="from webob.acceptparse import accept_language_compiled_re", number=1000)
6.565146894994541
If we switched the `*`s to `*?` wherever possible:
>>> timeit.timeit(stmt="accept_language_compiled_re.match('segment' + '-segment' * 1000 + ';q=0.123' + ', segment-segment;q=0.123' * 1000)", setup="from webob.acceptparse import accept_language_compiled_re", number=1000)
6.578283832001034
There doesn't seem to be a difference. Which should we use?
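One possible explanation for the identical timings is that both quantifiers end up at the same unique match, so they differ only in scanning order, not in the amount of backtracking on this input. A small standalone illustration (not WebOb code):

```python
import re

s = "a-b-c;q=0.5"

# Greedy: "[a-z-]*" grabs as much as possible, then backs off until the
# rest of the pattern (";q=") can match.
greedy = re.match(r"([a-z-]*);q=", s)

# Lazy: "[a-z-]*?" grabs as little as possible, then grows one character
# at a time until ";q=" matches.
lazy = re.match(r"([a-z-]*?);q=", s)

# Both strategies arrive at the same (unique) match; they only differ in
# which intermediate attempts the engine makes along the way.
print(greedy.group(1), lazy.group(1))  # a-b-c a-b-c
```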
3) Python's re module doesn't have possessive quantifiers, right?
5) I didn't think of this! Have changed it, will remember from now on.
Do you think we should add some sanity-check upper limit to the number of repetitions instead of all the "*"s?
Some of the points in the Java Explorer blog are easy to miss — if you see any way I can improve the regex, please let me know and I'll fix!
On floats (question 3), the values we're comparing to in the test are also floats, so from what I understand it shouldn't make any difference? We're not doing any calculations on the floats.
RFC 2616 specifically says that the qvalues are 'short "floating point" numbers', but this has been removed in RFC 7231. I can't imagine why anyone would need qvalues to be that precise though.
> On floats (question 3), the values we're comparing to in the test are also floats, so from what I understand it shouldn't make any difference? We're not doing any calculations on the floats.
I agree with you: since `0.8 == 0.80000000000000004` is True in Python (both literals round to the same IEEE 754 double), I am not sure why someone went through the trouble of using the full precision there.
> RFC 2616 specifically says that the qvalues are 'short "floating point" numbers', but this has been removed in RFC 7231. I can't imagine why anyone would need qvalues to be that precise though.
In the test the q value is `0.8`; however, because Python's floats are IEEE 754 doubles, the actual number Python stores is `0.80000000000000004`. The test is verifying that the return from the function matches that floating point number, since it turned the text `0.8` into the float `0.8`.
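The point is easy to demonstrate with the standard library:

```python
from decimal import Decimal

# The literals 0.8 and 0.80000000000000004 round to the same IEEE 754
# double, so they compare equal and either spelling works in the test.
print(0.8 == 0.80000000000000004)  # True

# Decimal(float) shows the exact binary value actually stored for 0.8:
print(Decimal(0.8))
# 0.8000000000000000444089209850062616169452667236328125
```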
Do we have any tests where the `q` value has more than two decimal places? If so, with your new regex those won't match or work anymore.
Sorry, 4. should have been 5.: "5) Use alternation wisely" in the Java Explorer blog. Just letting you know I got it and fixed it!
By "I can't imagine why anyone would need qvalues to be that precise though", I was replying to:

> It may be worth looking at alternate ways of presenting the quality value rather than as a floating point number
I don't see why anyone would need qvalues that are more precise than floats, so even though they no longer specify floats in the RFC, I think floats should be fine.
My new regex matches qvalues of up to three decimal places. There is one test for the `Accept` class that uses a qvalue of four decimal places (`test_best_match_with_complex_q` in test_acceptparse.py), so I'll need to update that when we get to the `Accept` class.
Updating that to match the standard is perfect :-)
With regard to `__contains__` and `quality`, I think the conclusion that these do not make sense is not correct. They are in fact not matching algorithms and are not attempting to perform any complex lookup; rather, they are looking at one single offer and determining whether it falls into the list of provided options in the header string. The lookup algorithms are dealing with a list of offers and determining which one is best for a given header string. Similarly with `__iter__`: it's a convenience that wouldn't be used for any complex content negotiation. It is not following any particular RFC, but rather just doing what a human might try to do when reading the original string.
As far as `_parsed_nonzero` goes, I see no real need for that one to be a public API. `parsed` is of course useful though.
With `_parsed_nonzero`, I'd arrived at the same conclusion (that it's easy enough for anyone who wants it to obtain it from `parsed`), so I didn't expose `_parsed_nonzero` in this PR. Sorry, I should have updated here.
With `__contains__` and `quality`: they both use `._match()`, and if we look at `AcceptLanguage._match()`, we see the matching that was identified in https://github.com/Pylons/webob/issues/256 as the source of the reported issue (see this comment) (you may recall me talking a lot about how it matches both more and less specific language tags, when no matching scheme in RFC 4647 does).
To use an example: say the Accept-Language header is `foo`, and the offer we have is `foo-bar`. Should `__contains__` return True? Should `.quality('foo-bar')` return the qvalue for `foo`? With the current implementation, they do, because `._match()` matches both more and less specific language tags:
>>> 'foo' in Request.blank('/', accept_language='foo-bar').accept_language
True
>>> 'foo-bar' in Request.blank('/', accept_language='foo').accept_language
True
>>> Request.blank('/', accept_language='foo').accept_language.quality('foo-bar')
1.0
>>> Request.blank('/', accept_language='foo-bar').accept_language.quality('foo')
1.0
But that is not behaviour specified in either RFC 7231 or 4647. When you say "...rather they are looking at one single offer and determining whether it falls into the list of provided options in the header string", there is no one sense of "whether it falls into": we can match more specific, or we can match less specific. We still need to choose one meaning of what is a match for something like `__contains__`.
An important point from RFC 4647:
> Filtering seems to imply that there is a semantic relationship between language tags that share the same prefix. While this is often the case, it is not always true: the language tags that match a specific language range do not necessarily represent mutually intelligible languages.
That is why whether we match more or less specific language tags is important, and depends on the needs of the application.
There is another option for `__contains__`, where we do a plain simple string match. But that is not what it is currently doing, given that it uses `._match()`. And it is not what is documented for `__contains__`: the reference page is where the Accept* headers handling is currently documented (the API docs are minimal); there, `in` (`__contains__`) is documented as a "test for acceptability".
It is worth mentioning at this point that my understanding of RFC 7231 has changed, as mentioned in the comment here (would really appreciate if you could review it, to check that we agree on the interpretation). If the interpretation there is correct, then only basic language ranges are expected in the Accept-Language header. This means that the Extended Filtering matching scheme in RFC 4647 no longer applies; which in turn means we are left with two matching schemes: Basic Filtering, which matches equally or more specific language tags (offers), and returns a set; and Lookup, which matches equally or less specific language tags, and returns one single best match.
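To make the two remaining schemes concrete, here is a rough sketch of both; the function names and signatures are hypothetical illustrations (not WebOb's API), and the Lookup sketch omits RFC 4647's rule about discarding trailing single-character subtags:

```python
def basic_filter(language_range, tags):
    """RFC 4647 Basic Filtering sketch: a range matches a tag if it equals
    the tag, or is a prefix of it followed by '-' ('*' matches everything).
    Returns the set of matching (equally or more specific) tags."""
    r = language_range.lower()
    return {t for t in tags
            if r == "*" or t.lower() == r or t.lower().startswith(r + "-")}

def lookup(language_range, tags):
    """RFC 4647 Lookup sketch: progressively shorten the range by stripping
    subtags from the end until a tag matches; returns one best match
    (equally or less specific), or None."""
    tagset = {t.lower(): t for t in tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in tagset:
            return tagset[candidate]
        # RFC 4647 also says to drop a single-letter/digit subtag left at
        # the end of the shortened range; omitted here for brevity.
        subtags.pop()
    return None

print(sorted(basic_filter("de", ["de", "de-CH", "fr"])))  # ['de', 'de-CH']
print(lookup("de-CH-1996", ["de", "fr"]))                 # de
```

The asymmetry is visible right in the examples: Basic Filtering keeps the more specific `de-CH` alongside `de`, while Lookup walks `de-CH-1996` back down to the less specific `de`.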
So I think we have three options:
1) Keep `__contains__` and `quality()` as they are, and continue to support the current meaning of what a match is, alongside the new matching schemes, for backward compatibility reasons. Document them as specifically for supporting legacy matching.
2) Simplify `__contains__` and `.quality()` to do a simple string match. That makes the most sense to me, but breaks backward compatibility.
3) Use Basic Filtering for `__contains__`, and document that it is using Basic Filtering. Give `.quality()` a parameter for specifying the matching scheme, perhaps with a default of Basic Filtering. (A `quality()` like that may not be simple to implement.) That still breaks backward compatibility, but to a lesser degree than 2.
As mentioned in the first comment, I think `__iter__` is ok as is, but it may be confusing if its behaviour is not consistent with what `__contains__` does, where e.g. something appears in the `list()` but is not `in` the header.
On WebOb's current matching and backward compatibility: I can't imagine it being ok to break things like `in`/`__contains__`, or to change the algorithm of how they work such that they return a different result. So I think we'll have to keep most of the methods and special methods that use the old matching algorithm around wherever possible, and document that they use the old algorithm. Maybe with a warning that they may be deprecated in a year or two?
Yes I would start with that assumption and isolate your implementation to the specific lookup / matching methods.
Once the new implementation is in place, it is going to be easier to start looking at how to replace the current `in`/`__contains__`. I wouldn't be opposed to replacing them in the future if it means we can do away with the current implementation.
As discussed in issue #256, `Accept.best_match()` implements its own algorithm that does not match any of the matching schemes suggested in RFC 7231. We agreed that there was not just one meaning of what is to be considered a 'best match', so the plan is to implement the three matching schemes suggested in RFC 7231 as methods, and deprecate `.best_match()`. This will resolve issue #256.

Given there is more than one meaning of a 'best match', anything that uses `self._match()` no longer makes sense:

`.__contains__()`

According to its docstring, `.__contains__()` "returns True if the given object is listed in the accepted types" — which is quite ambiguous in its meaning. Looking at the code, it returns True if the offer matches anything in `self._parsed_nonzero`. But as it calls `self._match()`, and what is a match is now going to depend on the matching scheme, using `__contains__` for this no longer makes sense.

Using `__contains__`/`in` to return the acceptable language tags seems to my mind not to fit the idea of `in` anyway, if, as according to RFC 7231, a qvalue of 0 means "not acceptable" — so a language range with q=0 would be `in` the header, but not only would it not be a match, it might disqualify matches of other language ranges (this is not made clear at all by the RFCs, but it seems to me the only way to interpret "not acceptable"). I think it would make more sense for `__contains__` to return True if the argument is one of the language ranges in the header (whatever its qvalue, including if it's 0), but this is not the current behaviour as documented in the docs. Keeping it as is, however, would not make sense, as what is a match will depend on the matching scheme.

`.quality()`

From the docstring of `Accept.quality()`, `.quality()` is intended to "return the quality of the given offer. Returns None if there is no match."

Reading `.quality()`, it seems "the quality of the given offer" refers to the highest qvalue from all the language ranges that match. Given we've agreed that what is a match depends on the matching scheme, `.quality()` would also no longer make sense.

(Edited 2017/06/12: sorry, my mistake: it is mentioned in the docs.) `.quality()` is not mentioned in the docs at all, so there may be less of an issue with deprecating it.

Other issues:

Expose `.parsed` and `.parsed_nonzero`

As discussed in issue #256.

Bugs to do with handling of language ranges with q=0 or more than two subtags

Those will be fixed when the new matching schemes are implemented, and `best_match()` deprecated.

`__iter__`

`__iter__` only yields from `self._parsed_nonzero`. This is as documented:

There is an argument that, like `__contains__`, only using the language ranges from `_parsed_nonzero` doesn't make sense, given that language ranges with q=0 are still important parts of the header — so a `list()` of the header shouldn't just drop them. (WebOb's current `Accept-Language` handling generally doesn't take into account the importance of q=0.) But I feel less strongly about this with `__iter__` than with `in`/`__contains__`; I can see using `list()` to retrieve only those language ranges that the client accepts, and given that it's documented, I understand the importance of backward compatibility. But then if we changed `in`/`__contains__` to include ranges with q=0, but kept `__iter__` the same — might that be confusing?

Ideally, if backward compatibility weren't a problem, I would propose using named methods for these functionalities instead, where their meaning would be more obvious and there would be much less room for ambiguity. It may be good to add the methods anyway as clearer alternatives, even if we keep the current meaning of `__iter__` (`__contains__` would have to change regardless, because it currently relies on one single meaning of matching).

(Need a little more time to finish reviewing the regex for `Accept-Language`, `.parse()` and `__str__` — will update here when done!)