@vytas7 Did you get a chance to look at this? Thank you. If there is anything I can help with, feel free to leave me a message.
Hi @JerryKwan, thanks for looking into this. How does this change affect small form bodies? I want to make sure that the final solution is performant for both small and large forms, especially since this function is also used for query strings, which typically only include a few fields.
Hi @kgriffs, most of the time it will perform better than the old version. You can use the following snippet to run some tests:
import io
import time
import urllib
import urllib.parse

# NOTE(kgriffs): See also RFC 3986
_UNRESERVED = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
               'abcdefghijklmnopqrstuvwxyz'
               '0123456789'
               '-._~')

# NOTE(kgriffs): See also RFC 3986
_DELIMITERS = ":/?#[]@!$&'()*+,;="
_ALL_ALLOWED = _UNRESERVED + _DELIMITERS

_HEX_DIGITS = '0123456789ABCDEFabcdef'

# This map construction is based on urllib's implementation
_HEX_TO_BYTE = dict(((a + b).encode(), bytes([int(a + b, 16)]))
                    for a in _HEX_DIGITS
                    for b in _HEX_DIGITS)


def decode_old(encoded_uri, unquote_plus=True):
    """Decodes percent-encoded characters in a URI or query string.

    This function models the behavior of `urllib.parse.unquote_plus`,
    albeit in a faster, more straightforward manner.

    Args:
        encoded_uri (str): An encoded URI (full or partial).

    Keyword Arguments:
        unquote_plus (bool): Set to ``False`` to retain any plus ('+')
            characters in the given string, rather than converting them to
            spaces (default ``True``). Typically you should set this
            to ``False`` when decoding any part of a URI other than the
            query string.

    Returns:
        str: A decoded URL. If the URL contains escaped non-ASCII
        characters, UTF-8 is assumed per RFC 3986.
    """
    decoded_uri = encoded_uri

    # PERF(kgriffs): Don't take the time to instantiate a new
    # string unless we have to.
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')

    # Short-circuit if we can
    if '%' not in decoded_uri:
        return decoded_uri

    # NOTE(kgriffs): Clients should never submit a URI that has
    # unescaped non-ASCII chars in them, but just in case they
    # do, let's encode into a non-lossy format.
    decoded_uri = decoded_uri.encode('utf-8')

    # PERF(kgriffs): This was found to be faster than using
    # a regex sub call or list comprehension with a join.
    tokens = decoded_uri.split(b'%')
    decoded_uri = tokens[0]
    for token in tokens[1:]:
        token_partial = token[:2]
        try:
            decoded_uri += _HEX_TO_BYTE[token_partial] + token[2:]
        except KeyError:
            # malformed percentage like "x=%" or "y=%+"
            decoded_uri += b'%' + token

    # Convert back to str
    return decoded_uri.decode('utf-8', 'replace')


def decode_new(encoded_uri, unquote_plus=True):
    """Decodes percent-encoded characters in a URI or query string.

    Same behavior as decode_old(), but accumulates the decoded bytes
    in an io.BytesIO buffer instead of repeatedly concatenating
    bytestrings.

    Args:
        encoded_uri (str): An encoded URI (full or partial).

    Keyword Arguments:
        unquote_plus (bool): Set to ``False`` to retain any plus ('+')
            characters in the given string, rather than converting them to
            spaces (default ``True``). Typically you should set this
            to ``False`` when decoding any part of a URI other than the
            query string.

    Returns:
        str: A decoded URL. If the URL contains escaped non-ASCII
        characters, UTF-8 is assumed per RFC 3986.
    """
    decoded_uri = encoded_uri

    # PERF(kgriffs): Don't take the time to instantiate a new
    # string unless we have to.
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')

    # Short-circuit if we can
    if '%' not in decoded_uri:
        return decoded_uri

    # NOTE(kgriffs): Clients should never submit a URI that has
    # unescaped non-ASCII chars in them, but just in case they
    # do, let's encode into a non-lossy format.
    decoded_uri = decoded_uri.encode('utf-8')

    # PERF(kgriffs): This was found to be faster than using
    # a regex sub call or list comprehension with a join.
    tokens = decoded_uri.split(b'%')
    buffer = io.BytesIO()
    buffer.write(tokens[0])
    for token in tokens[1:]:
        token_partial = token[:2]
        try:
            buffer.write(_HEX_TO_BYTE[token_partial])
            buffer.write(token[2:])
        except KeyError:
            # malformed percentage like "x=%" or "y=%+"
            buffer.write(b'%')
            buffer.write(token)

    # Convert back to str
    return buffer.getvalue().decode('utf-8', 'replace')


def test(encoded_uri):
    print('len(encoded_uri) = ', len(encoded_uri))

    start = time.time()
    decode_old(encoded_uri)
    delta = time.time() - start
    print('decode_old elapsed: %s' % delta)

    start = time.time()
    decode_new(encoded_uri)
    delta = time.time() - start
    print('decode_new elapsed: %s' % delta)


def test_old_simple_short():
    data = '1234567890'
    decode_old(data)


def test_new_simple_short():
    data = '1234567890'
    decode_new(data)


def test_old_simple_long():
    data = '1234567890' * 100
    decode_old(data)


def test_new_simple_long():
    data = '1234567890' * 100
    decode_new(data)


def test_old_quote_plus_short():
    data = urllib.parse.quote_plus('http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg')
    decode_old(data)


def test_new_quote_plus_short():
    data = urllib.parse.quote_plus('http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg')
    decode_new(data)


def test_old_quote_plus_long():
    data = 'http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg' * 100
    data = urllib.parse.quote_plus(data)
    decode_old(data)


def test_new_quote_plus_long():
    data = 'http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg' * 100
    data = urllib.parse.quote_plus(data)
    decode_new(data)


if __name__ == '__main__':
    import timeit

    print('test short normal values using decode_old()')
    print(timeit.timeit("test_old_simple_short()", setup="from __main__ import test_old_simple_short", number=1000))
    print('test short normal values using decode_new()')
    print(timeit.timeit("test_new_simple_short()", setup="from __main__ import test_new_simple_short", number=1000))
    print('test long normal values using decode_old()')
    print(timeit.timeit("test_old_simple_long()", setup="from __main__ import test_old_simple_long", number=1000))
    print('test long normal values using decode_new()')
    print(timeit.timeit("test_new_simple_long()", setup="from __main__ import test_new_simple_long", number=1000))
    print('test short quote plus values using decode_old()')
    print(timeit.timeit("test_old_quote_plus_short()", setup="from __main__ import test_old_quote_plus_short", number=1000))
    print('test short quote plus values using decode_new()')
    print(timeit.timeit("test_new_quote_plus_short()", setup="from __main__ import test_new_quote_plus_short", number=1000))
    print('test long quote plus values using decode_old()')
    print(timeit.timeit("test_old_quote_plus_long()", setup="from __main__ import test_old_quote_plus_long", number=1000))
    print('test long quote plus values using decode_new()')
    print(timeit.timeit("test_new_quote_plus_long()", setup="from __main__ import test_new_quote_plus_long", number=1000))
And these are the results from my end:

python3.7 test_encode.py
test short normal values using decode_old()
0.00038655800744891167
test short normal values using decode_new()
0.00035922101233154535
test long normal values using decode_old()
0.0004725730395875871
test long normal values using decode_new()
0.00046735297655686736
test short quote plus values using decode_old()
0.027411639981437474
test short quote plus values using decode_new()
0.02820569701725617
test long quote plus values using decode_old()
1.7537030499661341
test long quote plus values using decode_new()
1.4266109309974127
Thanks @JerryKwan for the investigation and benchmarks! I'll take a look at it too since I was working on modernizing form handling.
I agree our current approach is questionable since it may perform poorly under PyPy. CPython, on the other hand, is heavily optimized for string += against a local variable, but as I understand it, that probably hits a wall when going above the Python allocator pool sizes, or something like that.
However, I find your approach using io.BytesIO quite unorthodox. Could you also benchmark appending bytestrings to a list, and then b''.join(that_list)? This is recommended in the PyPy tuning guidelines as the first-hand choice, and I think CPython recommends that for generic string concatenation, too.
To reiterate what @kgriffs said though, it is very unusual to use URL-encoded forms for large bodies, so our focus should be on small URLs/forms. But I definitely agree that even for large forms our handling should not blow up in an O(n²) fashion (which might be happening here from what I understand).
If necessary, we could try to check the size of encoded_uri and switch the algorithm based on that.
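For reference, a minimal sketch of the list-plus-join variant I have in mind, reusing the decode_old() structure and the _HEX_TO_BYTE table from the snippet above (decode_join is just an illustrative name, not Falcon API):

def decode_join(encoded_uri, unquote_plus=True):
    # Sketch only: same semantics as decode_old(), but collects the decoded
    # chunks in a list and joins them once at the end.
    decoded_uri = encoded_uri
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')
    if '%' not in decoded_uri:
        return decoded_uri

    tokens = decoded_uri.encode('utf-8').split(b'%')
    chunks = [tokens[0]]
    for token in tokens[1:]:
        try:
            # Decode the two hex digits, then append the rest of the token.
            chunks.append(_HEX_TO_BYTE[token[:2]])
            chunks.append(token[2:])
        except KeyError:
            # Malformed percent sequence like "x=%" or "y=%+"
            chunks.append(b'%')
            chunks.append(token)

    return b''.join(chunks).decode('utf-8', 'replace')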
For small numbers of chunks, concatenation (+=) tends to be more performant than b''.join(). IIRC, this was the case even for 2.6 and 2.7, before a lot of optimization was done in the runtime around that. Regardless, it would be interesting to benchmark the BytesIO approach against b''.join().
And we need to make sure we check all proposed solutions against 3.5, 3.6, and 3.7, and against both CPython and PyPy.
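For example, a throwaway driver along these lines could run the same benchmark script under each interpreter (the executable names are assumptions about the local environment):

# Sketch: run the benchmark under several interpreters, assuming each
# executable is installed and on PATH (adjust names as needed).
import subprocess

for python in ('python3.5', 'python3.6', 'python3.7', 'pypy3'):
    print('===', python)
    subprocess.run([python, 'test_encode.py'], check=False)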
I was benchmarking generic bytestring concatenation along these lines:

import io
import timeit

DATA = """
What People are Saying
We have been using Falcon as a replacement for [framework] and we simply love
the performance (three times faster) and code base size (easily half of our
original [framework] code).
Falcon looks great so far. I hacked together a quick test for a tiny server of
mine and was ~40% faster with only 20 minutes of work.” “Falcon is rock solid
and it’s fast.
I’m loving #falconframework! Super clean and simple, I finally have the speed
and flexibility I need!
I feel like I’m just talking HTTP at last, with nothing in the middle. Falcon
seems like the requests of backend.
The source code for Falcon is so good, I almost prefer it to documentation. It
basically can’t be wrong.
What other framework has integrated support for 786 TRY IT NOW ?
"""

ITEMS = tuple(word.encode() for word in DATA.split() if len(word) > 2) * 1024
print(len(ITEMS))


def benchmark_concat(count, number=int(1e5)):

    def approach1():
        result = b''
        for item in items:
            result += item
        return result

    def approach2():
        result = []
        for item in items:
            result.append(item)
        return b''.join(result)

    def approach3():
        return b''.join(item for item in items)

    def approach4():
        result = io.BytesIO()
        for item in items:
            result.write(item)
        return result.getvalue()

    items = ITEMS[:count]

    # Warm up functions
    timeit.timeit(approach1, number=number)
    timeit.timeit(approach2, number=number)
    timeit.timeit(approach3, number=number)
    timeit.timeit(approach4, number=number)

    print(
        count,
        (
            timeit.timeit(approach1, number=number),
            timeit.timeit(approach2, number=number),
            timeit.timeit(approach3, number=number),
            timeit.timeit(approach4, number=number),
        ))


for count in (3, 5, 8, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100):
    benchmark_concat(count)

benchmark_concat(1000, 10000)
benchmark_concat(10000, 1000)
benchmark_concat(100000, 10)
The output
python3.7 bench.py
110592
3 (0.02144782803952694, 0.031356992200016975, 0.044225919991731644, 0.04343338776379824)
5 (0.028610989917069674, 0.04261731030419469, 0.05122442310675979, 0.0607671239413321)
8 (0.04888148419559002, 0.06251737708225846, 0.06283955089747906, 0.07964882394298911)
10 (0.051398238632828, 0.0689754388295114, 0.0680475695990026, 0.09300704440101981)
15 (0.07813044590875506, 0.09609771007671952, 0.09159524599090219, 0.13005621312186122)
20 (0.09800408175215125, 0.13165172236040235, 0.11964801093563437, 0.15450798673555255)
30 (0.13904607575386763, 0.16506760893389583, 0.14640652015805244, 0.21083400398492813)
40 (0.1961963172070682, 0.2203828631900251, 0.18562251422554255, 0.26916765607893467)
50 (0.2484002630226314, 0.2684812028892338, 0.21915306989103556, 0.32866898318752646)
60 (0.2976999832317233, 0.31331878202036023, 0.25735463527962565, 0.3843465121462941)
70 (0.353005716111511, 0.3602594337426126, 0.29455818282440305, 0.43513593077659607)
80 (0.41087261587381363, 0.40120164630934596, 0.33170615369454026, 0.4951352826319635)
90 (0.47160040121525526, 0.45098415901884437, 0.36671631317585707, 0.5520307668484747)
100 (0.557110704947263, 0.4945222963578999, 0.4039875380694866, 0.6070281830616295)
1000 (1.151520217768848, 0.46339548798277974, 0.3604183937422931, 0.5241292156279087)
10000 (6.590064087882638, 0.45619268203154206, 0.360421403311193, 0.5087128123268485)
100000 (8.545706522185355, 0.052082180976867676, 0.04151772195473313, 0.0503646950237453)
Some takeaways from it, based on the number of items (short English words in this case):
- For a small number of items, local_var += performs best.
- For a larger number of items, local_var += starts blowing up clearly worse than O(n), and interestingly, a BytesIO starts outperforming a list, albeit very slightly.
- The discussion above does not include b''.join(iterator), which starts outperforming the other approaches at ~40 items. However, depending on the logic, it is not always possible to convert everything to iteration without extra overhead. But for simple iterations it seems to work great.
Here comes the same benchmark for PyPy3.6:
3 (0.009178994689136744, 0.009376228787004948, 0.018578666262328625, 0.013394333887845278)
5 (0.01192251406610012, 0.01242155022919178, 0.023621411062777042, 0.01531446073204279)
8 (0.012679930310696363, 0.01512245973572135, 0.030060097109526396, 0.016992149874567986)
10 (0.014907791744917631, 0.019190983846783638, 0.03503640089184046, 0.019363478291779757)
15 (0.019128676038235426, 0.023180575110018253, 0.04538170574232936, 0.02520340494811535)
20 (0.02397815603762865, 0.029557771049439907, 0.059623430017381907, 0.032457142136991024)
30 (0.0349110858514905, 0.040592003148049116, 0.08166458597406745, 0.04336934583261609)
40 (0.04713146481662989, 0.05265957396477461, 0.10618857899680734, 0.05350310308858752)
50 (0.05956967966631055, 0.06458686897531152, 0.12860003719106317, 0.06669273413717747)
60 (0.07337774988263845, 0.07615477964282036, 0.15194117790088058, 0.07913599396124482)
70 (0.08826563088223338, 0.08873719209805131, 0.17371911695227027, 0.08756348071619868)
80 (0.10602666484192014, 0.10031882021576166, 0.20231699803844094, 0.09858379233628511)
90 (0.1267012502066791, 0.11499356525018811, 0.227923349943012, 0.11141298478469253)
100 (0.14604800380766392, 0.12487248610705137, 0.24984144000336528, 0.12241428019478917)
1000 (0.8727836729958653, 0.12858159141615033, 0.23260656977072358, 0.11514023086056113)
10000 (9.508226448204368, 0.1423673778772354, 0.2589253569021821, 0.11189583828672767)
100000 (26.066565764136612, 0.017521890345960855, 0.03390416409820318, 0.011629929300397635)
On PyPy, an iterator works worse, and a list pays off earlier. local_var += explodes even worse. What is even more interesting, BytesIO (as suggested by @JerryKwan) gets pretty performant quite early on, something I was not expecting on PyPy.
Regarding decode and parse_query_string, I'll look into reimplementing them in Cython.
Regarding concatenating bytestrings, I found an interesting note in the newest 3.x CPython docs: they recommend bytearray_var += item, although they mention BytesIO too as a good all-rounder. I'll add the bytearray approach to the benchmark above.
FWIW, the bytearray_var += approach indeed outdoes anything else on CPython 3.7 (except for a low count of items). It is less clear on PyPy though. I'll post results later.
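For concreteness, a bytearray-based variant of the decode loop could look roughly like this (decode_bytearray is just an illustrative name; it reuses the _HEX_TO_BYTE table from the earlier snippet):

def decode_bytearray(encoded_uri, unquote_plus=True):
    # Sketch only: same semantics as decode_old(), but accumulates the
    # decoded bytes in a mutable bytearray via +=.
    decoded_uri = encoded_uri
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')
    if '%' not in decoded_uri:
        return decoded_uri

    tokens = decoded_uri.encode('utf-8').split(b'%')
    buf = bytearray(tokens[0])
    for token in tokens[1:]:
        try:
            buf += _HEX_TO_BYTE[token[:2]]
            buf += token[2:]
        except KeyError:
            # Malformed percent sequence like "x=%" or "y=%+"
            buf += b'%'
            buf += token

    return buf.decode('utf-8', 'replace')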
@vytas7 Would you please share the test code? Thank you.
@JerryKwan the test is inlined in one of my earlier comments. I'll update it to reflect the bytearray test too.
Thanks @JerryKwan @vytas7 for digging into this!
@vytas7, what I meant was the bytearray test code. Then I can adapt my code to run some tests in my environment using real data.
Ah, sorry, here you are (the updated version)!
import io
import timeit

DATA = """
What People are Saying
We have been using Falcon as a replacement for [framework] and we simply love
the performance (three times faster) and code base size (easily half of our
original [framework] code).
Falcon looks great so far. I hacked together a quick test for a tiny server of
mine and was ~40% faster with only 20 minutes of work.” “Falcon is rock solid
and it’s fast.
I’m loving #falconframework! Super clean and simple, I finally have the speed
and flexibility I need!
I feel like I’m just talking HTTP at last, with nothing in the middle. Falcon
seems like the requests of backend.
The source code for Falcon is so good, I almost prefer it to documentation. It
basically can’t be wrong.
What other framework has integrated support for 786 TRY IT NOW ?
"""

ITEMS = tuple(word.encode() for word in DATA.split() if len(word) > 2) * 1024
print(len(ITEMS))


def benchmark_concat(count, number=int(1e5)):

    def approach1():
        result = b''
        for item in items:
            result += item
        return result

    def approach2():
        result = []
        for item in items:
            result.append(item)
        return b''.join(result)

    def approach3():
        return b''.join(item for item in items)

    def approach4():
        result = io.BytesIO()
        for item in items:
            result.write(item)
        return result.getvalue()

    def approach5():
        result = bytearray()
        for item in items:
            result += item
        return bytes(result)

    items = ITEMS[:count]

    # Warm up functions
    timeit.timeit(approach1, number=number)
    timeit.timeit(approach2, number=number)
    timeit.timeit(approach3, number=number)
    timeit.timeit(approach4, number=number)
    timeit.timeit(approach5, number=number)

    print(
        count,
        (
            timeit.timeit(approach1, number=number),
            timeit.timeit(approach2, number=number),
            timeit.timeit(approach3, number=number),
            timeit.timeit(approach4, number=number),
            timeit.timeit(approach5, number=number),
        ))


for count in (3, 5, 8, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100):
    benchmark_concat(count)

benchmark_concat(1000, 10000)
benchmark_concat(10000, 1000)
benchmark_concat(100000, 10)
@vytas7, thank you for providing the snippets. Here are my test results:
test_encode.py
test short normal values using decode_old()| decode_bytesio()| decode_bytearray()
0.00036158598959445953 0.00042326096445322037 0.00036234501749277115
test long normal values using decode_old()| decode_bytesio()| decode_bytearray()
0.0005342649528756738 0.0005105879390612245 0.0004338329890742898
test short quote plus values using decode_old()| decode_bytesio()| decode_bytearray()
0.02455015096347779 0.033037879038602114 0.026454834965988994
test long quote plus values using decode_old()| decode_bytesio()| decode_bytearray()
1.8134915790287778 1.5716652729315683 1.4036276719998568
As you can see, bytearray is the most performant approach for concatenating a huge number of items.
@JerryKwan I have some proof-of-concept implementations almost working in Cython; I hope to create a PR sometime soon. If that works, we can optimize the non-Cythonized version to be a good all-rounder that also performs well on PyPy.
@vytas7, that's really great! If there is anything I can help with, feel free to let me know. Thank you.
@JerryKwan looking at your GH profile, you're from China, right? What do HTML forms and web APIs normally look like in China? Do you use English URI patterns and query params, or are they normally in Chinese too? Regardless of the details though, I feel this issue highlights that our development/CI/testing is somewhat ASCII-centric. I'll think about how we could improve this. @kgriffs thoughts?
@vytas7 Yes, I am living in China right now. Generally speaking, we use English URI patterns, and the query keys are in English too, but the query values may be in Chinese. Would you please give me your email address? Thank you.
@JerryKwan Could I get your form example to run some tests on?
I have rewritten the thing in Cython and it is hopefully starting to shape up, with the following caveats:
- it currently only covers the falcon.cyutil.uri.parse_query_string function
- it does not support the csv option (yet)
- the decode() routine is not exposed separately, and if it were, it would not support switching off the '+' :arrow_right: ' ' conversion (yet)

You can check it out yourself too: https://github.com/vytas7/falcon/tree/cython-parse-query-string
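For anyone trying the branch, a quick sanity check that the Cythonized function still behaves like the pure-Python one could look like this (the Chinese value is just an arbitrary percent-encoded UTF-8 example; the expected output assumes the usual handling of repeated keys as a list):

from falcon.uri import parse_query_string

# '%E5%BC%A0%E4%B8%89' is the UTF-8 percent-encoding of '张三'
params = parse_query_string('name=%E5%BC%A0%E4%B8%89&tag=a&tag=b')
print(params)  # expected: {'name': '张三', 'tag': ['a', 'b']}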
@JerryKwan could you please retest with the Cythonized version of https://github.com/vytas7/falcon/tree/cython-parse-query-string (https://github.com/falconry/falcon/pull/1604)?
I have not fixed it for the pure-Python version yet, but you could force cythonization and check whether this solves your performance issues (in your virtual environment; this assumes you are using CPython, as PyPy is still unaddressed):
pip install Cython
pip install --no-use-pep517 git+https://github.com/vytas7/falcon@cython-parse-query-string
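To verify the extension actually got built (assuming it is exposed as falcon.cyutil.uri, as mentioned above), something like this should do:

# If this import fails, the install fell back to the pure-Python implementation.
try:
    from falcon.cyutil.uri import parse_query_string  # noqa: F401
    print('Cythonized parse_query_string is available')
except ImportError:
    print('Cython extension not built; still using the pure-Python version')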
Personally, I believe I'm seeing an improvement when testing with the Cython version as per the instructions above:
>>> from falcon.uri import decode
>>> import timeit
>>> DATA = 'Hello%20World%20just%20testing' * 10000
>>> timeit.timeit((lambda: decode(DATA)), number=10)
0.006680036000034306
>>> timeit.timeit((lambda: decode(DATA)), number=1000)
0.23197350000009465
>>>
Installing vanilla Falcon 2.0 from PyPI:
>>> from falcon.uri import decode
>>> import timeit
>>> DATA = 'Hello%20World%20just%20testing' * 10000
>>> timeit.timeit((lambda: decode(DATA)), number=10)
0.9825585139988107
>>> timeit.timeit((lambda: decode(DATA)), number=1000)
... didn't wait for ...
One could grab a :coffee: and compute the last value :arrow_upper_left:
@vytas7 This is the test result from my end, using the same data.

1. Using:
pip install Cython
pip install Falcon

[root@host ~]# python3.7 test_decode.py
23.170549132861197

2. Using:
pip install Cython
pip install --no-use-pep517 git+https://github.com/vytas7/falcon@cython-parse-query-string

[root@host ~]# python3.7 test_decode.py
0.010890404228121042

As you can see, your new version looks really promising.
When Cython is not available in the environment, I like the idea of branching based on len(encoded_uri) as well as on PyPy vs. CPython (since the length threshold is different, and so is the best-performing approach).
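Roughly, I'm picturing something like the following (the threshold is a placeholder that would need tuning per interpreter; decode_old and decode_bytearray refer to the variants sketched earlier in this thread):

import platform

# Sketch only: the threshold is an assumption, not a measured value.
_IS_PYPY = platform.python_implementation() == 'PyPy'
_SMALL_INPUT_THRESHOLD = 1024


def decode_dispatch(encoded_uri, unquote_plus=True):
    # On CPython, plain += concatenation wins for small inputs.
    if not _IS_PYPY and len(encoded_uri) < _SMALL_INPUT_THRESHOLD:
        return decode_old(encoded_uri, unquote_plus)
    # On PyPy, or for large inputs, avoid quadratic bytes concatenation.
    return decode_bytearray(encoded_uri, unquote_plus)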
It's really slow to invoke req._parse_form_urlencoded() when the form body is huge. After digging deeper, I found it was caused by the function decode(encoded_uri, unquote_plus=True) defined in falcon/util/uri.py. The root cause is the poor performance of repeated string concatenation. A small refactor can improve performance greatly. The code before:
The optimized code:
And the time cost before optimizing:

python3.7 test.py
2019-10-27 23:25:38,030 urllib3.connectionpool: DEBUG Starting new HTTP connection (1): 127.0.0.1:8080
2019-10-27 23:26:04,819 urllib3.connectionpool: DEBUG http://127.0.0.1:8080 "POST /api/import/postdata/async HTTP/1.1" 200 146
2019-10-27 23:26:04,820 main : DEBUG post data return: {"status":"ok","error_msg":"","successRecords":[],"failedRecords":[],"totalSuccessRecords":500,"totalFailedRecords":0,"timeElapsed":26.7830269337}
2019-10-27 23:26:04,821 main : DEBUG time elapsed: 27.266945600509644

After optimizing:

python3.7 test.py
2019-10-27 23:54:09,490 urllib3.connectionpool: DEBUG Starting new HTTP connection (1): 127.0.0.1:8080
2019-10-27 23:54:09,791 urllib3.connectionpool: DEBUG http://127.0.0.1:8080 "POST /api/import/postdata/async HTTP/1.1" 200 145
2019-10-27 23:54:09,794 main : DEBUG post data return: {"status":"ok","error_msg":"","successRecords":[],"failedRecords":[],"totalSuccessRecords":500,"totalFailedRecords":0,"timeElapsed":0.2936162949}
2019-10-27 23:54:09,794 main : DEBUG time elapsed: 0.862856388092041
As you can see, there is a great improvement.