@vytas7 Did you get a chance to look at this? Thank you. If there is anything I can help with, feel free to leave me a message.
Hi @JerryKwan, thanks for looking into this. How does this change affect small form bodies? I want to make sure that the final solution is performant for both small and large forms, especially since this function is also used for query strings, which typically only include a few fields.
Hi @kgriffs, most of the time it will perform better than the old version. You can use the following snippet to run some tests:
import io
import time
import urllib
import urllib.parse

# NOTE(kgriffs): See also RFC 3986
_UNRESERVED = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
               'abcdefghijklmnopqrstuvwxyz'
               '0123456789'
               '-._~')

# NOTE(kgriffs): See also RFC 3986
_DELIMITERS = ":/?#[]@!$&'()*+,;="
_ALL_ALLOWED = _UNRESERVED + _DELIMITERS

_HEX_DIGITS = '0123456789ABCDEFabcdef'

# This map construction is based on urllib's implementation
_HEX_TO_BYTE = dict(((a + b).encode(), bytes([int(a + b, 16)]))
                    for a in _HEX_DIGITS
                    for b in _HEX_DIGITS)


def decode_old(encoded_uri, unquote_plus=True):
    """Decodes percent-encoded characters in a URI or query string.

    This function models the behavior of `urllib.parse.unquote_plus`,
    albeit in a faster, more straightforward manner.

    Args:
        encoded_uri (str): An encoded URI (full or partial).

    Keyword Arguments:
        unquote_plus (bool): Set to ``False`` to retain any plus ('+')
            characters in the given string, rather than converting them to
            spaces (default ``True``). Typically you should set this
            to ``False`` when decoding any part of a URI other than the
            query string.

    Returns:
        str: A decoded URL. If the URL contains escaped non-ASCII
        characters, UTF-8 is assumed per RFC 3986.
    """
    decoded_uri = encoded_uri

    # PERF(kgriffs): Don't take the time to instantiate a new
    # string unless we have to.
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')

    # Short-circuit if we can
    if '%' not in decoded_uri:
        return decoded_uri

    # NOTE(kgriffs): Clients should never submit a URI that has
    # unescaped non-ASCII chars in them, but just in case they
    # do, let's encode into a non-lossy format.
    decoded_uri = decoded_uri.encode('utf-8')

    # PERF(kgriffs): This was found to be faster than using
    # a regex sub call or list comprehension with a join.
    tokens = decoded_uri.split(b'%')
    decoded_uri = tokens[0]
    for token in tokens[1:]:
        token_partial = token[:2]
        try:
            decoded_uri += _HEX_TO_BYTE[token_partial] + token[2:]
        except KeyError:
            # malformed percentage like "x=%" or "y=%+"
            decoded_uri += b'%' + token

    # Convert back to str
    return decoded_uri.decode('utf-8', 'replace')


def decode_new(encoded_uri, unquote_plus=True):
    """Decodes percent-encoded characters in a URI or query string.

    Same behavior as decode_old(), but accumulates the decoded bytes
    in an io.BytesIO buffer instead of repeatedly concatenating
    bytestrings.

    Args:
        encoded_uri (str): An encoded URI (full or partial).

    Keyword Arguments:
        unquote_plus (bool): Set to ``False`` to retain any plus ('+')
            characters in the given string, rather than converting them to
            spaces (default ``True``). Typically you should set this
            to ``False`` when decoding any part of a URI other than the
            query string.

    Returns:
        str: A decoded URL. If the URL contains escaped non-ASCII
        characters, UTF-8 is assumed per RFC 3986.
    """
    decoded_uri = encoded_uri

    # PERF(kgriffs): Don't take the time to instantiate a new
    # string unless we have to.
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')

    # Short-circuit if we can
    if '%' not in decoded_uri:
        return decoded_uri

    # NOTE(kgriffs): Clients should never submit a URI that has
    # unescaped non-ASCII chars in them, but just in case they
    # do, let's encode into a non-lossy format.
    decoded_uri = decoded_uri.encode('utf-8')

    # PERF(kgriffs): This was found to be faster than using
    # a regex sub call or list comprehension with a join.
    tokens = decoded_uri.split(b'%')
    buffer = io.BytesIO()
    buffer.write(tokens[0])
    for token in tokens[1:]:
        token_partial = token[:2]
        try:
            buffer.write(_HEX_TO_BYTE[token_partial])
            buffer.write(token[2:])
        except KeyError:
            # malformed percentage like "x=%" or "y=%+"
            buffer.write(b'%')
            buffer.write(token)

    # Convert back to str
    return buffer.getvalue().decode('utf-8', 'replace')


def test(encoded_uri):
    print('len(encoded_uri) = ', len(encoded_uri))

    start = time.time()
    decode_old(encoded_uri)
    delta = time.time() - start
    print('decode_old elapsed: %s' % delta)

    start = time.time()
    decode_new(encoded_uri)
    delta = time.time() - start
    print('decode_new elapsed: %s' % delta)


def test_old_simple_short():
    data = '1234567890'
    decode_old(data)


def test_new_simple_short():
    data = '1234567890'
    decode_new(data)


def test_old_simple_long():
    data = '1234567890' * 100
    decode_old(data)


def test_new_simple_long():
    data = '1234567890' * 100
    decode_new(data)


def test_old_quote_plus_short():
    data = urllib.parse.quote_plus('http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg')
    decode_old(data)


def test_new_quote_plus_short():
    data = urllib.parse.quote_plus('http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg')
    decode_new(data)


def test_old_quote_plus_long():
    data = 'http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg' * 100
    data = urllib.parse.quote_plus(data)
    decode_old(data)


def test_new_quote_plus_long():
    data = 'http://cdn-w.v12soft.com/photos/MhTz3d9/12082258/87407_1_dtm1235-2_172518_800600.jpg' * 100
    data = urllib.parse.quote_plus(data)
    decode_new(data)


if __name__ == '__main__':
    import timeit

    print('test short normal values using decode_old()')
    print(timeit.timeit("test_old_simple_short()", setup="from __main__ import test_old_simple_short", number=1000))
    print('test short normal values using decode_new()')
    print(timeit.timeit("test_new_simple_short()", setup="from __main__ import test_new_simple_short", number=1000))
    print('test long normal values using decode_old()')
    print(timeit.timeit("test_old_simple_long()", setup="from __main__ import test_old_simple_long", number=1000))
    print('test long normal values using decode_new()')
    print(timeit.timeit("test_new_simple_long()", setup="from __main__ import test_new_simple_long", number=1000))
    print('test short quote plus values using decode_old()')
    print(timeit.timeit("test_old_quote_plus_short()", setup="from __main__ import test_old_quote_plus_short", number=1000))
    print('test short quote plus values using decode_new()')
    print(timeit.timeit("test_new_quote_plus_short()", setup="from __main__ import test_new_quote_plus_short", number=1000))
    print('test long quote plus values using decode_old()')
    print(timeit.timeit("test_old_quote_plus_long()", setup="from __main__ import test_old_quote_plus_long", number=1000))
    print('test long quote plus values using decode_new()')
    print(timeit.timeit("test_new_quote_plus_long()", setup="from __main__ import test_new_quote_plus_long", number=1000))
And these are the results from my end:

python3.7 test_encode.py
test short normal values using decode_old()
0.00038655800744891167
test short normal values using decode_new()
0.00035922101233154535
test long normal values using decode_old()
0.0004725730395875871
test long normal values using decode_new()
0.00046735297655686736
test short quote plus values using decode_old()
0.027411639981437474
test short quote plus values using decode_new()
0.02820569701725617
test long quote plus values using decode_old()
1.7537030499661341
test long quote plus values using decode_new()
1.4266109309974127
Thanks @JerryKwan for the investigation and benchmarks! I'll take a look at it too since I was working on modernizing form handling.
I agree our current approach is questionable since it may perform poorly under PyPy. CPython, on the other hand, is heavily optimized for string += against a local variable, but as I understand it, that probably hits a wall when going above the Python allocator pool sizes, or something like that.
However, I find your approach using io.BytesIO quite unorthodox. Could you also benchmark appending bytestrings to a list, and then b''.join(that_list)? This is recommended in the PyPy tuning guidelines as the first-hand choice, and I think CPython recommends that for generic string concatenation, too.
To reiterate what @kgriffs said though, it is very unusual to use URL-encoded forms for large bodies, so our focus should be on small URLs/forms. But I definitely agree that even for large forms our handling should not blow up in an O(n²) fashion (which might be happening here from what I understand).
If necessary, we could try to check the size of encoded_uri and switch the algorithm based on that.
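For reference, a minimal sketch of the list-plus-join variant I have in mind, reusing the decode_old() structure and the _HEX_TO_BYTE table from the snippet above (decode_join is just an illustrative name, not Falcon API):

def decode_join(encoded_uri, unquote_plus=True):
    # Sketch only: same semantics as decode_old(), but collects the decoded
    # chunks in a list and joins them once at the end.
    decoded_uri = encoded_uri
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')
    if '%' not in decoded_uri:
        return decoded_uri

    tokens = decoded_uri.encode('utf-8').split(b'%')
    chunks = [tokens[0]]
    for token in tokens[1:]:
        try:
            # Decode the two hex digits, then append the rest of the token.
            chunks.append(_HEX_TO_BYTE[token[:2]])
            chunks.append(token[2:])
        except KeyError:
            # Malformed percent sequence like "x=%" or "y=%+"
            chunks.append(b'%')
            chunks.append(token)

    return b''.join(chunks).decode('utf-8', 'replace')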
For small numbers of chunks, concatenation (+=) tends to be more performant than b''.join(). IIRC, this was the case even for 2.6 and 2.7, before a lot of optimization was done in the runtime around that. Regardless, it would be interesting to benchmark the BytesIO approach against b''.join().
And we need to make sure we check all proposed solutions against 3.5, 3.6, and 3.7, and against both CPython and PyPy.
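For example, a throwaway driver along these lines could run the same benchmark script under each interpreter (the executable names are assumptions about the local environment):

# Sketch: run the benchmark under several interpreters, assuming each
# executable is installed and on PATH (adjust names as needed).
import subprocess

for python in ('python3.5', 'python3.6', 'python3.7', 'pypy3'):
    print('===', python)
    subprocess.run([python, 'test_encode.py'], check=False)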
I was benchmarking generic bytestring concatenation along these lines:

import io
import timeit

DATA = """
What People are Saying
We have been using Falcon as a replacement for [framework] and we simply love
the performance (three times faster) and code base size (easily half of our
original [framework] code).
Falcon looks great so far. I hacked together a quick test for a tiny server of
mine and was ~40% faster with only 20 minutes of work.” “Falcon is rock solid
and it’s fast.
I’m loving #falconframework! Super clean and simple, I finally have the speed
and flexibility I need!
I feel like I’m just talking HTTP at last, with nothing in the middle. Falcon
seems like the requests of backend.
The source code for Falcon is so good, I almost prefer it to documentation. It
basically can’t be wrong.
What other framework has integrated support for 786 TRY IT NOW ?
"""

ITEMS = tuple(word.encode() for word in DATA.split() if len(word) > 2) * 1024
print(len(ITEMS))


def benchmark_concat(count, number=int(1e5)):

    def approach1():
        result = b''
        for item in items:
            result += item
        return result

    def approach2():
        result = []
        for item in items:
            result.append(item)
        return b''.join(result)

    def approach3():
        return b''.join(item for item in items)

    def approach4():
        result = io.BytesIO()
        for item in items:
            result.write(item)
        return result.getvalue()

    items = ITEMS[:count]

    # Warm up functions
    timeit.timeit(approach1, number=number)
    timeit.timeit(approach2, number=number)
    timeit.timeit(approach3, number=number)
    timeit.timeit(approach4, number=number)

    print(
        count,
        (
            timeit.timeit(approach1, number=number),
            timeit.timeit(approach2, number=number),
            timeit.timeit(approach3, number=number),
            timeit.timeit(approach4, number=number),
        ))


for count in (3, 5, 8, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100):
    benchmark_concat(count)

benchmark_concat(1000, 10000)
benchmark_concat(10000, 1000)
benchmark_concat(100000, 10)
The output
python3.7 bench.py
110592
3 (0.02144782803952694, 0.031356992200016975, 0.044225919991731644, 0.04343338776379824)
5 (0.028610989917069674, 0.04261731030419469, 0.05122442310675979, 0.0607671239413321)
8 (0.04888148419559002, 0.06251737708225846, 0.06283955089747906, 0.07964882394298911)
10 (0.051398238632828, 0.0689754388295114, 0.0680475695990026, 0.09300704440101981)
15 (0.07813044590875506, 0.09609771007671952, 0.09159524599090219, 0.13005621312186122)
20 (0.09800408175215125, 0.13165172236040235, 0.11964801093563437, 0.15450798673555255)
30 (0.13904607575386763, 0.16506760893389583, 0.14640652015805244, 0.21083400398492813)
40 (0.1961963172070682, 0.2203828631900251, 0.18562251422554255, 0.26916765607893467)
50 (0.2484002630226314, 0.2684812028892338, 0.21915306989103556, 0.32866898318752646)
60 (0.2976999832317233, 0.31331878202036023, 0.25735463527962565, 0.3843465121462941)
70 (0.353005716111511, 0.3602594337426126, 0.29455818282440305, 0.43513593077659607)
80 (0.41087261587381363, 0.40120164630934596, 0.33170615369454026, 0.4951352826319635)
90 (0.47160040121525526, 0.45098415901884437, 0.36671631317585707, 0.5520307668484747)
100 (0.557110704947263, 0.4945222963578999, 0.4039875380694866, 0.6070281830616295)
1000 (1.151520217768848, 0.46339548798277974, 0.3604183937422931, 0.5241292156279087)
10000 (6.590064087882638, 0.45619268203154206, 0.360421403311193, 0.5087128123268485)
100000 (8.545706522185355, 0.052082180976867676, 0.04151772195473313, 0.0503646950237453)
Some takeaways from it, based on the number of items (short English words in this case):
- For a small number of items, local_var += performs best.
- For a larger number of items, local_var += starts blowing up clearly worse than O(n), and interestingly, a BytesIO starts outperforming a list, albeit very slightly.
- The discussion above does not include b''.join(iterator), which starts outperforming the other approaches at ~40 items. However, depending on the logic, it is not always possible to convert everything to iteration without extra overhead. But for simple iterations it seems to work great.
Here comes the same benchmark for PyPy3.6:
3 (0.009178994689136744, 0.009376228787004948, 0.018578666262328625, 0.013394333887845278)
5 (0.01192251406610012, 0.01242155022919178, 0.023621411062777042, 0.01531446073204279)
8 (0.012679930310696363, 0.01512245973572135, 0.030060097109526396, 0.016992149874567986)
10 (0.014907791744917631, 0.019190983846783638, 0.03503640089184046, 0.019363478291779757)
15 (0.019128676038235426, 0.023180575110018253, 0.04538170574232936, 0.02520340494811535)
20 (0.02397815603762865, 0.029557771049439907, 0.059623430017381907, 0.032457142136991024)
30 (0.0349110858514905, 0.040592003148049116, 0.08166458597406745, 0.04336934583261609)
40 (0.04713146481662989, 0.05265957396477461, 0.10618857899680734, 0.05350310308858752)
50 (0.05956967966631055, 0.06458686897531152, 0.12860003719106317, 0.06669273413717747)
60 (0.07337774988263845, 0.07615477964282036, 0.15194117790088058, 0.07913599396124482)
70 (0.08826563088223338, 0.08873719209805131, 0.17371911695227027, 0.08756348071619868)
80 (0.10602666484192014, 0.10031882021576166, 0.20231699803844094, 0.09858379233628511)
90 (0.1267012502066791, 0.11499356525018811, 0.227923349943012, 0.11141298478469253)
100 (0.14604800380766392, 0.12487248610705137, 0.24984144000336528, 0.12241428019478917)
1000 (0.8727836729958653, 0.12858159141615033, 0.23260656977072358, 0.11514023086056113)
10000 (9.508226448204368, 0.1423673778772354, 0.2589253569021821, 0.11189583828672767)
100000 (26.066565764136612, 0.017521890345960855, 0.03390416409820318, 0.011629929300397635)
On PyPy, an iterator works worse, and a list pays off earlier. local_var += explodes even worse. What is even more interesting, BytesIO (as suggested by @JerryKwan) gets pretty performant quite early on, something I was not expecting on PyPy.
Regarding decode and parse_query_string, I'll look into reimplementing them in Cython.
Regarding concatenating bytestrings, I found an interesting note in the newest 3.x CPython docs: they recommend bytearray_var += item, although they mention BytesIO too as a good all-rounder. I'll add the bytearray approach to the benchmark above.
FWIW, the bytearray_var += approach indeed outdoes anything else on CPython 3.7 (except for a low count of items). It is less clear on PyPy though. I'll post results later.
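For concreteness, a bytearray-based variant of the decode loop could look roughly like this (decode_bytearray is just an illustrative name; it reuses the _HEX_TO_BYTE table from the earlier snippet):

def decode_bytearray(encoded_uri, unquote_plus=True):
    # Sketch only: same semantics as decode_old(), but accumulates the
    # decoded bytes in a mutable bytearray via +=.
    decoded_uri = encoded_uri
    if '+' in decoded_uri and unquote_plus:
        decoded_uri = decoded_uri.replace('+', ' ')
    if '%' not in decoded_uri:
        return decoded_uri

    tokens = decoded_uri.encode('utf-8').split(b'%')
    buf = bytearray(tokens[0])
    for token in tokens[1:]:
        try:
            buf += _HEX_TO_BYTE[token[:2]]
            buf += token[2:]
        except KeyError:
            # Malformed percent sequence like "x=%" or "y=%+"
            buf += b'%'
            buf += token

    return buf.decode('utf-8', 'replace')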
@vytas7 Would you please share the test code? Thank you.
@JerryKwan the test is inlined in one of my earlier comments. I'll update it to reflect the bytearray test too.
Thanks @JerryKwan @vytas7 for digging into this!
@vytas7, what I meant was the bytearray test code. Then I can adapt my code to run some tests in my environment using real data.
Ah, sorry, here you are (the updated version)!
import io
import timeit

DATA = """
What People are Saying
We have been using Falcon as a replacement for [framework] and we simply love
the performance (three times faster) and code base size (easily half of our
original [framework] code).
Falcon looks great so far. I hacked together a quick test for a tiny server of
mine and was ~40% faster with only 20 minutes of work.” “Falcon is rock solid
and it’s fast.
I’m loving #falconframework! Super clean and simple, I finally have the speed
and flexibility I need!
I feel like I’m just talking HTTP at last, with nothing in the middle. Falcon
seems like the requests of backend.
The source code for Falcon is so good, I almost prefer it to documentation. It
basically can’t be wrong.
What other framework has integrated support for 786 TRY IT NOW ?
"""

ITEMS = tuple(word.encode() for word in DATA.split() if len(word) > 2) * 1024
print(len(ITEMS))


def benchmark_concat(count, number=int(1e5)):

    def approach1():
        result = b''
        for item in items:
            result += item
        return result

    def approach2():
        result = []
        for item in items:
            result.append(item)
        return b''.join(result)

    def approach3():
        return b''.join(item for item in items)

    def approach4():
        result = io.BytesIO()
        for item in items:
            result.write(item)
        return result.getvalue()

    def approach5():
        result = bytearray()
        for item in items:
            result += item
        return bytes(result)

    items = ITEMS[:count]

    # Warm up functions
    timeit.timeit(approach1, number=number)
    timeit.timeit(approach2, number=number)
    timeit.timeit(approach3, number=number)
    timeit.timeit(approach4, number=number)
    timeit.timeit(approach5, number=number)

    print(
        count,
        (
            timeit.timeit(approach1, number=number),
            timeit.timeit(approach2, number=number),
            timeit.timeit(approach3, number=number),
            timeit.timeit(approach4, number=number),
            timeit.timeit(approach5, number=number),
        ))


for count in (3, 5, 8, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100):
    benchmark_concat(count)

benchmark_concat(1000, 10000)
benchmark_concat(10000, 1000)
benchmark_concat(100000, 10)
@vytas7, thank you for providing the snippets. Here are my test results:
test_encode.py
test short normal values using decode_old()| decode_bytesio()| decode_bytearray()
0.00036158598959445953 0.00042326096445322037 0.00036234501749277115
test long normal values using decode_old()| decode_bytesio()| decode_bytearray()
0.0005342649528756738 0.0005105879390612245 0.0004338329890742898
test short quote plus values using decode_old()| decode_bytesio()| decode_bytearray()
0.02455015096347779 0.033037879038602114 0.026454834965988994
test long quote plus values using decode_old()| decode_bytesio()| decode_bytearray()
1.8134915790287778 1.5716652729315683 1.4036276719998568
As you can see, bytearray is the most performant approach for concatenating a huge number of items.
@JerryKwan I have some proof-of-concept implementations almost working in Cython; I hope to create a PR sometime soon. If that works, we can optimize the non-Cythonized version to be a good all-rounder that also performs well on PyPy.
@vytas7, that's really great! If there is anything I can help with, feel free to let me know. Thank you.
@JerryKwan looking at your GH profile, you're from China, right? What do HTML forms and web APIs normally look like in China? Do you use English URI patterns and query params, or are they normally in Chinese too? Regardless of the details though, I feel this issue highlights that our development/CI/testing is somewhat ASCII-centric. I'll think about how we could improve this. @kgriffs thoughts?
@vytas7 Yes, I am living in China right now. Generally speaking, we use English URI patterns, and the query keys are in English too, but the query values may be in Chinese. Would you please give me your email address? Thank you.
@JerryKwan Could I get your form example to run some tests on?
I have rewritten the thing in Cython and it is hopefully starting to shape up, with the following caveats:
- it currently only covers the falcon.cyutil.uri.parse_query_string function
- it does not support the csv option (yet)
- the decode() routine is not exposed separately, and if it were, it would not support switching off the '+' :arrow_right: ' ' conversion (yet)

You can check it out yourself too: https://github.com/vytas7/falcon/tree/cython-parse-query-string
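For anyone trying the branch, a quick sanity check that the Cythonized function still behaves like the pure-Python one could look like this (the Chinese value is just an arbitrary percent-encoded UTF-8 example; the expected output assumes the usual handling of repeated keys as a list):

from falcon.uri import parse_query_string

# '%E5%BC%A0%E4%B8%89' is the UTF-8 percent-encoding of '张三'
params = parse_query_string('name=%E5%BC%A0%E4%B8%89&tag=a&tag=b')
print(params)  # expected: {'name': '张三', 'tag': ['a', 'b']}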
@JerryKwan could you please retest with the Cythonized version of https://github.com/vytas7/falcon/tree/cython-parse-query-string (https://github.com/falconry/falcon/pull/1604)?
I have not fixed it for the pure-Python version yet, but you could force cythonization and check whether this solves your performance issues (in your virtual environment; this assumes you are using CPython, as PyPy is still unaddressed):
pip install Cython
pip install --no-use-pep517 git+https://github.com/vytas7/falcon@cython-parse-query-string
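To verify the extension actually got built (assuming it is exposed as falcon.cyutil.uri, as mentioned above), something like this should do:

# If this import fails, the install fell back to the pure-Python implementation.
try:
    from falcon.cyutil.uri import parse_query_string  # noqa: F401
    print('Cythonized parse_query_string is available')
except ImportError:
    print('Cython extension not built; still using the pure-Python version')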
Personally, I believe I'm seeing an improvement when testing with the Cython version as per the instructions above:
>>> from falcon.uri import decode
>>> import timeit
>>> DATA = 'Hello%20World%20just%20testing' * 10000
>>> timeit.timeit((lambda: decode(DATA)), number=10)
0.006680036000034306
>>> timeit.timeit((lambda: decode(DATA)), number=1000)
0.23197350000009465
>>>
Installing vanilla Falcon 2.0 from PyPI:
>>> from falcon.uri import decode
>>> import timeit
>>> DATA = 'Hello%20World%20just%20testing' * 10000
>>> timeit.timeit((lambda: decode(DATA)), number=10)
0.9825585139988107
>>> timeit.timeit((lambda: decode(DATA)), number=1000)
... didn't wait for ...
One could grab a :coffee: and compute the last value :arrow_upper_left:
@vytas7 This is the test result from my end, using the same data.

1. Using:
pip install Cython
pip install Falcon

[root@host ~]# python3.7 test_decode.py
23.170549132861197

2. Using:
pip install Cython
pip install --no-use-pep517 git+https://github.com/vytas7/falcon@cython-parse-query-string

[root@host ~]# python3.7 test_decode.py
0.010890404228121042

As you can see, your new version looks really promising.
When Cython is not available in the environment, I like the idea of branching based on len(encoded_uri) as well as on PyPy vs. CPython (since the length threshold is different, and so is the best-performing approach).
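Roughly, I'm picturing something like the following (the threshold is a placeholder that would need tuning per interpreter; decode_old and decode_bytearray refer to the variants sketched earlier in this thread):

import platform

# Sketch only: the threshold is an assumption, not a measured value.
_IS_PYPY = platform.python_implementation() == 'PyPy'
_SMALL_INPUT_THRESHOLD = 1024


def decode_dispatch(encoded_uri, unquote_plus=True):
    # On CPython, plain += concatenation wins for small inputs.
    if not _IS_PYPY and len(encoded_uri) < _SMALL_INPUT_THRESHOLD:
        return decode_old(encoded_uri, unquote_plus)
    # On PyPy, or for large inputs, avoid quadratic bytes concatenation.
    return decode_bytearray(encoded_uri, unquote_plus)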
It's really slow to invoke req._parse_form_urlencoded() when the form body is huge. After digging deeper, I found it was caused by the function decode(encoded_uri, unquote_plus=True) defined in falcon/util/uri.py. The root cause is the poor performance of repeated string concatenation. A small refactor can improve performance greatly. The code before:
The optimized code:
And the time cost before optimizing:

python3.7 test.py
2019-10-27 23:25:38,030 urllib3.connectionpool: DEBUG Starting new HTTP connection (1): 127.0.0.1:8080
2019-10-27 23:26:04,819 urllib3.connectionpool: DEBUG http://127.0.0.1:8080 "POST /api/import/postdata/async HTTP/1.1" 200 146
2019-10-27 23:26:04,820 main : DEBUG post data return: {"status":"ok","error_msg":"","successRecords":[],"failedRecords":[],"totalSuccessRecords":500,"totalFailedRecords":0,"timeElapsed":26.7830269337}
2019-10-27 23:26:04,821 main : DEBUG time elapsed: 27.266945600509644

After optimizing:

python3.7 test.py
2019-10-27 23:54:09,490 urllib3.connectionpool: DEBUG Starting new HTTP connection (1): 127.0.0.1:8080
2019-10-27 23:54:09,791 urllib3.connectionpool: DEBUG http://127.0.0.1:8080 "POST /api/import/postdata/async HTTP/1.1" 200 145
2019-10-27 23:54:09,794 main : DEBUG post data return: {"status":"ok","error_msg":"","successRecords":[],"failedRecords":[],"totalSuccessRecords":500,"totalFailedRecords":0,"timeElapsed":0.2936162949}
2019-10-27 23:54:09,794 main : DEBUG time elapsed: 0.862856388092041
As you can see, there is a great improvement.