Closed jpmckinney closed 1 year ago
@jpmckinney all good questions! So: yajl2_c runs slower in PyPy than in CPython, but I don't know how it fares with respect to the other backends. If you are willing to put some numbers in I'd be interested to see them, and potentially change the default order in which backends are loaded to determine the default on PyPy. You can use the benchmark.py utility in the top-level directory to run one of the built-in synthetic scenarios, or against your own JSON files.
Thanks!
That jogged my memory a bit – I do something unusual in my code, where I build a dict in which some values are generators. I then use this code when I need to serialize the dict to JSON.
Somewhere in there, the combination of generators and ijson caused a C error.
Anyway, I'm trying to reproduce it now, but I can't get pip to find the YAJL headers when using PyPy (I can use the yajl2_c backend in CPython, but I think it's included in the wheel). python -c 'import ijson; print(ijson.backend)' just returns 'python' in my PyPy environment.
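Since ijson.backend only reports the backend that was picked by default, it can help to check every backend individually. The following is a minimal sketch, assuming the backends live as modules under ijson.backends and are named as in the benchmark output in this thread; the helper name importable_backends is made up for illustration and is not part of ijson.

```python
import importlib

# Backend names as they appear in the benchmark output in this thread.
BACKENDS = ("yajl2_c", "yajl2_cffi", "yajl2", "python")


def importable_backends(names=BACKENDS, package="ijson.backends"):
    """Map each backend name to True if its module imports cleanly.

    Hypothetical helper, not part of ijson. A backend counts as
    unavailable if importing it fails for any reason (ijson itself
    missing, compiled extension missing, libyajl not found, ...).
    """
    available = {}
    for name in names:
        try:
            importlib.import_module(f"{package}.{name}")
        except Exception:
            available[name] = False
        else:
            available[name] = True
    return available


if __name__ == "__main__":
    for name, ok in importable_backends().items():
        print(f"{name:12s} {'available' if ok else 'unavailable'}")
```

Running this under both interpreters shows at a glance which backends each environment can actually load, which is more informative than the single default name.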
That's interesting about the backends available to you. I just double-checked one of the latest wheels I published the other day for ijson 3.2.0 under https://pypi.org/project/ijson/#files (pypy39, manylinux, x86_64) and it contained both the compiled yajl library and the yajl2_c backend. I also gave it a quick whirl:
$ sudo apt install pypy3-venv
$ pypy3 -mvenv lala
$ source lala/bin/activate
(lala) $ pypy -c 'import ijson; print(ijson.backend)'
yajl2_c
Boom!
And as a tiny benchmark:
(lala) $ cp ~/scm/git/ijson/benchmark.py . # otherwise it uses *that* copy of ijson and doesn't load all backends properly
(lala) $ pypy benchmark.py
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.036, 5.326
0.191, basic_parse, long_list, yajl2, 0.196, 0.973
0.191, basic_parse, long_list, yajl2_cffi, 0.030, 6.262
0.191, basic_parse, long_list, yajl2_c, 0.062, 3.061
1.886, basic_parse, big_int_object, python, 0.107, 17.704
1.886, basic_parse, big_int_object, yajl2, 0.319, 5.905
1.886, basic_parse, big_int_object, yajl2_cffi, 0.054, 35.115
1.886, basic_parse, big_int_object, yajl2_c, 0.146, 12.930
2.077, basic_parse, big_decimal_object, python, 0.236, 8.783
2.077, basic_parse, big_decimal_object, yajl2, 0.379, 5.475
2.077, basic_parse, big_decimal_object, yajl2_cffi, 0.100, 20.775
2.077, basic_parse, big_decimal_object, yajl2_c, 0.332, 6.248
1.801, basic_parse, big_null_object, python, 0.094, 19.090
1.801, basic_parse, big_null_object, yajl2, 0.273, 6.598
1.801, basic_parse, big_null_object, yajl2_cffi, 0.040, 44.615
1.801, basic_parse, big_null_object, yajl2_c, 0.101, 17.829
1.849, basic_parse, big_bool_object, python, 0.078, 23.842
1.849, basic_parse, big_bool_object, yajl2, 0.288, 6.426
1.849, basic_parse, big_bool_object, yajl2_cffi, 0.044, 42.343
1.849, basic_parse, big_bool_object, yajl2_c, 0.096, 19.163
2.649, basic_parse, big_str_object, python, 0.095, 27.807
2.649, basic_parse, big_str_object, yajl2, 0.353, 7.501
2.649, basic_parse, big_str_object, yajl2_cffi, 0.057, 46.466
2.649, basic_parse, big_str_object, yajl2_c, 0.147, 18.059
8.000, basic_parse, big_longstr_object, python, 0.146, 54.769
8.000, basic_parse, big_longstr_object, yajl2, 0.480, 16.654
8.000, basic_parse, big_longstr_object, yajl2_cffi, 0.057, 141.468
8.000, basic_parse, big_longstr_object, yajl2_c, 0.164, 48.791
19.264, basic_parse, object_with_10_keys, python, 0.764, 25.209
19.264, basic_parse, object_with_10_keys, yajl2, 3.049, 6.318
19.264, basic_parse, object_with_10_keys, yajl2_cffi, 0.461, 41.819
19.264, basic_parse, object_with_10_keys, yajl2_c, 1.902, 10.128
0.381, basic_parse, empty_lists, python, 0.036, 10.482
0.381, basic_parse, empty_lists, yajl2, 0.113, 3.375
0.381, basic_parse, empty_lists, yajl2_cffi, 0.026, 14.803
0.381, basic_parse, empty_lists, yajl2_c, 0.051, 7.532
0.381, basic_parse, empty_objects, python, 0.021, 18.226
0.381, basic_parse, empty_objects, yajl2, 0.282, 1.355
0.381, basic_parse, empty_objects, yajl2_cffi, 0.022, 17.367
0.381, basic_parse, empty_objects, yajl2_c, 0.050, 7.614
So cffi seems to be the winner in this case.
It'd be good to see more evidence before settling on a natural order in which we can recommend these backends under PyPy.
For reference, this is the same benchmark with CPython 3.10:
(ijson) $ python benchmark.py
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.154, 1.235
0.191, basic_parse, long_list, yajl2, 0.091, 2.093
0.191, basic_parse, long_list, yajl2_cffi, 0.089, 2.154
0.191, basic_parse, long_list, yajl2_c, 0.008, 24.960
1.886, basic_parse, big_int_object, python, 0.327, 5.764
1.886, basic_parse, big_int_object, yajl2, 0.177, 10.642
1.886, basic_parse, big_int_object, yajl2_cffi, 0.167, 11.311
1.886, basic_parse, big_int_object, yajl2_c, 0.017, 107.875
2.077, basic_parse, big_decimal_object, python, 0.343, 6.053
2.077, basic_parse, big_decimal_object, yajl2, 0.192, 10.839
2.077, basic_parse, big_decimal_object, yajl2_cffi, 0.177, 11.746
2.077, basic_parse, big_decimal_object, yajl2_c, 0.028, 74.584
1.801, basic_parse, big_null_object, python, 0.270, 6.667
1.801, basic_parse, big_null_object, yajl2, 0.101, 17.869
1.801, basic_parse, big_null_object, yajl2_cffi, 0.111, 16.208
1.801, basic_parse, big_null_object, yajl2_c, 0.014, 131.166
1.849, basic_parse, big_bool_object, python, 0.272, 6.803
1.849, basic_parse, big_bool_object, yajl2, 0.106, 17.429
1.849, basic_parse, big_bool_object, yajl2_cffi, 0.117, 15.738
1.849, basic_parse, big_bool_object, yajl2_c, 0.026, 70.817
2.649, basic_parse, big_str_object, python, 0.312, 8.488
2.649, basic_parse, big_str_object, yajl2, 0.151, 17.525
2.649, basic_parse, big_str_object, yajl2_cffi, 0.142, 18.710
2.649, basic_parse, big_str_object, yajl2_c, 0.016, 163.509
8.000, basic_parse, big_longstr_object, python, 0.323, 24.801
8.000, basic_parse, big_longstr_object, yajl2, 0.153, 52.134
8.000, basic_parse, big_longstr_object, yajl2_cffi, 0.143, 56.138
8.000, basic_parse, big_longstr_object, yajl2_c, 0.016, 510.421
19.264, basic_parse, object_with_10_keys, python, 3.236, 5.954
19.264, basic_parse, object_with_10_keys, yajl2, 1.582, 12.178
19.264, basic_parse, object_with_10_keys, yajl2_cffi, 1.490, 12.932
19.264, basic_parse, object_with_10_keys, yajl2_c, 0.168, 114.446
0.381, basic_parse, empty_lists, python, 0.159, 2.398
0.381, basic_parse, empty_lists, yajl2, 0.041, 9.251
0.381, basic_parse, empty_lists, yajl2_cffi, 0.073, 5.217
0.381, basic_parse, empty_lists, yajl2_c, 0.010, 36.912
0.381, basic_parse, empty_objects, python, 0.160, 2.390
0.381, basic_parse, empty_objects, yajl2, 0.041, 9.342
0.381, basic_parse, empty_objects, yajl2_cffi, 0.073, 5.203
0.381, basic_parse, empty_objects, yajl2_c, 0.010, 36.672
Ah, I'm on macOS arm64, so that might be the reason – there's no arm64 wheel for PyPy on macOS.
So it looks like on PyPy (on that benchmark): yajl2_cffi > python > yajl2_c > yajl2.
That said, yajl2_c on CPython seems fastest all around.
Yes, that seems to be more or less the order. Still, I'd hesitate to make a decision based on those numbers alone; if you (or someone else) could provide more real-life numbers it'd be great -- things might be different on macOS arm64, for example.
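For anyone who wants to contribute numbers, here is a rough sketch of how benchmark.py output could be turned into a ranking. It assumes the six-column CSV layout shown in this thread and uses the geometric mean of mb_per_sec across test cases, which is just one reasonable choice of summary; the function name rank_backends is made up for illustration. The sample rows are copied from the PyPy run in this thread and should be replaced with your own full output.

```python
import csv
from collections import defaultdict
from statistics import geometric_mean

# A few rows copied from the PyPy run in this thread; paste your own
# benchmark.py output here instead.
PYPY_OUTPUT = """\
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.036, 5.326
0.191, basic_parse, long_list, yajl2, 0.196, 0.973
0.191, basic_parse, long_list, yajl2_cffi, 0.030, 6.262
0.191, basic_parse, long_list, yajl2_c, 0.062, 3.061
1.886, basic_parse, big_int_object, python, 0.107, 17.704
1.886, basic_parse, big_int_object, yajl2, 0.319, 5.905
1.886, basic_parse, big_int_object, yajl2_cffi, 0.054, 35.115
1.886, basic_parse, big_int_object, yajl2_c, 0.146, 12.930
8.000, basic_parse, big_longstr_object, python, 0.146, 54.769
8.000, basic_parse, big_longstr_object, yajl2, 0.480, 16.654
8.000, basic_parse, big_longstr_object, yajl2_cffi, 0.057, 141.468
8.000, basic_parse, big_longstr_object, yajl2_c, 0.164, 48.791
"""


def rank_backends(benchmark_output):
    """Return (backend, geometric-mean MB/s) pairs, fastest first."""
    speeds = defaultdict(list)
    # Skip blank lines and the '#'-prefixed header line.
    lines = [
        line for line in benchmark_output.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    for row in csv.reader(lines, skipinitialspace=True):
        _mbytes, _method, _test_case, backend, _seconds, mb_per_sec = row
        speeds[backend].append(float(mb_per_sec))
    ranking = [(name, geometric_mean(values)) for name, values in speeds.items()]
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, speed in rank_backends(PYPY_OUTPUT):
        print(f"{name:12s} {speed:8.2f} MB/s")
```

The geometric mean keeps a single very fast case (such as big_longstr_object) from dominating the summary, but per-test-case comparisons are still worth looking at before changing any defaults.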
I probably won't be able to, as I can't figure out how to make ijson find YAJL headers on PyPy. Feel free to close the issue.
OK, thanks for the feedback! I'll close this now, but this issue should be a good reference for future PyPy users.
I think yajl2_cffi worked for me, last time I tested, but yajl2_c was causing C errors.
The docs mention "python: pure Python parser, good to use with PyPy"
Do you happen to know the difference in performance between a YAJL and pure-Python backend when using PyPy?
Also, should the backend selection code use different options/ordering on PyPy?
https://github.com/ICRAR/ijson/blob/a62c4b35d58775fbedd0308b4685f1b497a7a917/ijson/__init__.py#L30
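To make the ordering question concrete, here is a hypothetical sketch of interpreter-dependent backend selection. It is not ijson's actual code: both orderings, the constant names, and the helper functions are assumptions for illustration, and the PyPy order reflects only the single benchmark run reported in this thread.

```python
import importlib
import platform

# Hypothetical orderings, not taken from ijson's source.
# CPython: prefer the C extension. PyPy: prefer cffi, then pure Python.
CPYTHON_ORDER = ("yajl2_c", "yajl2_cffi", "yajl2", "python")
PYPY_ORDER = ("yajl2_cffi", "python", "yajl2_c", "yajl2")


def preferred_order(implementation=None):
    """Pick a backend ordering for the given (or current) interpreter."""
    implementation = implementation or platform.python_implementation()
    return PYPY_ORDER if implementation == "PyPy" else CPYTHON_ORDER


def first_importable(names, package="ijson.backends"):
    """Import and return the first module in `names` that loads."""
    for name in names:
        try:
            return importlib.import_module(f"{package}.{name}")
        except ImportError:
            continue
    raise ImportError(f"none of {names!r} could be imported from {package}")


if __name__ == "__main__":
    backend = first_importable(preferred_order())
    print(backend.__name__)
```

Keeping the ordering in plain tuples would make it easy to adjust once more real-world numbers are available, for example from macOS arm64.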