ICRAR / ijson

Iterative JSON parser with Pythonic interfaces
http://pypi.python.org/pypi/ijson/

Is the yajl2_c backend supported on PyPy? #82

Closed jpmckinney closed 1 year ago

jpmckinney commented 1 year ago

I think yajl2_cffi worked for me the last time I tested, but yajl2_c was causing C errors.

The docs mention "python: pure Python parser, good to use with PyPy"

Do you happen to know the difference in performance between a YAJL and pure-Python backend when using PyPy?

Also, should the backend selection code use different options/ordering on PyPy?

https://github.com/ICRAR/ijson/blob/a62c4b35d58775fbedd0308b4685f1b497a7a917/ijson/__init__.py#L30

rtobar commented 1 year ago

@jpmckinney all good questions! So:

jpmckinney commented 1 year ago

Thanks!

That jogged my memory a bit – I do something unusual in my code, where I build a dict in which some values are generators. I then use this code when I need to serialize the dict to JSON.

https://github.com/open-contracting/ocdskit/blob/9984b80b524c0a57222f10f76a209bc906c09799/ocdskit/util.py#L40-L68

Somewhere in there, the combination of generators and ijson caused a C error.

Anyway, I'm trying to reproduce it now, but I can't get pip to find the YAJL headers when using PyPy (I can use the yajl2_c backend in CPython, but I think it's included in the wheel). python -c 'import ijson; print(ijson.backend)' just prints 'python' in my PyPy environment.
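For anyone debugging the same situation, a small probe can show which backends import and why the others do not. This is only a sketch: the `ijson.backends.<name>` module path and the `probe_backends` helper are assumptions made for illustration, and the backend names are the ones that appear in the benchmark output in this thread.

```python
import importlib
import platform
import sys

# Backend names as they appear in the benchmark output in this thread.
BACKENDS = ("yajl2_c", "yajl2_cffi", "yajl2", "python")


def probe_backends(names=BACKENDS, loader=importlib.import_module):
    """Map each backend name to None if it imports, else to the error text.

    Assumption: backends live under ``ijson.backends.<name>``.
    """
    report = {}
    for name in names:
        try:
            loader("ijson.backends." + name)
            report[name] = None
        except Exception as exc:  # ImportError is typical; report anything else too
            report[name] = f"{type(exc).__name__}: {exc}"
    return report


if __name__ == "__main__":
    # The interpreter and architecture matter because wheels are built per platform.
    print(platform.python_implementation(), sys.platform, platform.machine())
    for name, error in probe_backends().items():
        print(f"{name:12} {'ok' if error is None else error}")
```

Running it under both CPython and PyPy on the same machine should make it clear whether a compiled backend is missing or failing to load.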

rtobar commented 1 year ago

That's interesting about the backends available to you. I just double-checked one of the latest wheels I published the other day for ijson 3.2.0 at https://pypi.org/project/ijson/#files (pypy39, manylinux, x86_64), and it contained both the compiled yajl library and the yajl2_c backend. I also gave it a quick whirl:

$ sudo apt install pypy3-venv
$ pypy3 -m venv lala
$ source lala/bin/activate
(lala) $ pypy -c 'import ijson; print(ijson.backend)'
yajl2_c

Boom!

And as a tiny benchmark:

(lala) $ cp ~/scm/git/ijson/benchmark.py . # otherwise it uses *that* copy of ijson and doesn't load all backends properly
(lala) $ pypy benchmark.py 
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.036, 5.326
0.191, basic_parse, long_list, yajl2, 0.196, 0.973
0.191, basic_parse, long_list, yajl2_cffi, 0.030, 6.262
0.191, basic_parse, long_list, yajl2_c, 0.062, 3.061
1.886, basic_parse, big_int_object, python, 0.107, 17.704
1.886, basic_parse, big_int_object, yajl2, 0.319, 5.905
1.886, basic_parse, big_int_object, yajl2_cffi, 0.054, 35.115
1.886, basic_parse, big_int_object, yajl2_c, 0.146, 12.930
2.077, basic_parse, big_decimal_object, python, 0.236, 8.783
2.077, basic_parse, big_decimal_object, yajl2, 0.379, 5.475
2.077, basic_parse, big_decimal_object, yajl2_cffi, 0.100, 20.775
2.077, basic_parse, big_decimal_object, yajl2_c, 0.332, 6.248
1.801, basic_parse, big_null_object, python, 0.094, 19.090
1.801, basic_parse, big_null_object, yajl2, 0.273, 6.598
1.801, basic_parse, big_null_object, yajl2_cffi, 0.040, 44.615
1.801, basic_parse, big_null_object, yajl2_c, 0.101, 17.829
1.849, basic_parse, big_bool_object, python, 0.078, 23.842
1.849, basic_parse, big_bool_object, yajl2, 0.288, 6.426
1.849, basic_parse, big_bool_object, yajl2_cffi, 0.044, 42.343
1.849, basic_parse, big_bool_object, yajl2_c, 0.096, 19.163
2.649, basic_parse, big_str_object, python, 0.095, 27.807
2.649, basic_parse, big_str_object, yajl2, 0.353, 7.501
2.649, basic_parse, big_str_object, yajl2_cffi, 0.057, 46.466
2.649, basic_parse, big_str_object, yajl2_c, 0.147, 18.059
8.000, basic_parse, big_longstr_object, python, 0.146, 54.769
8.000, basic_parse, big_longstr_object, yajl2, 0.480, 16.654
8.000, basic_parse, big_longstr_object, yajl2_cffi, 0.057, 141.468
8.000, basic_parse, big_longstr_object, yajl2_c, 0.164, 48.791
19.264, basic_parse, object_with_10_keys, python, 0.764, 25.209
19.264, basic_parse, object_with_10_keys, yajl2, 3.049, 6.318
19.264, basic_parse, object_with_10_keys, yajl2_cffi, 0.461, 41.819
19.264, basic_parse, object_with_10_keys, yajl2_c, 1.902, 10.128
0.381, basic_parse, empty_lists, python, 0.036, 10.482
0.381, basic_parse, empty_lists, yajl2, 0.113, 3.375
0.381, basic_parse, empty_lists, yajl2_cffi, 0.026, 14.803
0.381, basic_parse, empty_lists, yajl2_c, 0.051, 7.532
0.381, basic_parse, empty_objects, python, 0.021, 18.226
0.381, basic_parse, empty_objects, yajl2, 0.282, 1.355
0.381, basic_parse, empty_objects, yajl2_cffi, 0.022, 17.367
0.381, basic_parse, empty_objects, yajl2_c, 0.050, 7.614

So cffi seems to be the winner in this case.

It'd be good to see more evidence before settling on an order in which we can recommend these backends under PyPy.
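One way to turn benchmark output like the above into a single ranking is to take the geometric mean of `mb_per_sec` per backend. The sketch below assumes the CSV layout shown in this thread (a `#`-prefixed header, backend in the fourth column, throughput in the sixth); `rank_backends` is a hypothetical helper and not part of ijson or its benchmark script.

```python
import csv
import math
from collections import defaultdict


def rank_backends(benchmark_output):
    """Return (backend, geometric-mean MB/s) pairs, fastest first."""
    log_sums = defaultdict(float)
    counts = defaultdict(int)
    lines = benchmark_output.strip().splitlines()
    for row in csv.reader(lines, skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        backend, mb_per_sec = row[3], float(row[5])
        log_sums[backend] += math.log(mb_per_sec)
        counts[backend] += 1
    means = {b: math.exp(log_sums[b] / counts[b]) for b in counts}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# First two test cases of the PyPy run quoted in this thread.
sample = """\
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.036, 5.326
0.191, basic_parse, long_list, yajl2, 0.196, 0.973
0.191, basic_parse, long_list, yajl2_cffi, 0.030, 6.262
0.191, basic_parse, long_list, yajl2_c, 0.062, 3.061
1.886, basic_parse, big_int_object, python, 0.107, 17.704
1.886, basic_parse, big_int_object, yajl2, 0.319, 5.905
1.886, basic_parse, big_int_object, yajl2_cffi, 0.054, 35.115
1.886, basic_parse, big_int_object, yajl2_c, 0.146, 12.930
"""

for backend, mean in rank_backends(sample):
    print(f"{backend}: {mean:.1f} MB/s")
```

On those eight rows the ranking comes out as yajl2_cffi, python, yajl2_c, yajl2. A geometric mean is used so that the test cases with very high throughput (such as big_longstr_object) do not dominate the average.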

For reference, this is the same benchmark with CPython 3.10:

(ijson) $ python benchmark.py 
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.154, 1.235
0.191, basic_parse, long_list, yajl2, 0.091, 2.093
0.191, basic_parse, long_list, yajl2_cffi, 0.089, 2.154
0.191, basic_parse, long_list, yajl2_c, 0.008, 24.960
1.886, basic_parse, big_int_object, python, 0.327, 5.764
1.886, basic_parse, big_int_object, yajl2, 0.177, 10.642
1.886, basic_parse, big_int_object, yajl2_cffi, 0.167, 11.311
1.886, basic_parse, big_int_object, yajl2_c, 0.017, 107.875
2.077, basic_parse, big_decimal_object, python, 0.343, 6.053
2.077, basic_parse, big_decimal_object, yajl2, 0.192, 10.839
2.077, basic_parse, big_decimal_object, yajl2_cffi, 0.177, 11.746
2.077, basic_parse, big_decimal_object, yajl2_c, 0.028, 74.584
1.801, basic_parse, big_null_object, python, 0.270, 6.667
1.801, basic_parse, big_null_object, yajl2, 0.101, 17.869
1.801, basic_parse, big_null_object, yajl2_cffi, 0.111, 16.208
1.801, basic_parse, big_null_object, yajl2_c, 0.014, 131.166
1.849, basic_parse, big_bool_object, python, 0.272, 6.803
1.849, basic_parse, big_bool_object, yajl2, 0.106, 17.429
1.849, basic_parse, big_bool_object, yajl2_cffi, 0.117, 15.738
1.849, basic_parse, big_bool_object, yajl2_c, 0.026, 70.817
2.649, basic_parse, big_str_object, python, 0.312, 8.488
2.649, basic_parse, big_str_object, yajl2, 0.151, 17.525
2.649, basic_parse, big_str_object, yajl2_cffi, 0.142, 18.710
2.649, basic_parse, big_str_object, yajl2_c, 0.016, 163.509
8.000, basic_parse, big_longstr_object, python, 0.323, 24.801
8.000, basic_parse, big_longstr_object, yajl2, 0.153, 52.134
8.000, basic_parse, big_longstr_object, yajl2_cffi, 0.143, 56.138
8.000, basic_parse, big_longstr_object, yajl2_c, 0.016, 510.421
19.264, basic_parse, object_with_10_keys, python, 3.236, 5.954
19.264, basic_parse, object_with_10_keys, yajl2, 1.582, 12.178
19.264, basic_parse, object_with_10_keys, yajl2_cffi, 1.490, 12.932
19.264, basic_parse, object_with_10_keys, yajl2_c, 0.168, 114.446
0.381, basic_parse, empty_lists, python, 0.159, 2.398
0.381, basic_parse, empty_lists, yajl2, 0.041, 9.251
0.381, basic_parse, empty_lists, yajl2_cffi, 0.073, 5.217
0.381, basic_parse, empty_lists, yajl2_c, 0.010, 36.912
0.381, basic_parse, empty_objects, python, 0.160, 2.390
0.381, basic_parse, empty_objects, yajl2, 0.041, 9.342
0.381, basic_parse, empty_objects, yajl2_cffi, 0.073, 5.203
0.381, basic_parse, empty_objects, yajl2_c, 0.010, 36.672

jpmckinney commented 1 year ago

Ah, I'm on macOS arm64, so that might be the reason – there's no arm64 wheel for PyPy on macOS.

So it looks like on PyPy (on that benchmark): yajl2_cffi > python > yajl2_c > yajl2.

That said, yajl2_c on CPython seems fastest all around.
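Until there is a recommended order, an application can pick its own backend depending on the interpreter. The following is a minimal sketch, not ijson's selection code: both preference tuples are assumptions drawn only from the two benchmark runs in this thread (yajl2_c is placed ahead of yajl2 on PyPy because it shows a higher mb_per_sec in every PyPy row), and `pick_backend` plus the `ijson.backends.<name>` import path are illustrative.

```python
import importlib
import platform

# Assumed preference orders, based solely on the benchmark numbers in this thread.
PYPY_ORDER = ("yajl2_cffi", "python", "yajl2_c", "yajl2")
CPYTHON_ORDER = ("yajl2_c", "yajl2_cffi", "yajl2", "python")


def _import_backend(name):
    # Assumption: backends are importable as ijson.backends.<name>.
    return importlib.import_module("ijson.backends." + name)


def pick_backend(implementation=None, loader=_import_backend):
    """Return (name, module) for the first backend that imports."""
    if implementation is None:
        implementation = platform.python_implementation()
    order = PYPY_ORDER if implementation == "PyPy" else CPYTHON_ORDER
    for name in order:
        try:
            return name, loader(name)
        except ImportError:
            continue  # backend not available here, try the next one
    raise ImportError("no ijson backend could be imported")


if __name__ == "__main__":
    try:
        name, _module = pick_backend()
        print("selected backend:", name)
    except ImportError as exc:
        print("ijson is not usable here:", exc)
```

The `loader` parameter exists only so the order logic can be exercised without ijson installed; in real use the default is enough.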

rtobar commented 1 year ago

Yes, that seems to be more or less the order. Still, I'd hesitate to make a decision based on those numbers alone; if you (or someone else) could provide more real-life numbers it'd be great. Things might be different on macOS arm64, for example.

jpmckinney commented 1 year ago

I probably won't be able to, as I can't figure out how to make ijson find YAJL headers on PyPy. Feel free to close the issue.

rtobar commented 1 year ago

OK, thanks for the feedback! I'll close this now, but this issue should be a good reference for future PyPy users.