Is the yajl_c backend supported on PyPy?

jpmckinney commented 1 year ago

I think yajl2_cffi worked for me, last time I tested, but yajl2_c was causing C errors.

The docs mention "python: pure Python parser, good to use with PyPy"

Do you happen to know the difference in performance between a YAJL and pure-Python backend when using PyPy?

Also, should the backend selection code use different options/ordering on PyPy?

https://github.com/ICRAR/ijson/blob/a62c4b35d58775fbedd0308b4685f1b497a7a917/ijson/__init__.py#L30

rtobar commented 1 year ago

@jpmckinney all good questions! So:

I know it compiles with PyPy (we even publish binary wheels provided by cibuildwheel). About running it: I'm not a PyPy user myself, so I've only ever tried it sporadically to check that the tests run, and they have. I don't do it very often though, so there could be issues that I don't know of (but the CI tests that run when building the wheels always pass with PyPy).
I don't know about performance, I've never properly measured it (again, not being a PyPy user). Since the Python C API implementation for PyPy is slower than for CPython, I'd assume yajl2_c runs slower in PyPy than in CPython, but I don't know how it fares with respect to the other backends. If you are willing to put some numbers in I'd be interested to see them, and potentially change the default order in which backends are loaded to determine the default on PyPy. You can use the benchmark.py utility in the top-level directory to run one of the built-in synthetic scenarios, or against your own JSON files.
I think the "python: pure Python parser, good to use with PyPy" phrase is mostly a historical relic, as all backends should be good to use with PyPy really. Based on the benchmark results it could still be that this is the fastest in PyPy though.
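For anyone who wants rough numbers without benchmark.py, a minimal timing sketch follows. It is not the project's benchmark: the helper name throughput_mb_s, the synthetic payload, and the choice of 2**20 bytes per MB are assumptions made for illustration. It relies only on each backend module exposing basic_parse, and it silently skips backends that fail to import.

```python
import importlib
import io
import time


def throughput_mb_s(basic_parse, payload, repeat=3):
    """Best-of-`repeat` throughput (MB/s) of draining all events for `payload`.

    `basic_parse` is any callable that takes a binary file-like object and
    yields parse events, e.g. a backend's basic_parse function.
    """
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for _event in basic_parse(io.BytesIO(payload)):
            pass  # only the parsing cost matters here
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the timer on tiny payloads.
    return len(payload) / 2**20 / max(best, 1e-9)


if __name__ == "__main__":
    # Hypothetical payload: one long JSON list of integers.
    payload = b"[" + b",".join(b"%d" % i for i in range(100_000)) + b"]"
    for name in ("python", "yajl2", "yajl2_cffi", "yajl2_c"):
        try:
            backend = importlib.import_module("ijson.backends." + name)
        except (ImportError, OSError):
            continue  # backend not usable in this environment
        print(name, round(throughput_mb_s(backend.basic_parse, payload), 3))
```

Running the same script under both `pypy` and `python` gives a like-for-like comparison on one machine; real JSON files will be more representative than the synthetic list.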

jpmckinney commented 1 year ago

Thanks!

That jogged my memory a bit – I do something unusual in my code, where I build a dict in which some values are generators. I then use this code when I need to serialize the dict to JSON.

https://github.com/open-contracting/ocdskit/blob/9984b80b524c0a57222f10f76a209bc906c09799/ocdskit/util.py#L40-L68

Somewhere in there, the combination of generators and ijson caused a C error.

Anyway, I'm trying to reproduce it now, but I can't get pip to find YAJL headers when using PyPy (I can use the yajl2_c backend in CPython, but I think it's included in the wheel). python -c 'import ijson; print(ijson.backend)' just prints 'python' in my PyPy environment.
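A quick way to see which backends a given interpreter can load, rather than only the default one, is to try importing each in turn. The sketch below assumes the backends are importable as ijson.backends.python, ijson.backends.yajl2, ijson.backends.yajl2_cffi and ijson.backends.yajl2_c; the helper name loadable_backends is made up for illustration.

```python
import importlib

# Backend names as they appear in this thread; the module path is an assumption.
BACKENDS = ("python", "yajl2", "yajl2_cffi", "yajl2_c")


def loadable_backends(names=BACKENDS):
    """Map each backend name to "ok" or a short reason why it failed to import."""
    status = {}
    for name in names:
        try:
            importlib.import_module("ijson.backends." + name)
        except Exception as exc:  # missing module, missing libyajl, missing cffi...
            status[name] = f"unavailable ({type(exc).__name__}: {exc})"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    for name, state in loadable_backends().items():
        print(f"{name:12} {state}")
```

If only python shows up as ok, the installed package most likely came from a source build that could not find YAJL, not from a binary wheel.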

rtobar commented 1 year ago

That's interesting about the backends available to you. I just double-checked one of the latest wheels I published the other day for ijson 3.2.0 under https://pypi.org/project/ijson/#files (pypy39, manylinux, x86_64), and it contained both the compiled yajl library and the yajl2_c backend. I also gave it a quick whirl:

$ sudo apt install pypy3-venv
$ pypy3 -m venv lala
$ source lala/bin/activate
(lala) $ pypy -c 'import ijson; print(ijson.backend)'
yajl2_c

Boom!

And as a tiny benchmark:

(lala) $ cp ~/scm/git/ijson/benchmark.py . # otherwise it uses *that* copy of ijson and doesn't load all backends properly
(lala) $ pypy benchmark.py 
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.036, 5.326
0.191, basic_parse, long_list, yajl2, 0.196, 0.973
0.191, basic_parse, long_list, yajl2_cffi, 0.030, 6.262
0.191, basic_parse, long_list, yajl2_c, 0.062, 3.061
1.886, basic_parse, big_int_object, python, 0.107, 17.704
1.886, basic_parse, big_int_object, yajl2, 0.319, 5.905
1.886, basic_parse, big_int_object, yajl2_cffi, 0.054, 35.115
1.886, basic_parse, big_int_object, yajl2_c, 0.146, 12.930
2.077, basic_parse, big_decimal_object, python, 0.236, 8.783
2.077, basic_parse, big_decimal_object, yajl2, 0.379, 5.475
2.077, basic_parse, big_decimal_object, yajl2_cffi, 0.100, 20.775
2.077, basic_parse, big_decimal_object, yajl2_c, 0.332, 6.248
1.801, basic_parse, big_null_object, python, 0.094, 19.090
1.801, basic_parse, big_null_object, yajl2, 0.273, 6.598
1.801, basic_parse, big_null_object, yajl2_cffi, 0.040, 44.615
1.801, basic_parse, big_null_object, yajl2_c, 0.101, 17.829
1.849, basic_parse, big_bool_object, python, 0.078, 23.842
1.849, basic_parse, big_bool_object, yajl2, 0.288, 6.426
1.849, basic_parse, big_bool_object, yajl2_cffi, 0.044, 42.343
1.849, basic_parse, big_bool_object, yajl2_c, 0.096, 19.163
2.649, basic_parse, big_str_object, python, 0.095, 27.807
2.649, basic_parse, big_str_object, yajl2, 0.353, 7.501
2.649, basic_parse, big_str_object, yajl2_cffi, 0.057, 46.466
2.649, basic_parse, big_str_object, yajl2_c, 0.147, 18.059
8.000, basic_parse, big_longstr_object, python, 0.146, 54.769
8.000, basic_parse, big_longstr_object, yajl2, 0.480, 16.654
8.000, basic_parse, big_longstr_object, yajl2_cffi, 0.057, 141.468
8.000, basic_parse, big_longstr_object, yajl2_c, 0.164, 48.791
19.264, basic_parse, object_with_10_keys, python, 0.764, 25.209
19.264, basic_parse, object_with_10_keys, yajl2, 3.049, 6.318
19.264, basic_parse, object_with_10_keys, yajl2_cffi, 0.461, 41.819
19.264, basic_parse, object_with_10_keys, yajl2_c, 1.902, 10.128
0.381, basic_parse, empty_lists, python, 0.036, 10.482
0.381, basic_parse, empty_lists, yajl2, 0.113, 3.375
0.381, basic_parse, empty_lists, yajl2_cffi, 0.026, 14.803
0.381, basic_parse, empty_lists, yajl2_c, 0.051, 7.532
0.381, basic_parse, empty_objects, python, 0.021, 18.226
0.381, basic_parse, empty_objects, yajl2, 0.282, 1.355
0.381, basic_parse, empty_objects, yajl2_cffi, 0.022, 17.367
0.381, basic_parse, empty_objects, yajl2_c, 0.050, 7.614

So cffi seems to be the winner in this case.

It'd be good to see more evidence before settling on an order in which to recommend these backends under PyPy.
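One way to turn a benchmark.py report into an ordering is to average mb_per_sec per backend and sort. The sketch below is an illustration only: the function name rank_backends is hypothetical, the arithmetic mean is one choice among several, and SAMPLE holds just two scenarios copied from the PyPy run above.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Two scenarios copied from the PyPy output in this thread.
SAMPLE = """\
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.036, 5.326
0.191, basic_parse, long_list, yajl2, 0.196, 0.973
0.191, basic_parse, long_list, yajl2_cffi, 0.030, 6.262
0.191, basic_parse, long_list, yajl2_c, 0.062, 3.061
19.264, basic_parse, object_with_10_keys, python, 0.764, 25.209
19.264, basic_parse, object_with_10_keys, yajl2, 3.049, 6.318
19.264, basic_parse, object_with_10_keys, yajl2_cffi, 0.461, 41.819
19.264, basic_parse, object_with_10_keys, yajl2_c, 1.902, 10.128
"""


def rank_backends(report):
    """Return backend names sorted from fastest to slowest mean mb_per_sec."""
    # Drop the leading '#' so the first line can serve as the CSV header.
    rows = csv.DictReader(io.StringIO(report.lstrip("#")), skipinitialspace=True)
    speeds = defaultdict(list)
    for row in rows:
        speeds[row["backend"]].append(float(row["mb_per_sec"]))
    return sorted(speeds, key=lambda name: mean(speeds[name]), reverse=True)


if __name__ == "__main__":
    print(" > ".join(rank_backends(SAMPLE)))
```

On these two scenarios the ranking comes out as yajl2_cffi, python, yajl2_c, yajl2. Feeding in a full report, or reports from several machines, would show how stable that order is.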

For reference, this is the same benchmark with CPython 3.10:

(ijson) $ python benchmark.py 
#mbytes,method,test_case,backend,time,mb_per_sec
0.191, basic_parse, long_list, python, 0.154, 1.235
0.191, basic_parse, long_list, yajl2, 0.091, 2.093
0.191, basic_parse, long_list, yajl2_cffi, 0.089, 2.154
0.191, basic_parse, long_list, yajl2_c, 0.008, 24.960
1.886, basic_parse, big_int_object, python, 0.327, 5.764
1.886, basic_parse, big_int_object, yajl2, 0.177, 10.642
1.886, basic_parse, big_int_object, yajl2_cffi, 0.167, 11.311
1.886, basic_parse, big_int_object, yajl2_c, 0.017, 107.875
2.077, basic_parse, big_decimal_object, python, 0.343, 6.053
2.077, basic_parse, big_decimal_object, yajl2, 0.192, 10.839
2.077, basic_parse, big_decimal_object, yajl2_cffi, 0.177, 11.746
2.077, basic_parse, big_decimal_object, yajl2_c, 0.028, 74.584
1.801, basic_parse, big_null_object, python, 0.270, 6.667
1.801, basic_parse, big_null_object, yajl2, 0.101, 17.869
1.801, basic_parse, big_null_object, yajl2_cffi, 0.111, 16.208
1.801, basic_parse, big_null_object, yajl2_c, 0.014, 131.166
1.849, basic_parse, big_bool_object, python, 0.272, 6.803
1.849, basic_parse, big_bool_object, yajl2, 0.106, 17.429
1.849, basic_parse, big_bool_object, yajl2_cffi, 0.117, 15.738
1.849, basic_parse, big_bool_object, yajl2_c, 0.026, 70.817
2.649, basic_parse, big_str_object, python, 0.312, 8.488
2.649, basic_parse, big_str_object, yajl2, 0.151, 17.525
2.649, basic_parse, big_str_object, yajl2_cffi, 0.142, 18.710
2.649, basic_parse, big_str_object, yajl2_c, 0.016, 163.509
8.000, basic_parse, big_longstr_object, python, 0.323, 24.801
8.000, basic_parse, big_longstr_object, yajl2, 0.153, 52.134
8.000, basic_parse, big_longstr_object, yajl2_cffi, 0.143, 56.138
8.000, basic_parse, big_longstr_object, yajl2_c, 0.016, 510.421
19.264, basic_parse, object_with_10_keys, python, 3.236, 5.954
19.264, basic_parse, object_with_10_keys, yajl2, 1.582, 12.178
19.264, basic_parse, object_with_10_keys, yajl2_cffi, 1.490, 12.932
19.264, basic_parse, object_with_10_keys, yajl2_c, 0.168, 114.446
0.381, basic_parse, empty_lists, python, 0.159, 2.398
0.381, basic_parse, empty_lists, yajl2, 0.041, 9.251
0.381, basic_parse, empty_lists, yajl2_cffi, 0.073, 5.217
0.381, basic_parse, empty_lists, yajl2_c, 0.010, 36.912
0.381, basic_parse, empty_objects, python, 0.160, 2.390
0.381, basic_parse, empty_objects, yajl2, 0.041, 9.342
0.381, basic_parse, empty_objects, yajl2_cffi, 0.073, 5.203
0.381, basic_parse, empty_objects, yajl2_c, 0.010, 36.672

jpmckinney commented 1 year ago

Ah, I'm on macOS arm64, so that might be the reason – there's no arm64 wheel for PyPy on macOS.

So it looks like on PyPy (on that benchmark): yajl2_cffi > python > yajl2_c > yajl2.

That said, yajl2_c on CPython seems fastest all around.

rtobar commented 1 year ago

Yes, that seems to be more or less the order. Still, I'd hesitate to make a decision based on those numbers alone; if you (or someone else) could provide more real-life numbers it'd be great -- things might be different on macOS arm64, for example.
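If the default order were ever made interpreter-specific, the selection could look roughly like the sketch below. This is not ijson's current code: pick_backend, both order tuples, and the injectable importer are hypothetical, and the PyPy order simply mirrors the single benchmark run in this thread.

```python
import importlib
import platform

# Hypothetical preference orders, derived only from the numbers in this thread.
CPYTHON_ORDER = ("yajl2_c", "yajl2_cffi", "yajl2", "python")
PYPY_ORDER = ("yajl2_cffi", "python", "yajl2_c", "yajl2")


def pick_backend(implementation=None, importer=importlib.import_module):
    """Return (name, module) for the first backend that imports successfully.

    `implementation` defaults to the running interpreter; `importer` can be
    replaced in tests so that no real backend needs to be installed.
    """
    implementation = implementation or platform.python_implementation()
    order = PYPY_ORDER if implementation == "PyPy" else CPYTHON_ORDER
    for name in order:
        try:
            return name, importer("ijson.backends." + name)
        except ImportError:
            continue  # try the next candidate
    raise ImportError("no ijson backend could be imported")
```

Keeping the order in plain tuples makes it cheap to revise once more real-life measurements are available.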

jpmckinney commented 1 year ago

I probably won't be able to, as I can't figure out how to make ijson find YAJL headers on PyPy. Feel free to close the issue.

rtobar commented 1 year ago

OK, thanks for the feedback! I'll close this now, but this issue should be a good reference for future PyPy users.

ICRAR / ijson

Is the yajl_c backend supported on PyPy? #82