amazon-ion / ion-python

A Python implementation of Amazon Ion.
https://amazon-ion.github.io/ion-docs/
Apache License 2.0

SimpleIon load performance: 7 min on a 30 MB Ion text file #66

Open chenliu0831 opened 6 years ago

chenliu0831 commented 6 years ago

Hi, it looks like there's a big performance issue with the simpleion load/dump functions. I ran a small experiment:

The JSON load performance on the same data is ~550 ms. The IonCPP utils are also pretty fast for both text and binary. Any plans to speed up the dumper?
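A timing experiment like the one above can be reproduced with a small `timeit` harness. The sketch below uses the stdlib `json` module on generated data as a stand-in payload; swapping in ion-python's `simpleion.loads` (assuming `amazon.ion` is installed) is shown in a comment.

```python
import json
import timeit

# Build a moderately large JSON document as a stand-in test payload.
payload = json.dumps(
    {"records": [{"id": i, "name": f"item-{i}"} for i in range(10_000)]}
)

def parse():
    return json.loads(payload)

# To benchmark ion-python instead, swap in (assuming amazon.ion is installed):
#     from amazon.ion import simpleion
#     def parse():
#         return simpleion.loads(payload)

elapsed = timeit.timeit(parse, number=10)
print(f"10 parses took {elapsed:.3f}s ({elapsed / 10:.3f}s each)")
```

Running both variants against the same payload gives a like-for-like comparison of the two parsers.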

Thanks!

wesboyt commented 6 years ago

There are other, dramatically faster APIs available for reading your data. They are less user-friendly, but I would suggest using your own reader/writer instead of relying on the DOM.

chenliu0831 commented 6 years ago

Well, dump/load is pretty much the de facto user-facing API for serialization libraries like JSON, YAML, etc. I don't think it's a great idea to require users to write their own reader/writer.

zslayton commented 6 years ago

@chenliu0831 Can you share the code and compressed example data that you used for your test? That would be a big help for profiling.

wesboyt commented 6 years ago

I stand corrected; the standard APIs don't exist in ion-python. tgregg pointed me to: https://ion-python.readthedocs.io/en/latest/amazon.ion.html#module-amazon.ion.reader_text

It is an API to which you feed events and which yields partial results back to you; however, it is probably just as slow as simpleion, since it is an asynchronous coroutine.

tgregg commented 6 years ago

@chenliu0831 There is one main reason for the difference in performance between Python's built-in JSON parser (simplejson) and ion-python's simpleion module: simplejson binds to a C extension, while simpleion is implemented in pure Python. I started working on a C extension for ion-python a while ago and found that it effectively closed the performance gap. However, we haven't gotten around to finishing it yet.
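The size of the gap tgregg describes can be measured directly with the stdlib, since CPython's own `json` module follows the same pattern: it prefers the `_json` C extension and falls back to a pure-Python scanner. The sketch below forces a decoder onto the pure-Python path and times both.

```python
import json
import json.decoder
import json.scanner
import timeit

# Force a JSONDecoder onto the pure-Python code path. py_scanstring and
# py_make_scanner are the stdlib's own fallbacks for when the _json C
# extension is unavailable. parse_string must be set before building the
# scanner, because py_make_scanner captures it at creation time.
py_decoder = json.decoder.JSONDecoder()
py_decoder.parse_string = json.decoder.py_scanstring
py_decoder.scan_once = json.scanner.py_make_scanner(py_decoder)

doc = json.dumps([{"id": i, "tags": ["a", "b", "c"]} for i in range(5_000)])

c_time = timeit.timeit(lambda: json.loads(doc), number=20)   # C scanner
py_time = timeit.timeit(lambda: py_decoder.decode(doc), number=20)
print(f"C extension: {c_time:.3f}s, pure Python: {py_time:.3f}s")
```

The pure-Python decoder produces identical results, just more slowly, which is the same trade-off a C extension for simpleion is meant to eliminate.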

chenliu0831 commented 6 years ago

@tgregg that's great to hear, looking forward to the gap being closed. I'll try to add some example data for testing in this issue.

I still think the "dom"-based serialization might be worth some more investment. Do we really expect every user to step in/out + next for their specific use case (which is not generic at all)? In the best case, some third party will re-invent the load/dump implementation.

wesboyt commented 6 years ago

Absolutely, perf improvement is a good goal. We get this same question fairly often with ion-java, and the reality is that a DOM is inherently slower than a streamlined reader/writer, especially if you only care about a small subset of the data in a payload.

If all you are doing is translating text to binary before reading in Python via load/dump, you might find significant performance gains by using ion-c or ion-java for that section of the logic, then relying on ion-python for your reading logic.

tgregg commented 5 years ago

I've pushed a branch that contains my in-progress work on a C extension here: https://github.com/amzn/ion-python/tree/c-extensions-experimental-debug

This code is all very rough since I was just trying to get something to work. Many cleanups will be required (make note of TODOs in the code and/or commit messages). But if someone needs a C extension in the near-term, this could serve as a starting point.

This should be rebased on top of the latest version of ion-python. I believe there are some small fixes to ion-python that are mixed in; those should be extracted into separate changes if they are still necessary. Also, ion-c has received many updates/fixes since this code was last touched.

The C extension prototype is contained within ioncmodule.c and its header _ioncmodule.h. ionc_ctypes.py contains some early experimentation and should be deleted.

The README contains a section describing what I had to do in order to get even a minimally useful debugging environment set up for the C extension. Maybe the state of the art in this area has improved in the past year and a half, but I don't know.

This branch also contains a blocking implementation of the binary reader and writer in blocking.py. The simpleion module can use this implementation when dumping/loading binary data (see the last three lines of simpleion for the code changes currently required to enable this); the same could be done for text. This led to significant performance improvements compared to using the non-blocking API under the hood of simpleion, since most of the performance issues seem to be related to coroutine overhead. The improvement is smaller than a C extension would provide, but it may be good enough for some use cases. I suggest evaluating this first, because it's much simpler than a C extension.
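The coroutine overhead mentioned above is easy to demonstrate in isolation. The toy comparison below (not ion-python code, just an illustration of the two styles) processes the same values once via a plain loop and once via a generator-based coroutine driven with `.send()`, the pattern ion-python's non-blocking readers are built on.

```python
import timeit

def process_direct(values):
    # Blocking style: a plain loop, no per-item dispatch overhead.
    total = 0
    for v in values:
        total += v
    return total

def accumulator():
    # Coroutine style: the caller feeds values in one at a time via .send(),
    # paying a generator suspend/resume on every item.
    total = 0
    while True:
        v = yield total
        if v is None:
            return
        total += v

def process_coroutine(values):
    co = accumulator()
    total = co.send(None)  # prime the coroutine
    for v in values:
        total = co.send(v)
    co.close()
    return total

data = list(range(100_000))
t_direct = timeit.timeit(lambda: process_direct(data), number=10)
t_coro = timeit.timeit(lambda: process_coroutine(data), number=10)
print(f"direct: {t_direct:.3f}s, coroutine: {t_coro:.3f}s")
```

Both compute the same result; the difference is pure dispatch overhead, which is why a blocking implementation helps even without dropping into C.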

If anyone wants to look into this, let me know. I'll do my best to consult as time allows.

chenliu0831 commented 5 years ago

@tgregg thanks for all the details; that's great to hear. As you suggested, I might start cherry-picking from blocking.py and evaluate it in a branch based on current master.

Cheers

mrx23dot commented 3 years ago

As I measured, Ion is at least 5-10x slower than the built-in json loads/dumps. It takes many seconds to execute on a dictionary of only 300 KB.

tgregg commented 3 years ago

There is an effort underway to back load(s)/dump(s) with a C extension (like the built-in JSON module), which will help close the performance gap.

https://github.com/amzn/ion-python/pull/149

stevenmanton commented 2 years ago

I see that the PR with the C extension was merged, but I'm still seeing very slow parsing using amazon.ion.simpleion.loads. As a comparison, I rewrote a subset of the Ion grammar in Lark, and using the LALR parser is roughly 10x faster: 180 ms instead of 1.6 s to parse an Ion string with 100k characters.

I installed amazon.ion from the wheel on PyPI. Is there a chance that the wheel isn't using the compiled C bindings?

cheqianh commented 2 years ago

Hi @stevenmanton, thank you for letting us know! There is a small issue that makes ion-python 0.9.0 fall back to the pure-Python implementation, and we are working on it. It doesn't seem that we uploaded wheels to PyPI; did you mean the tar.gz?

Here are a few more configuration details, if you are interested:

For the source distribution (tar.gz), we try to avoid uploading C binaries to PyPI while including all built binaries (from the script here) in the final directory after installation. However, in our current configuration logic here, we don't include the C binaries, so the C extension is not enabled by default.
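For readers unfamiliar with how a C extension gets wired into a Python package at all: it is declared as an `Extension` in setup.py so the build compiles it and ships the binary inside the installed package. The sketch below is purely illustrative; the module and file names are hypothetical, not ion-python's actual build configuration.

```python
from setuptools import Extension

# Hypothetical sketch of declaring an optional C extension. Names are
# illustrative only, not ion-python's real build configuration.
ion_c_extension = Extension(
    name="amazon.ion.ionc",      # import path of the compiled module
    sources=["ioncmodule.c"],    # C source file(s) to compile
    optional=True,               # fall back to pure Python if the build fails
)

# In setup.py this would then be passed to setup():
#     from setuptools import setup
#     setup(name="amazon.ion", ext_modules=[ion_c_extension])
```

Marking the extension `optional` lets installation succeed on systems without a compiler, which is one way to get the "C extension when possible, pure Python otherwise" behavior discussed in this thread.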

stevenmanton commented 2 years ago

Thanks @cheqianh! You're right, I meant the source distribution on PyPI.

I tried installing 0.8.0 from PyPI and I'm seeing similarly slow parsing. Are you saying that the performance issue is a regression, or that the merged fix isn't making its way into the installable Python package?

cheqianh commented 2 years ago

0.8.0 includes all changes before the C extension. 0.9.0 adds the C extension, but it is not enabled by default. It's a configuration issue and needs to be fixed.

One possible workaround is to check out the package directly from the GitHub repository (rather than installing from PyPI) and set it up following this instruction. Please let me know if you have any questions.