amazon-ion / ion-python

A Python implementation of Amazon Ion.
https://amazon-ion.github.io/ion-docs/
Apache License 2.0

Enhance the benchmark runner to support multiple top level objects use case. #315

Closed cheqianh closed 8 months ago

cheqianh commented 9 months ago

Description:

This PR adds support for the multiple top-level values use case, so that we have an apples-to-apples performance comparison between Ion, JSON, and CBOR.

Note:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cheqianh commented 9 months ago

Sample files multiple_top_level_object.json and multiple_top_level_object.cbor include equivalent data:

{"name":"John", "age":30, "car":null}
{"name":"Mike", "age":33, "car":null}
{"name":"Jack", "age":24, "car":null}
cheqianh commented 9 months ago
Commit 74d89ad refactors the loader (read APIs) into separate files. Below are the new results for the baseline file, a large log:

| name | file_size(B) | time_min(ns) / time_mean(ns) | memory_usage_peak(B) |
|---|---|---|---|
| Ion - before | 22263038 | 4730782811 | 171,700 |
| Ion - after | 22263038 | 5092575416.60 / 5109052881 | 276,500 |
| json - before | 145330385 | 875974056 | 77,446 |
| json - after | 145330385 | 871989650.00 / 872934369 | 112,139 |
| cbor - before | 116960762 | 995595161 | 51,222 |
| cbor - after | 116960762 | 985745625.00 / 991288347 | 87,769 |

Note that Ion is slower and has a higher memory peak, while JSON and CBOR are slightly faster.

cheqianh commented 9 months ago

Below is a table showing the differences in write performance after the dump refactor commit 6dc3d96.

| name | file_size(B) | time(ns) | memory_usage_peak(B) |
|---|---|---|---|
| Ion - before | 22263038 | 2,332,061,975 (dump one by one) / 2,010,238,558 (dump everything at once) | 43,540 (dump one by one) / 42,573,352 (dump everything at once) |
| Ion - after | 22263038 | ~~4,139,845,225~~ (wrong result, didn't use the latest commit) / 1,849,542,250 (dump everything at once) | 42,573,352 |
| json - before | 145330385 | 3,981,317,292 | 143,755,789 |
| json - after | 145330385 | 4,018,459,501 | 143,756,301 |
| cbor - before | 116960762 | 1,303,162,089 | 19,301 |
| cbor - after | 116960762 | 1,291,955,078 | 19,813 |
tgregg commented 9 months ago

Why is Ion read and write performance worse after the refactor?

cheqianh commented 8 months ago

Commit 2b177af addresses the comments. A few notes and highlights below.

  1. Re: https://github.com/amazon-ion/ion-python/pull/315#discussion_r1442404936 It seems the generator can't be fully consumed by test_fun; when I pass a generator to a defined test_fun, it only parses the first value and returns. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun (see the sketch after this list). Do you have any concerns or comments about this?

  2. Re: https://github.com/amazon-ion/ion-python/pull/315#discussion_r1442408471 I'm going to investigate why this happens and determine which benchmarking approach is more accurate. Based on the previous results, there is a 20% difference between the two approaches.

  3. After the refactor, the Ion write time roughly doubled, from about 2e9 ns to about 4e9 ns. I'm not sure why; I'm investigating it now.

  4. There are some conflicts with the main branch (I guess because of the recent bare_value PR). I'll resolve them at the end.
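
As a side note on item 1, here is a minimal sketch (my own illustration, with made-up stand-ins for get_data_object() and test_fun) of why handing a raw generator to a repeatedly invoked benchmark function is fragile, and why materializing it with list() first keeps every repetition comparable:

```python
# A generator is a one-shot iterator: only the first timed call sees any data.
# Converting it to a list up front lets every repetition consume the same values.
import timeit

def make_values():
    # Stand-in for benchmark_spec.get_data_object(): yields top-level values.
    for i in range(100_000):
        yield {"name": "John", "age": i, "car": None}

def test_fun(data):
    # Stand-in for the benchmarked function: walk every top-level value.
    count = 0
    for _ in data:
        count += 1
    return count

gen = make_values()
print(timeit.repeat(lambda: test_fun(gen), number=1, repeat=3))
# The first repetition drains the generator; later repetitions iterate nothing.

data = list(make_values())
print(timeit.repeat(lambda: test_fun(data), number=1, repeat=3))
# Each repetition iterates the full, already-materialized list.
```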

rmarrowstone commented 8 months ago

> It seems the generator can't be fully consumed by test_fun; when I pass a generator to a defined test_fun, it only parses the first value and returns. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun. Do you have any concerns or comments about this?

Can't we just pull off the generator and do nothing with the values in the test_fun?
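
If that direction were taken, a minimal sketch (my own illustration) of a test_fun that drains the iterator without keeping any values around:

```python
from collections import deque

def test_fun(value_iter):
    # Pull every top-level value off the iterator and discard it;
    # deque(..., maxlen=0) is a common idiom for exhausting an iterator quickly.
    deque(value_iter, maxlen=0)
```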

cheqianh commented 8 months ago

Highlight No. 3 in my comment above is resolved, and I have updated the metrics table for the dump APIs.

The cause was that the "before" metrics were benchmarked against the latest ion-python commit, to see how much improvement we have made on the write side, while the "after" result was still based on the latest released version, 0.11.3, since this PR is not merged yet. I manually overwrote it and updated the metrics. Going forward, I will use commit d425587 (the current latest commit) for benchmarking comparisons in this PR.

cheqianh commented 8 months ago

I made the changes in three separate commits for easier visibility. To review them together, you can find the link here.

As discussed offline, a fair apples-to-apples comparison would involve fully marshalling all top-level values into memory; the library would then write them to the destination file as a stream.
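
A minimal sketch of that shape of write comparison (my own illustration, not the PR's code; the output file names are hypothetical, and sequence_as_stream reflects my understanding of simpleion.dump()):

```python
# Rough sketch: marshal all top-level values into memory first, then write
# them to the destination file as a stream, for each format.
import json

import cbor2
from amazon.ion import simpleion

values = [
    {"name": "John", "age": 30, "car": None},
    {"name": "Mike", "age": 33, "car": None},
    {"name": "Jack", "age": 24, "car": None},
]

# Ion: sequence_as_stream=True writes each element as its own top-level value.
with open("out.ion", "wb") as f:
    simpleion.dump(values, f, binary=True, sequence_as_stream=True)

# JSON: one document per line.
with open("out.json", "w") as f:
    for v in values:
        f.write(json.dumps(v) + "\n")

# CBOR: concatenated top-level items.
with open("out.cbor", "wb") as f:
    for v in values:
        cbor2.dump(v, f)
```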

Here are the details:

I also modified the benchmark_spec.get_data_object() method to return a list instead of a generator. This change addresses a GHA pipeline failure on Windows PyPy. The root cause is that the method needs to keep the file open while returning a generator, which causes invalid accesses by other processes on some platforms.
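
For illustration, a minimal sketch (my own, with a hypothetical file_path argument) of the trade-off: the generator variant must hold the file open until the caller finishes iterating, while the list variant closes the file before anything else touches it:

```python
from amazon.ion import simpleion

def data_object_as_generator(file_path):
    # The handle stays open for as long as the caller keeps iterating,
    # which can collide with other processes on some platforms (e.g. Windows).
    with open(file_path, "rb") as f:
        yield from simpleion.load(f, single_value=False, parse_eagerly=False)

def data_object_as_list(file_path):
    # All values are materialized before the file closes, so no open handle
    # outlives this call.
    with open(file_path, "rb") as f:
        return list(simpleion.load(f, single_value=False))
```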