Sample files multiple_top_level_object.json and multiple_top_level_object.cbor include equivalent data:
{"name":"John", "age":30, "car":null}
{"name":"Mike", "age":33, "car":null}
{"name":"Jack", "age":24, "car":null}
Commit 74d89ad refactors the loaders (read APIs) into separate files. Below are the new results for the baseline file, a large log:

| name | file_size(B) | time_min(ns) | time_mean(ns) | memory_usage_peak(B) |
|---|---|---|---|---|
| Ion - before | 22263038 | 4730782811 | | 171,700 |
| Ion - after | 22263038 | 5092575416.60 | 5109052881 | 276,500 |
| json - before | 145330385 | 875974056 | | 77,446 |
| json - after | 145330385 | 871989650.00 | 872934369 | 112,139 |
| cbor - before | 116960762 | 995595161 | | 51,222 |
| cbor - after | 116960762 | 985745625.00 | 991288347 | 87,769 |
I noticed that Ion is slower and has a higher memory peak, while JSON and CBOR are a little faster.

Below is the table showing the differences in write performance after the dump refactor commit 6dc3d96:
| name | file_size(B) | time_min(ns) | time_mean(ns) | memory_usage_peak(B) |
|---|---|---|---|---|
| Ion - before | 22263038 | 2,332,061,975 (dump one by one)<br>2,010,238,558 (dump everything at once) | | 43540 (dump one by one)<br>42,573,352 (dump everything at once) |
| Ion - after | 22263038 | ~~4,139,845,225~~ (wrong result, didn't use the latest commit)<br>1,849,542,250 (dump everything at once) | | 42,573,352 |
| json - before | 145330385 | 3,981,317,292 | | 143,755,789 |
| json - after | 145330385 | 4,018,459,501 | | 143,756,301 |
| cbor - before | 116960762 | 1,303,162,089 | | 19,301 |
| cbor - after | 116960762 | 1,291,955,078 | | 19,813 |
Why is Ion read and write performance worse after the refactor?
Commit 2b177af addressed the comment. A few comments and highlights are below.
Re: https://github.com/amazon-ion/ion-python/pull/315#discussion_r1442404936

It seems the generator can't be fully consumed by `test_fun`: when I pass a generator to a defined `test_fun`, it only parses the first value and returns. I refactored the `benchmark_spec.get_data_object()` method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call `list(the_generator)` to convert it into a list before passing it to `test_fun`. Do you have any concerns or comments about this?
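For illustration, a minimal sketch (hypothetical names, not the actual benchmark harness) of one way a generator misbehaves in a timing loop and why materializing it with `list()` avoids the problem: a generator can only be consumed once, so any timing run after the first sees no data.

```python
import timeit

def get_data_object():
    # Stand-in for a benchmark_spec.get_data_object() that returns a generator.
    return (obj for obj in ({"name": "John"}, {"name": "Mike"}, {"name": "Jack"}))

def test_fun(data):
    for obj in data:        # pretend to serialize/deserialize each object
        pass

gen = get_data_object()
timeit.timeit(lambda: test_fun(gen), number=10)    # only the first run sees values

data = list(get_data_object())                     # materialize once, reuse safely
timeit.timeit(lambda: test_fun(data), number=10)   # every run iterates all values
```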
Re: https://github.com/amazon-ion/ion-python/pull/315#discussion_r1442408471 I'm going to investigate why this happens and determine which benchmarking methods are more accurate. Based on the previous results, there is a 20% difference between these two approaches.
After the refactor, Ion write time regressed from ~2e9 ns to ~4e9 ns. I'm not sure why; I'm investigating it now.
There are some conflicts with the main branch (I guess because of the recent bare_value PR); I'll resolve them at the end.
> It seems that the generator can't be fully consumed by `test_fun`; it will only parse the first value and return when I pass a generator to a defined `test_fun`. I refactored the `benchmark_spec.get_data_object()` method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call `list(the_generator)` to convert it into a list before passing it to `test_fun`. Do you have any concerns or comments about this?

Can't we just pull off the generator and do nothing with the values in `test_fun`?
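A minimal sketch of that suggestion (hypothetical `test_fun`): drain the generator without keeping the values, e.g. with a zero-length `deque`. Note that this consumes the generator, so each timing iteration would need a fresh one.

```python
from collections import deque

def test_fun(gen):
    # Pull every value off the generator and discard it immediately,
    # without building a list in memory.
    deque(gen, maxlen=0)
```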
Highlight No. 3 in my comment above is solved, and I updated the metrics table for the dump APIs.
This is because, for the `before` metrics, we benchmarked the latest ion-python commit to see how much improvement we have made on the write side. However, since the PR is not merged yet, the `after` result is still based on the latest released version, 0.11.3. I manually overwrote it and updated the metrics. In the future, I will use commit d425587 (the current latest commit) for benchmarking comparisons in this PR.
I made the changes in three separate commits for easier visibility. To review them together, you can find the link here.
As discussed offline, a fair apples-to-apples comparison would involve fully marshalling all top-level objects into memory. The library would then write them to the destination file as a stream.
Here are the details:
- `load_dump.load()` returns an iterator. The benchmark CLI will accumulate the execution time for deserializing each top-level object.
- For writes, the benchmark CLI calls `test_fun` with a series of top-level objects and benchmarks the time it takes to write these objects as a stream.
- I also modified the `benchmark_spec.get_data_object()` method to return a list instead of a generator. This change addresses a GHA pipeline failure on Windows PyPy. The root cause is that the method needs to keep the file open when returning a generator, which can cause invalid accesses by other processes on some platforms (see the sketch below).
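A minimal sketch (not the actual `benchmark_spec` code) of the file-handle issue described above: a generator keeps the source file open until it is fully consumed, while returning a list lets the file be closed before the method returns.

```python
def read_objects_lazy(path):
    # The file stays open until the generator is exhausted (or garbage
    # collected), which can block other processes on some platforms.
    with open(path) as f:
        for line in f:
            yield line

def read_objects_eager(path):
    # The file is opened, read, and closed before the function returns.
    with open(path) as f:
        return [line for line in f]
```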
Description:
This PR adds support for the multiple top-level values use case, so that we have an apples-to-apples performance comparison for Ion/JSON/CBOR.
Note:
- This PR focuses on `io_type=file`; we might consider adding `io_type=buffer` later, but we need to figure out how to fairly use JSON's and CBOR's load/dump APIs to iterate over each of multiple top-level objects - https://github.com/amazon-ion/ion-python/issues/325.
- When we benchmark read speed for a JSON document that includes multiple top-level objects, we assume the given data is in JSONL, so we have to modify our JSON sample data - code here (see the sketch after these notes).
- I'm not sure if protobuf has the same issue for the multiple top-level objects use case. This needs to be figured out when we come back to protobuf in the future - https://github.com/amazon-ion/ion-python/issues/326.
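As a rough illustration of the JSONL note above (hypothetical file names, not the linked code), converting a sample file that holds a single JSON array into one document per line:

```python
import json

# Read a file containing one JSON array and rewrite it as JSONL,
# i.e. one top-level object per line.
with open("sample_list.json") as src, open("sample.jsonl", "w") as dst:
    for obj in json.load(src):
        dst.write(json.dumps(obj) + "\n")
```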
I used three files for testing locally. I uploaded them for visibility, but they will not be included when this PR is merged into the main branch:
1. amazon/ionbenchmark/json_newline_vs_list_repro.py
2. amazon/ionbenchmark/format_conversion_correct.py
3. amazon/ionbenchmark/format_conversion_inaccurate.py
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.