amazon-ion / ion-python

A Python implementation of Amazon Ion.
https://amazon-ion.github.io/ion-docs/
Apache License 2.0

Enhance the benchmark runner to support multiple top level objects use case. #315

Closed cheqianh closed 8 months ago

cheqianh commented 9 months ago

Description:

This PR adds support for the multiple top-level values use case, so that we have an apples-to-apples performance comparison between Ion, JSON, and CBOR.

Note:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

cheqianh commented 9 months ago

Sample files multiple_top_level_object.json and multiple_top_level_object.cbor include equivalent data:

{"name":"John", "age":30, "car":null}
{"name":"Mike", "age":33, "car":null}
{"name":"Jack", "age":24, "car":null}
cheqianh commented 9 months ago
Commit 74d89ad refactors the loader (read APIs) into separate files. Below are the new results for the baseline file, a large log:

| name | file_size(B) | time_min(ns) / time_mean(ns) | memory_usage_peak(B) |
|---|---|---|---|
| Ion - before | 22263038 | 4730782811 | 171,700 |
| Ion - after | 22263038 | 5092575416.60 / 5109052881 | 276,500 |
| json - before | 145330385 | 875974056 | 77,446 |
| json - after | 145330385 | 871989650.00 / 872934369 | 112,139 |
| cbor - before | 116960762 | 995595161 | 51,222 |
| cbor - after | 116960762 | 985745625.00 / 991288347 | 87,769 |

Note that Ion is slower and has a higher memory peak, while JSON and CBOR are slightly faster.

cheqianh commented 9 months ago

Below is a table showing the differences in write performance after the dump refactor commit 6dc3d96.

| name | file_size(B) | time(ns) | memory_usage_peak(B) |
|---|---|---|---|
| Ion - before | 22263038 | 2,332,061,975 (dump one by one) / 2,010,238,558 (dump everything at once) | 43,540 (dump one by one) / 42,573,352 (dump everything at once) |
| Ion - after | 22263038 | ~~4,139,845,225~~ (wrong result, didn't use the latest commit) / 1,849,542,250 (dump everything at once) | 42,573,352 |
| json - before | 145330385 | 3,981,317,292 | 143,755,789 |
| json - after | 145330385 | 4,018,459,501 | 143,756,301 |
| cbor - before | 116960762 | 1,303,162,089 | 19,301 |
| cbor - after | 116960762 | 1,291,955,078 | 19,813 |
tgregg commented 9 months ago

Why is Ion read and write performance worse after the refactor?

cheqianh commented 8 months ago

Commit 2b177af addresses the comments. A few notes and highlights below.

  1. Re: https://github.com/amazon-ion/ion-python/pull/315#discussion_r1442404936 It seems the generator can't be fully consumed by test_fun; when I pass a generator to a defined test_fun, it only parses the first value and returns. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun (see the sketch after this list). Do you have any concerns or comments about this?

  2. Re: https://github.com/amazon-ion/ion-python/pull/315#discussion_r1442408471 I'm going to investigate why this happens and determine which benchmarking approach is more accurate. Based on the previous results, there is a 20% difference between the two approaches.

  3. After the refactor, the Ion write time roughly doubled, from about 2e9 ns to about 4e9 ns. I'm not sure why; I'm investigating it now.

  4. There are some conflicts with the main branch (I guess because of the recent bare_value PR). I'll resolve them at the end.
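
As a side note on item 1, here is a minimal sketch (my own illustration, with made-up stand-ins for get_data_object() and test_fun) of why handing a raw generator to a repeatedly invoked benchmark function is fragile, and why materializing it with list() first keeps every repetition comparable:

```python
# A generator is a one-shot iterator: only the first timed call sees any data.
# Converting it to a list up front lets every repetition consume the same values.
import timeit

def make_values():
    # Stand-in for benchmark_spec.get_data_object(): yields top-level values.
    for i in range(100_000):
        yield {"name": "John", "age": i, "car": None}

def test_fun(data):
    # Stand-in for the benchmarked function: walk every top-level value.
    count = 0
    for _ in data:
        count += 1
    return count

gen = make_values()
print(timeit.repeat(lambda: test_fun(gen), number=1, repeat=3))
# The first repetition drains the generator; later repetitions iterate nothing.

data = list(make_values())
print(timeit.repeat(lambda: test_fun(data), number=1, repeat=3))
# Each repetition iterates the full, already-materialized list.
```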

rmarrowstone commented 8 months ago

> It seems the generator can't be fully consumed by test_fun; when I pass a generator to a defined test_fun, it only parses the first value and returns. I refactored the benchmark_spec.get_data_object() method to return a generator and reused it in other places to avoid repeating code. However, for benchmarking, we still call list(the_generator) to convert it into a list before passing it to test_fun. Do you have any concerns or comments about this?

Can't we just pull off the generator and do nothing with the values in the test_fun?
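
If that direction were taken, a minimal sketch (my own illustration) of a test_fun that drains the iterator without keeping any values around:

```python
from collections import deque

def test_fun(value_iter):
    # Pull every top-level value off the iterator and discard it;
    # deque(..., maxlen=0) is a common idiom for exhausting an iterator quickly.
    deque(value_iter, maxlen=0)
```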

cheqianh commented 8 months ago

Highlight No. 3 in my comment above is resolved, and I have updated the metrics table for the dump APIs.

The cause was that the "before" metrics were benchmarked against the latest ion-python commit, to see how much improvement we have made on the write side, while the "after" result was still based on the latest released version, 0.11.3, since this PR is not merged yet. I manually overwrote it and updated the metrics. Going forward, I will use commit d425587 (the current latest commit) for benchmarking comparisons in this PR.

cheqianh commented 8 months ago

I made the changes in three separate commits for easier visibility. To review them together, you can find the link here.

As discussed offline, a fair apples-to-apples comparison would involve fully marshalling all top-level values into memory; the library would then write them to the destination file as a stream.
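
A minimal sketch of that shape of write comparison (my own illustration, not the PR's code; the output file names are hypothetical, and sequence_as_stream reflects my understanding of simpleion.dump()):

```python
# Rough sketch: marshal all top-level values into memory first, then write
# them to the destination file as a stream, for each format.
import json

import cbor2
from amazon.ion import simpleion

values = [
    {"name": "John", "age": 30, "car": None},
    {"name": "Mike", "age": 33, "car": None},
    {"name": "Jack", "age": 24, "car": None},
]

# Ion: sequence_as_stream=True writes each element as its own top-level value.
with open("out.ion", "wb") as f:
    simpleion.dump(values, f, binary=True, sequence_as_stream=True)

# JSON: one document per line.
with open("out.json", "w") as f:
    for v in values:
        f.write(json.dumps(v) + "\n")

# CBOR: concatenated top-level items.
with open("out.cbor", "wb") as f:
    for v in values:
        cbor2.dump(v, f)
```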

Here are the details:

I also modified the benchmark_spec.get_data_object() method to return a list instead of a generator. This change addresses a GHA pipeline failure on Windows PyPy. The root cause is that the method needs to keep the file open while returning a generator, which causes invalid accesses by other processes on some platforms.
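
For illustration, a minimal sketch (my own, with a hypothetical file_path argument) of the trade-off: the generator variant must hold the file open until the caller finishes iterating, while the list variant closes the file before anything else touches it:

```python
from amazon.ion import simpleion

def data_object_as_generator(file_path):
    # The handle stays open for as long as the caller keeps iterating,
    # which can collide with other processes on some platforms (e.g. Windows).
    with open(file_path, "rb") as f:
        yield from simpleion.load(f, single_value=False, parse_eagerly=False)

def data_object_as_list(file_path):
    # All values are materialized before the file closes, so no open handle
    # outlives this call.
    with open(file_path, "rb") as f:
        return list(simpleion.load(f, single_value=False))
```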