Closed JeroenVerstraelen closed 2 weeks ago
added some docs at https://github.com/ESA-APEx/apex_algorithms/blob/main/docs/benchmarking.md
most important aspect to still cover under this ticket:
First results from pushing some metrics as parquet to S3:
and now with unrolling the usage stats:
merged PR #26 which covers the part about partitioned parquet files on S3
Ok, I think it's time to close this ticket. There were quite a few aspects to it and I'm still unsure about some parts of the current approach, so this feels mostly like a proof-of-concept solution. Also, not all of this ticket's original requirements are met, but those were "TBD" anyway.
Some details and discussion about this PoC:
track_metric, provided by pytest plugin apex_algorithm_qa_tools.pytest.pytest_track_metrics. Currently collected metrics/properties:
these metrics are written in parquet format to S3 (bucket "APEx-benchmarks"), in a folder structure starting at "metrics/v1/metrics.parquet"
data is written using "pyarrow.parquet.write_to_dataset" in existing_data_behavior=overwrite_or_ignore mode: there is no appending to an existing file, but each run results in a separate file on S3 (e.g. "metrics/v1/metrics.parquet/2024-08/gh-10612685244-0.parquet").
illustration of current "file listing" (containing the results of 5 benchmark runs):
[<FileInfo for 'APEx-benchmarks/metrics/v1/metrics.parquet/2024-08': type=FileType.Directory>,
<FileInfo for 'APEx-benchmarks/metrics/v1/metrics.parquet/2024-08/2355975626d246df866dae027936bd3d-0.parquet': type=FileType.File, size=6345>,
<FileInfo for 'APEx-benchmarks/metrics/v1/metrics.parquet/2024-08/7b65314e31614f0fb8390a4b20c70484-0.parquet': type=FileType.File, size=6347>,
<FileInfo for 'APEx-benchmarks/metrics/v1/metrics.parquet/2024-08/d8b177d03e32493fae8bc0ace6fdf5f3-0.parquet': type=FileType.File, size=6345>,
<FileInfo for 'APEx-benchmarks/metrics/v1/metrics.parquet/2024-08/gh-10612404377-0.parquet': type=FileType.File, size=7215>,
<FileInfo for 'APEx-benchmarks/metrics/v1/metrics.parquet/2024-08/gh-10612685244-0.parquet': type=FileType.File, size=7223>]
pyarrow.parquet.read_table. I guess this is because of lack of S3 dir/file listing permissions. Needs more investigation.
Add metrics to the initial benchmark tests.
Goal: partitioned parquet file (partitioned on: time based, benchmarking scenarios)