airspeed-velocity / asv

Airspeed Velocity: A simple Python benchmarking tool with web-based reporting
https://asv.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add param names to results format #882

Open — Alphare opened this issue 5 years ago

Alphare commented 5 years ago

I saw the PR about the new results format (#847), and while I am very happy to see these kinds of improvements, there remains a major issue with the results with regard to forward-compatibility.

We have a lot of benchmarks whose parameters we update frequently, and migrating the old results is a big chore. Would you object to adding the parameter names in an array so that we can easily construct a matrix from the file in isolation?
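
For illustration, here is a rough sketch of what I mean, assuming a results entry shaped like asv's existing params lists; the field names and values are invented, not the actual schema:

```python
import itertools
import json

# Hypothetical entry: the existing per-parameter value lists plus the
# proposed "param_names" array, so the matrix can be rebuilt in isolation.
entry = {
    "param_names": ["n", "dtype"],                    # the proposed addition
    "params": [["100", "1000"], ["int", "float"]],    # per-parameter value lists
    "result": [0.001, 0.002, 0.010, 0.021],           # one value per combination
}

# Reconstruct (parameter combination -> value) without consulting benchmarks.json,
# assuming results are ordered like itertools.product of the params lists.
combos = itertools.product(*entry["params"])
matrix = [
    dict(zip(entry["param_names"], combo), value=value)
    for combo, value in zip(combos, entry["result"])
]
print(json.dumps(matrix, indent=2))
```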

pv commented 5 years ago

Iirc, it was not included previously either, as the parameter names are used only for display purposes.

Generally, the conversion strategy in results.py should be complete, so maybe you want the new field for some custom purpose?

The parameter names probably won't increase the file size significantly (probably best to check, though), so there's no big concern. However, they were intended mainly for display purposes.

Alphare commented 5 years ago

It was never in the results files, indeed. It's an old issue we've had with the format, but I'm only bringing this up now because we're going to start a lot more benchmarking.

I don't think size will be an issue. JSON is already pretty wasteful in itself, and repeating those parameters would be an excellent use case for compression.
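
As a quick, purely illustrative check of that claim (the benchmark data below is made up), repeating the same parameter names in every entry compresses away almost entirely:

```python
import gzip
import json

# Build a payload of many entries that all repeat the same parameter names,
# then compare raw vs gzipped size.
entry = {
    "param_names": ["n", "dtype"],
    "params": [["100", "1000"], ["int", "float"]],
    "result": [0.001, 0.002, 0.010, 0.021],
}
payload = json.dumps({f"bench_{i}": entry for i in range(1000)}).encode()
print(len(payload), "bytes raw,", len(gzip.compress(payload)), "bytes gzipped")
```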

I only suggested the additional-array approach because moving to an object would most likely be too big a change for upstream. An object would be my preferred approach, but rebuilding the matrix from the arrays is trivial anyway.

Alphare commented 5 years ago

After more digging around in the results format, we decided to opt for the following:

  1. Use objects instead of arrays to facilitate results manipulation (no more zips and products; see the sketch after this list)
  2. Move the parameter names from benchmarks.json into the results to prevent desync (see also 1)
  3. In general, be as effortless to work with as possible
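
For concreteness, a rough sketch of what 1) and 2) could look like side by side; the field names are invented for illustration, not the actual asv schema:

```python
# Array style (roughly the current shape): parallel lists of parameter values
# and results, with the parameter names living elsewhere (benchmarks.json).
array_style = {
    "params": [["100", "1000"], ["int", "float"]],
    "result": [0.001, 0.002, 0.010, 0.021],
}

# Object style (the proposal): each measurement is a self-describing object,
# so no zip/product is needed and the names cannot drift out of sync.
object_style = {
    "results": [
        {"n": "100",  "dtype": "int",   "value": 0.001},
        {"n": "100",  "dtype": "float", "value": 0.002},
        {"n": "1000", "dtype": "int",   "value": 0.010},
        {"n": "1000", "dtype": "float", "value": 0.021},
    ],
}
```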

It seems to me that the main advantage of JSON is ease of use, both from a human's and a computer's perspective, not compactness in particular. Compactness is the job of compression software, binary formats, and, more to the point, databases. If there really are concerns (or even experiences) related to the disk usage of results, maybe we should consider an alternative SQLite backend, for example.
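
Purely as a sketch of what such a backend could look like (the table layout and benchmark name below are assumptions, not anything asv provides):

```python
import json
import sqlite3

# Illustrative only: one flat table, one row per (commit, benchmark, params).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE results (
        commit_hash TEXT,
        benchmark   TEXT,
        params      TEXT,   -- JSON-encoded, e.g. {"n": "100", "dtype": "int"}
        value       REAL
    )
""")
con.execute(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    ("abc123", "suite.TimeSuite.time_example",
     json.dumps({"n": "100", "dtype": "int"}), 0.001),
)
print(con.execute("SELECT * FROM results").fetchall())
```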

Forward-compatibility is a must for us, I hope we can find common ground on this issue.

pv commented 4 years ago

The change in the file format was partly motivated by real concerns and experience with space usage, which is not only a question of disk space (practically unlimited today).

Using an SQL/SQLite backend is one option I considered, but maintaining two backends does not seem like a good idea at this point, nor is SQLite particularly suited to the append-new-files approach. Similarly, just gzipping the JSON files was considered, but that breaks deduplication, diffs, and other text-like handling of the files.

The larger motivation, however, was that the v1 format had grown various quirks from the attempt to keep backward compatibility, and was not particularly easy to work with. My take is that a "flat" database-style format is preferable and simpler in practice for this type of data than a deeply nested object-oriented format. The v2 format is closer to this, but not fully there: the parametrized benchmarks should probably also be split out into separate top-level benchmark entries rather than grouped in lists, but that's left for later.
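
Roughly what I mean by "flat", as a sketch only (the key format and field names are illustrative assumptions, not a committed design):

```python
# Each parameter combination becomes its own top-level entry rather than a
# position inside nested lists, so lookups by name need no unzipping.
flat_results = {
    "suite.TimeSuite.time_example(n=100, dtype=int)":    {"value": 0.001},
    "suite.TimeSuite.time_example(n=100, dtype=float)":  {"value": 0.002},
    "suite.TimeSuite.time_example(n=1000, dtype=int)":   {"value": 0.010},
    "suite.TimeSuite.time_example(n=1000, dtype=float)": {"value": 0.021},
}
```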

Whether or not key names are repeated in the JSON (i.e., the zip/product question) is, I think, an unimportant detail of space optimization, since it can be encapsulated in a small piece of code. However, it was a cheap way to achieve additional space savings at the same time as revising the file format.