Measure reading `parameter_value` JSON blobs

suvayu commented 2 months ago

Use spinedb_api to access the data
Explore BB_data.sqlite for typical and worst-case amounts of data per data type:
- [ ] where to find them?
- Write code to read the database and retrieve data for all (current) data types.
- [ ] Date-time
- [ ] Duration
- [ ] Time-pattern
- [ ] Time-series
- [ ] Array
- [x] Map
Profile the steps:
- (de)serialisation
- library (spinedb_api) overhead
Note that a partial read is not possible: a blob has to be read and decoded in full.
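
As a rough starting point, the (de)serialisation step could be timed without spinedb_api at all, which gives a baseline for estimating the library overhead later. The sketch below is an assumption-laden illustration, not the project's script: it assumes a `parameter_value` table with a `type` column and a JSON-encoded `value` blob, and it runs against a small in-memory stand-in rather than BB_data.sqlite. The function name `time_blob_decoding` and the sample rows are hypothetical.

```python
import json
import sqlite3
import time
from collections import defaultdict


def time_blob_decoding(con):
    """Fetch every blob, then accumulate JSON decoding time per value type."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0, "seconds": 0.0})
    # Assumed schema: parameter_value(type TEXT, value BLOB).
    rows = con.execute("SELECT type, value FROM parameter_value").fetchall()
    for value_type, blob in rows:
        if blob is None:
            continue
        start = time.perf_counter()
        json.loads(blob)  # the whole blob is decoded; there is no partial read
        elapsed = time.perf_counter() - start
        entry = stats[value_type or "scalar"]  # rows without a type are grouped as "scalar"
        entry["count"] += 1
        entry["bytes"] += len(blob)
        entry["seconds"] += elapsed
    return dict(stats)


# In-memory stand-in with made-up rows; swap in sqlite3.connect("BB_data.sqlite")
# once the schema assumption above has been checked.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parameter_value (type TEXT, value BLOB)")
con.executemany(
    "INSERT INTO parameter_value VALUES (?, ?)",
    [
        ("map", json.dumps({"type": "map", "data": {"a": 1.0, "b": 2.0}}).encode()),
        ("array", json.dumps({"type": "array", "data": [1, 2, 3]}).encode()),
        (None, b"2.5"),
    ],
)
report = time_blob_decoding(con)
for value_type, entry in sorted(report.items()):
    print(value_type, entry)
```

Summing bytes per type alongside the timings also answers the "typical and worst-case amounts of data" question in the same pass.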

Additional comments

Both example DBs are in the 0.7 format. The first time you open one using spinedb_api.DatabaseMapping, pass the flag migrate=True to convert it to 0.8; drop the flag afterwards.
To get started, you can have a look at dbmap.py
Toolbox uses spinedb_api in a particular way, so at some point our benchmark needs to mirror that usage to be representative of real-world performance.

suvayu commented 1 month ago

~Start with lineprof b/c Ole used it in the past (random choice)~ (it's for R)
Maybe you meant line_profiler?
- double check the sampling window fits within the time window of processes of interest

Just thought I would list a couple of options out there:

cProfile: in the standard library
scalene: includes a memory profiler, which could be useful for running a memory profile in parallel
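
For reference, a minimal cProfile run over a JSON decoding loop could look like the sketch below. The blobs are synthetic stand-ins and `decode_all` is a hypothetical helper, so the numbers say nothing about the real dataset; the point is only the enable/disable and `pstats` plumbing.

```python
import cProfile
import io
import json
import pstats


def decode_all(blobs):
    """Decode every JSON blob; this is the code under measurement."""
    return [json.loads(blob) for blob in blobs]


# Synthetic stand-in data, not taken from BB_data.sqlite.
blobs = [
    json.dumps({"type": "array", "data": list(range(1000))}).encode()
    for _ in range(200)
]

profiler = cProfile.Profile()
profiler.enable()
decoded = decode_all(blobs)
profiler.disable()

# Collect the report in a string so it can be saved or compared later.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
profile_report = buffer.getvalue()
print(profile_report)
```

Because cProfile is deterministic, it records every call, which adds overhead to short functions; that is worth keeping in mind when comparing its numbers with a sampling profiler.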

OleMussmann commented 1 month ago

After checking some more, line_profiler does not make much sense. The code to be profiled needs to be decorated. Furthermore, it only measures how long the literal lines of the decorated code take, without descending into function calls.

I've been looking into, and documenting, the following:

Deterministic profilers
- cProfile
- yappi
Sampling profilers
- py-spy
- scalene
- pyinstrument

Quite surprisingly, the output of the profilers is rather different and, to me, hard to interpret. That's something I'd like to sort out together with you next time.

suvayu commented 1 month ago

Perfect timing! We can discuss tomorrow :-)

OleMussmann commented 1 month ago

For later discussion: let's check how close these benchmarks are to real-world applications:

https://github.com/spine-tools/Spine-Database-API/tree/b6ee1cb7f35e9e628436a4912aab15f16e5852f6/benchmarks

suvayu commented 1 month ago

@OleMussmann I added a script and some SQL to the README in the spinedbapi dir that can read the BB_data.sqlite dataset.

The script parses the blob into a list of pandas.Series, which may be a good point of comparison with whatever Spine DB API does.

OleMussmann commented 1 month ago

Copied from: https://github.com/ESI-FAR/Mopo-arrow_sqlite/issues/8#issuecomment-2261565312

Possible encodings for the index part of map types:

Dictionary-encoded array (equivalent to a pandas.Categorical)
Run-end encoding: RunEndEncodedArray in PyArrow
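
To make the two options concrete, here is a plain-Python sketch of what each encoding stores for a repetitive index column. It deliberately does not use PyArrow; `dictionary_encode`, `run_end_encode`, and the sample index are hypothetical names and data chosen for illustration. The run ends are cumulative offsets, which is the run-end flavour of run-length encoding.

```python
from itertools import groupby


def dictionary_encode(values):
    """Return (dictionary, codes): unique values in first-seen order plus integer codes."""
    dictionary = []
    positions = {}
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def run_end_encode(values):
    """Return (run_values, run_ends); each run end is a cumulative, exclusive offset."""
    run_values, run_ends = [], []
    end = 0
    for value, group in groupby(values):
        end += sum(1 for _ in group)
        run_values.append(value)
        run_ends.append(end)
    return run_values, run_ends


index = ["2020", "2020", "2020", "2021", "2021", "2020"]
print(dictionary_encode(index))  # (['2020', '2021'], [0, 0, 0, 1, 1, 0])
print(run_end_encode(index))     # (['2020', '2021', '2020'], [3, 5, 6])
```

Dictionary encoding pays off when there are few distinct values in any order, while run-end encoding only helps when equal values are adjacent, as they tend to be in the outer dimensions of a nested map.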

suvayu commented 3 weeks ago

I'm not sure if this is the correct place to document this. A possible starting point for a pyarrow based implementation could be from Antti's experimental branch.

I've now pushed the Apache Arrow spinedb_api branch to the Spine-Database-API repository on GitHub, as discussed in the Toolbox UX meeting. Its name is 353_apache_arrow if you want to pull and check it out. Don't expect it to do miracles, as it is mostly a proof of concept and a sandbox for me to get acquainted with Arrow. The branch naturally requires pyarrow in addition to the usual spinedb_api dependencies to work. The beef of the branch is the spinedb_api.arrow_value module and the from_database() function in it, which provides a drop-in replacement for spinedb_api.parameter_value.from_database(). arrow_value.from_database() returns Arrow types instead of the home-brewed types returned by parameter_value.from_database(). Note that currently only some scalar types plus arrays and maps are supported. Uneven maps are "flattened" to Arrow tables such that indices missing in some dimensions are marked as null in the table. tests.test_arrow_value contains some unit tests that may be helpful... Or not.
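
The "flattening" of uneven maps described in the quote can be pictured with a small plain-Python sketch. This is not the code from the 353_apache_arrow branch; `flatten_map` and the sample data are hypothetical, and `None` stands in for an Arrow null.

```python
def flatten_map(nested):
    """Flatten a nested dict into rows of (index_1, ..., index_n, value).

    Paths shorter than the deepest one are padded with None, so every
    row has the same number of index columns.
    """

    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                yield from walk(child, path + (key,))
        else:
            yield path, node

    leaves = list(walk(nested, ()))
    width = max((len(path) for path, _ in leaves), default=0)
    return [path + (None,) * (width - len(path)) + (value,) for path, value in leaves]


# "b" has no second-level index, so its second index column is None.
uneven = {"a": {"x": 1.0, "y": 2.0}, "b": 3.0}
rows = flatten_map(uneven)
for row in rows:
    print(row)
```

Rows of equal width like these map naturally onto table columns, one per index dimension plus one for the value.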

suvayu commented 1 week ago

Another highly recommended profiler: austin-dist.