ESI-FAR / Mopo-arrow_sqlite

Getting to know pyarrow and SQLite

Measure reading `parameter_value` JSON blobs #7

Open suvayu opened 2 months ago

suvayu commented 2 months ago

Additional comments

suvayu commented 1 month ago

Just thought I would list a couple of options out there:

OleMussmann commented 1 month ago

After checking some more, line_profiler does not make much sense here. The code to be profiled needs to be decorated. Furthermore, it only measures how long the literal lines of the decorated code take, without following function calls.
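For contrast, here is a minimal sketch with the standard library's cProfile, which needs no decorators and does follow function calls. This is a swapped-in illustration, not one of the profilers evaluated in this thread; `parse_blob` and `read_all` are hypothetical stand-ins for the code we want to measure.

```python
import cProfile
import io
import json
import pstats


def parse_blob(blob: bytes) -> dict:
    """Hypothetical helper: decode a single JSON blob."""
    return json.loads(blob)


def read_all(blobs: list) -> list:
    """Hypothetical outer function; only this one is called directly."""
    return [parse_blob(b) for b in blobs]


# Synthetic data, made up for the example.
blob = json.dumps({"index": list(range(100)), "data": list(range(100))}).encode()
blobs = [blob] * 1000

profiler = cProfile.Profile()
profiler.enable()
result = read_all(blobs)
profiler.disable()

# Collect the report in a string so it can be inspected or printed.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
print(report)
```

The report lists `parse_blob` and the `json` internals it calls, even though only `read_all` was invoked between `enable()` and `disable()`.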

I've been looking into, and documenting

Quite surprisingly, the output of the profilers is rather different and - to me - hard to interpret. That's something I'd like to sort out with you together next time.

suvayu commented 1 month ago

Perfect timing! We can discuss tomorrow :-)

OleMussmann commented 1 month ago

For discussing later, let's check how close these are to real-world applications:

https://github.com/spine-tools/Spine-Database-API/tree/b6ee1cb7f35e9e628436a4912aab15f16e5852f6/benchmarks

suvayu commented 1 month ago

@OleMussmann I added a script and some SQL to the README in the spinedbapi dir that can read the BB_data.sqlite dataset.

The script parses the blob into a list of pandas.Series; maybe that is a good comparison with whatever Spine DB API does?
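As a rough idea of what such a measurement could look like, here is a minimal sketch using only the standard library. It is not the script from the README: the table layout is a simplified assumption, the data is synthetic, and an in-memory database stands in for BB_data.sqlite.

```python
import json
import sqlite3
import time

con = sqlite3.connect(":memory:")
# Assumed, simplified schema; the real Spine DB schema has more columns.
con.execute(
    "CREATE TABLE parameter_value (id INTEGER PRIMARY KEY, type TEXT, value BLOB)"
)

# Synthetic time-series-like payload, made up for the example.
payload = json.dumps(
    {
        "index": [f"t{i:04d}" for i in range(500)],
        "data": [i * 0.5 for i in range(500)],
    }
).encode()
con.executemany(
    "INSERT INTO parameter_value (type, value) VALUES (?, ?)",
    [("time_series", payload)] * 200,
)

# Time reading the blobs and decoding the JSON.
start = time.perf_counter()
query = "SELECT value FROM parameter_value WHERE type = ?"
parsed = [json.loads(value) for (value,) in con.execute(query, ("time_series",))]
elapsed = time.perf_counter() - start

print(f"parsed {len(parsed)} blobs in {elapsed:.4f} s")
```

Each parsed dict could then be turned into a Series with something like `pandas.Series(d["data"], index=d["index"])`, which would add its own cost on top of the JSON decoding measured here.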

OleMussmann commented 1 month ago

Copied from: https://github.com/ESI-FAR/Mopo-arrow_sqlite/issues/8#issuecomment-2261565312

Possible encodings for the index part of map types:

suvayu commented 3 weeks ago

I'm not sure if this is the correct place to document this. A possible starting point for a pyarrow-based implementation could be Antti's experimental branch:

I've now pushed the Apache Arrow spinedb_api branch to the Spine-Database-API repository on GitHub, as discussed in the Toolbox UX meeting. Its name is 353_apache_arrow if you want to pull and check it out. Don't expect it to do miracles, as it is mostly a proof-of-concept and a sandbox for me to get acquainted with Arrow. The branch naturally requires pyarrow in addition to the usual spinedb_api dependencies to work. The beef of the branch is the spinedb_api.arrow_value module and the from_database() function in it, which provides a drop-in replacement for spinedb_api.parameter_value.from_database(). arrow_value.from_database() returns Arrow types instead of the home-brewed types returned by parameter_value.from_database(). Note that currently only some scalar types plus arrays and maps are supported. Uneven maps are "flattened" to Arrow tables such that indices missing in some dimensions are marked as null in the table. tests.test_arrow_value contains some unit tests that may be helpful... Or not.
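To make the "flattening" of uneven maps concrete, here is a minimal pure-Python sketch of the idea. It is not the code from the 353_apache_arrow branch: the nested-dict input, the helper names and the `index_0`, `index_1`, `value` column names are all hypothetical.

```python
def max_depth(mapping: dict) -> int:
    """Number of index levels in the deepest branch of a nested dict."""
    nested = (max_depth(v) for v in mapping.values() if isinstance(v, dict))
    return 1 + max(nested, default=0)


def flatten(mapping: dict, depth: int, prefix: tuple = ()):
    """Yield one row per leaf value, padding missing index levels with None."""
    for key, value in mapping.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, depth, path)
        else:
            padding = (None,) * (depth - len(path))
            yield path + padding + (value,)


# An uneven map: "a" has two index levels, "b" only one.
uneven = {"a": {"x": 1.0, "y": 2.0}, "b": 3.0}

depth = max_depth(uneven)
rows = list(flatten(uneven, depth))

# Column-oriented layout; None marks an index missing in that dimension.
columns = {f"index_{i}": [row[i] for row in rows] for i in range(depth)}
columns["value"] = [row[-1] for row in rows]
print(columns)
```

With pyarrow installed, a dict of columns like this could be passed to `pyarrow.table(columns)`, where the `None` entries become nulls.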

suvayu commented 1 week ago

Another highly recommended profiler: austin-dist.