distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0

Consider a way to iterate over the contents of an `INDEX_MAP` without requiring the entire JSON document be loaded into memory #2505

Open portante opened 2 years ago

portante commented 2 years ago

While reviewing PR #2492, which switches to the streaming_bulk() helper API provided by the elasticsearch Python3 interface, it became clear that fetching all the documents which need to be updated requires the entire tracking set to be loaded into memory.

Since the goal of using the streaming_bulk() API is to allow smaller individual requests to an Elasticsearch instance, it would also be nice to keep the pbench server from loading potentially large data sets entirely into memory while streaming them out.

From the original comment in PR #2492 (_Originally posted by @portante in https://github.com/distributed-system-analysis/pbench/pull/2492#discussion_r732348376_):

How big will the contents of map be at times? If we have multiple publish APIs happening at the same time, do we need to worry about memory consumption here? I am wondering if getvalue can become a generator itself.
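
A minimal sketch of the generator idea, assuming the map can be presented as an iterable of (index, ids) pairs. The name `iter_update_actions` and the sample payload are hypothetical and not part of pbench; the action dictionaries follow the format accepted by the elasticsearch bulk helpers for update operations.

```python
from typing import Iterable, Iterator, Tuple


def iter_update_actions(
    index_docs: Iterable[Tuple[str, Iterable[str]]], doc: dict
) -> Iterator[dict]:
    """Yield one bulk "update" action per document ID.

    index_docs is any iterable of (index_name, document_ids) pairs; it can
    itself be a generator, so neither the map nor the action list has to be
    fully materialized.
    """
    for index, ids in index_docs:
        for doc_id in ids:
            yield {
                "_op_type": "update",
                "_index": index,
                "_id": doc_id,
                "doc": doc,  # partial document to merge (hypothetical payload)
            }
```

A generator like this can be passed as the actions argument of streaming_bulk(), so the full list of actions is never built. It only reduces memory use if the (index, ids) pairs are themselves produced lazily.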

dbutenhof commented 2 years ago

From the Jira planning note I wrote earlier:

STORY[S]: re-think document map metadata to allow consuming the map piecemeal (e.g., by index) rather than pulling in and managing the entire map in one JSON document. For example, one key might be a list of index names in the dataset, with separate keys for the list of documents in each index. The hierarchical structure of metadata would easily accommodate this: e.g., instead of “map”: {“index1”: [“id1”, “id2”], “index2”: [“id3”, “id4”] …}, something like “indices”: [“index1”, “index2”], “index1”: [“id1”, “id2”], and “index2”: [“id3”, “id4”] as separate metadata keys… we have to be a little careful about how we nest data because of the way the JSON metadata document is stored.
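
The split-key layout described above could be consumed roughly as follows. This is only a sketch under stated assumptions: `get_metadata` stands in for whatever per-key metadata accessor the server provides, and `fake_get_metadata`, `store`, and `fetched` are hypothetical names used purely for illustration.

```python
from typing import Any, Callable, Iterator, List, Tuple


def iter_index_ids(
    get_metadata: Callable[[str], Any], indices_key: str = "indices"
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (index_name, document_ids) one index at a time.

    Only the list of index names and a single index's ID list are fetched
    at any moment, instead of the whole map in one JSON document.
    """
    for index in get_metadata(indices_key):
        yield index, get_metadata(index)


# Hypothetical in-memory stand-in for the metadata store, using the
# separate-key layout from the planning note.
store = {
    "indices": ["index1", "index2"],
    "index1": ["id1", "id2"],
    "index2": ["id3", "id4"],
}
fetched: List[str] = []  # records which keys were actually read


def fake_get_metadata(key: str) -> Any:
    fetched.append(key)
    return store[key]
```

Because the function is a generator, the ID list for the second index is not read until the consumer has finished with the first. One caveat of this layout is that index names share a namespace with the key holding the list of indices, which is part of why the nesting needs care.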

Note that if we can get rid of the unit/legacy test sqlite3 DB, PostgreSQL supports native JSON column queries that would allow us to query nested fields of a SQL JSON column directly to better manage server memory, rather than reading the entire column value and pulling it apart or changing the way it's stored. (This is one of those "test strategy" things where we really need to get rid of the "mocked functional test" environment in favor of real isolated unit testing and full toolchain functional testing.)