Problem
The process of returning query results from execute() is memory-inefficient, because multiple intermediate copies of the result data are held in memory at the same time.
In the case of docs generate, we are sometimes querying for information about every column in a schema. This can mean that a million or more records are returned in more extreme cases, resulting in gigabytes of memory allocation. In this scenario, maintaining multiple copies of the results, even temporarily, is untenable.
Solution
Yield data rows one by one from process_results() rather than returning every row as a list. This eliminates one full copy of the result table. More could be done in this direction, but with this change alone I measured a 33% reduction in the memory associated with the get_catalog query.
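To illustrate the idea, here is a minimal sketch of the difference between building the full list and yielding rows lazily. The function names `process_results_eager` and `process_results_lazy`, their signatures, and the sample data are hypothetical and chosen for illustration only; they are not the code in this PR.

```python
from typing import Any, Dict, Iterable, Iterator, List, Sequence


def process_results_eager(
    column_names: Sequence[str], rows: Iterable[Sequence[Any]]
) -> List[Dict[str, Any]]:
    """Hypothetical 'before' version: builds every row dict up front,
    so a full extra copy of the result set exists at once."""
    return [dict(zip(column_names, row)) for row in rows]


def process_results_lazy(
    column_names: Sequence[str], rows: Iterable[Sequence[Any]]
) -> Iterator[Dict[str, Any]]:
    """Hypothetical 'after' version: yields one row dict at a time,
    so this step never holds more than the current row."""
    for row in rows:
        yield dict(zip(column_names, row))


# Illustrative data only (assumed shape: one tuple per result row).
columns = ["table_name", "column_name"]
raw_rows = [("orders", "id"), ("orders", "amount")]

lazy_rows = process_results_lazy(columns, raw_rows)
first = next(lazy_rows)      # only the first row has been converted so far
remaining = list(lazy_rows)  # the consumer decides when to materialize the rest
```

One caveat with this approach: a generator can be consumed only once, so any caller that needs to iterate the rows twice, or needs `len()`, has to materialize them itself.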
Checklist
[x] I have run this code in development, and it appears to resolve the stated issue
[x] This PR includes tests, or tests are not required/relevant for this PR
[x] This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX
resolves #218