NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Collection iteration optimization #115

Closed alexdunnjpl closed 2 months ago

alexdunnjpl commented 2 months ago

🗒️ Summary

Overhauls the ancestry sweeper: non-essential global state is eliminated in favor of iterating over collections when generating updates.

This greatly reduces peak memory usage, avoids the need to use HDD for swap, and allows processing to proceed incrementally even if the job is killed or otherwise fails.
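As a rough illustration of the idea (not the actual sweeper code), the sketch below generates updates one collection at a time with a generator and hands them off in small batches, so only one collection's worth of state is held at once. The names `generate_updates` and `batched`, the update field names, and the in-memory inputs are all hypothetical stand-ins for the real registry queries and writes. For simplicity it emits one update per (collection, product) pair.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical update shape: (product id, fields to write for that product).
Update = Tuple[str, Dict[str, List[str]]]


def generate_updates(
    collections: Iterable[Tuple[str, Iterable[str]]],
    bundles_by_collection: Dict[str, List[str]],
) -> Iterator[Update]:
    """Yield ancestry updates lazily, one collection at a time.

    Only the small collection -> bundles mapping is kept for the whole run;
    per-product state is produced and discarded as each collection is visited.
    """
    for collection_id, member_ids in collections:
        parent_bundles = bundles_by_collection.get(collection_id, [])
        for product_id in member_ids:
            yield product_id, {
                "parent_collections": [collection_id],
                "parent_bundles": list(parent_bundles),
            }


def batched(updates: Iterable[Update], size: int) -> Iterator[List[Update]]:
    """Group a stream of updates into lists of at most `size` items."""
    iterator = iter(updates)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Toy data standing in for registry query results.
    collections = [("c1", ["p1", "p2"]), ("c2", ["p3"])]
    bundles = {"c1": ["b1"]}
    for batch in batched(generate_updates(collections, bundles), size=2):
        # In a real job each batch would be written out before the next is
        # generated, so work already written survives a killed job.
        print(batch)
```

Because each batch is complete before the next collection is read, a job that is killed partway through loses at most the batch in flight.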

Squash when merging.

N.B. tqdm is introduced, but I've experienced some instability when running in debug mode in PyCharm. Investigation suggests this is platform-specific, so I'm keeping it in, and a future update should resolve it. It's something to be aware of, though: disabling the tqdm progress bar may be necessary during development.

This bug may be worked around by disabling the Cython speedup extension with the env var PYDEVD_USE_CYTHON=NO. This information has been added to the dev section of the readme.
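If the progress bar does need to be switched off during development, one possible approach is a small wrapper like the sketch below. It is not part of this PR: the `SWEEPERS_DISABLE_PROGRESS` variable and the `with_progress` helper are hypothetical names chosen for illustration, and the wrapper falls back to plain iteration when tqdm is not installed.

```python
import os
from typing import Iterable, Iterator, Mapping, TypeVar

T = TypeVar("T")

# Hypothetical switch; the repository does not define this variable.
DISABLE_VAR = "SWEEPERS_DISABLE_PROGRESS"


def progress_disabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when the (hypothetical) opt-out variable is set to a truthy value."""
    return env.get(DISABLE_VAR, "").strip().lower() in {"1", "true", "yes"}


def with_progress(items: Iterable[T], description: str) -> Iterator[T]:
    """Iterate over items, showing a tqdm bar unless it is disabled or unavailable."""
    if progress_disabled():
        yield from items
        return
    try:
        from tqdm import tqdm
    except ImportError:
        # tqdm is optional in this sketch; fall back to plain iteration.
        yield from items
        return
    yield from tqdm(items, desc=description)


if __name__ == "__main__":
    os.environ[DISABLE_VAR] = "1"  # e.g. set in a PyCharm debug configuration
    for collection_id in with_progress(["c1", "c2"], "collections"):
        print(collection_id)
```

This leaves the iteration logic untouched either way; only the display is affected.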

⚙️ Test Data and/or Report

One of the following should be included here:

♻️ Related Issues

related to #108