The pyQuARC tool reads and evaluates metadata records with a focus on the consistency and robustness of the metadata. pyQuARC flags opportunities to improve or add to contextual metadata information in order to help the user connect to relevant data products. pyQuARC also ensures that information common to both the data product and the file-level metadata is consistent and compatible. pyQuARC frees up human evaluators to make more sophisticated assessments, such as whether an abstract accurately describes the data and provides the correct contextual information. The base pyQuARC package assesses descriptive metadata used to catalog Earth observation data products and files. As open source software, pyQuARC can be adapted and customized by data providers to allow for quality checks that evolve with their needs, including checking metadata not covered in the base package.
Apache License 2.0
Implement Multithreading for Enhanced Performance in Custom Check Processing #284
This pull request introduces a multithreading solution to enhance the performance of custom check processing in the codebase. The existing codebase comprises various rules applied to multiple fields. During execution, it became evident that parallel processing is necessary at two levels: the field level and the argument level.
At the field level, a check needs to be performed for multiple fields simultaneously. Meanwhile, at the argument level, a check traverses through multiple arguments within a single field. Therefore, nested multithreading is required to fully improve the overall performance of the project.
By leveraging multithreading at both levels, we aim to parallelize the execution of checks, thereby significantly improving performance and efficiency, especially in scenarios involving a large number of checks or resolving URLs.
Changes:
- [ ] Introduces new functions `process_argument` in `custom_checker.py` and `_process_field` in `checker.py` to segregate the processes for parallel execution.
- [ ] Refactors the existing loop in the `run` function to use multithreading for parallel execution of custom checks.
- [ ] Improves code readability and maintainability by separating concerns and encapsulating logic into reusable functions.
- [ ] Adds documentation and comments explaining the multithreading implementation and its benefits.
- [ ] Adds error handling for the multithreaded modules.
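One way to handle errors in a multithreaded check run is to catch each future's exception individually, so a single failing check (e.g. an unreachable URL) is recorded rather than aborting the whole run. The helper below is a hedged sketch of that pattern; `run_with_error_capture` and its signature are hypothetical, not names from the codebase:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_error_capture(check, arg_sets):
    # Submit every argument set, then collect results and per-task
    # errors instead of letting one exception propagate and cancel
    # the remaining checks.
    results, errors = [], []
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(check, *args): args for args in arg_sets}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # Record which argument set failed and why.
                errors.append((futures[future], exc))
    return results, errors
```

Calling `future.result()` re-raises any exception from the worker thread, so wrapping that call is the natural place to centralize error handling.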
Testing:
Extensive testing has been conducted to ensure the correctness and performance of the multithreading solution.
Integration tests have been performed to validate the code's functionality in various scenarios and edge cases.
Impact:
This change significantly improves the performance of custom check processing, especially in scenarios involving a large number of checks or resolving URLs.
Improved average execution time from ~100 s to ~10 s.