@daniel-rincon-garcia thanks for profiling this for us! We will take a look; there must be some objects being left open.
@daniel-rincon-garcia in the interim, I would look at increasing your heap size to support your validation needs. We will let you know when we have a fix ready.
@al-niessner Registry / API work takes precedence, but this is top priority for validate
@jordanpadams
Yes, validate's memory leak is well known, and the rest of this reply is not pleasant to read either. validate is meant (in its current design and implementation) to be restarted for each batch of items; in other words, it leaks memory and exiting the process is the best cleanup. One user with 2+ million labels had to increase the heap size to 64 GB or more to get the job done. Making the heap large enough to accommodate the full job, or spawning validate as a new Java process, are unsatisfying options, but they are the shortest and most robust path to a solution.
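To make the "spawn validate as a new Java process" option concrete, here is a minimal sketch from a Java caller, assuming validate is available as a runnable jar; the jar path, heap size, and arguments are placeholders to adapt to your own setup:

```java
import java.util.ArrayList;
import java.util.List;

public class ValidateInFreshJvm {
    /**
     * Runs validate in its own JVM so that everything it retains is reclaimed
     * when that process exits. Jar location and -Xmx value are placeholders.
     */
    static int runValidate(List<String> validateArgs) throws Exception {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx8g");                          // size to the largest batch you run
        cmd.add("-jar");
        cmd.add("/path/to/validate-3.5.2.jar");     // placeholder path to the validate jar
        cmd.addAll(validateArgs);

        Process p = new ProcessBuilder(cmd)
                .inheritIO()                        // stream validate output to this console
                .start();
        return p.waitFor();                         // exit code from the validate run
    }
}
```

Because every cache dies with the child process, the parent application's heap stays flat no matter how many products it pushes through validate.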
The regression/unit tests make a minor attempt to clean up behind themselves (the clean-up calls you list), but that is only good enough to run through the suite of tests; in other words, those calls are not perfect and are no substitute for shutting validate down and restarting it. Part of the issue is that some third-party units, like schematron, keep their own permanent memory (which leaks across labels because it is never released), and that is outside the control of validate. Given the number of labels that some users process, any of these leaks will eventually become dominant and noticeable.
The predominant memory leak is validate's design philosophy. It assumes that all labels in a run belong to a cohort that needs to be cross-analyzed. There are many caches, such as the schematron cache, that exist to keep long operations (like downloading the schematron) from dominating run time. validate itself does the same with content from the labels: things like the lidvid and the URL to fetch it are retained. These get stored in process-lifetime lists that are never cleaned up, because if more labels are found those cross links are assumed to be necessary. validate therefore does not reclaim them until all labels have been seen, which is defined as validate exiting. Not a good design for your case, but it is the one that exists, and it is completely unambiguous about when to reclaim all memory.
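To illustrate the pattern described above, here is a deliberately simplified sketch (an illustration only, not validate's actual code): entries keyed into a process-lifetime cache are kept on the assumption that a later label may cross-reference them, so nothing short of process exit releases them.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only -- not validate's code. A process-lifetime cache like this
// grows with every label seen, because any future label might reference any
// entry, so entries are only released when the JVM exits.
public class CrossReferenceCache {
    private static final Map<String, String> LIDVID_TO_URL = new HashMap<>();

    public static void remember(String lidvid, String url) {
        LIDVID_TO_URL.put(lidvid, url);   // never evicted, never cleared
    }

    public static String lookup(String lidvid) {
        return LIDVID_TO_URL.get(lidvid);
    }
}
```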
Thank you both very much for the detailed explanation of the problem and for the suggestion to increase the heap size. For the moment, we restart the process each time to avoid the memory leak.
I wonder if some combination of switches could help in our case? In particular, I would note that, due to the "operational" way in which we are archiving, we:
It sounds like our use case could in principle avoid many of the issues above, if we use the right switches, and if those switches indeed avoid those issues at the software level?
Edit: The only thing I can see right now is --skip-context-reference-check, since by default the rule is already pds4.label, so perhaps there is not much to gain here?
Restart is the more robust approach, and it is also the preferred one. If you are trying to clear all caches for memory reasons, then restarting adds no appreciable start-up overhead on top of clearing the memory - you are going to reload everything from the network or disk either way.
Whether you download from the network or read from disk, the third-party libraries always cache. Reading from disk will always save you time because the bus is faster, but it makes no difference to memory usage.
The skip-reference-checking switch does exactly that and only that. It does not prevent the collection of the information used for reference checking, because that information is also needed for reporting and for aggregate labels. Again, it may save you some time but will have little, if any, impact on memory usage.
@msbentley @daniel-rincon-garcia we will also work to improve our documentation here to note that it is recommended to batch validate executions to avoid this out-of-memory issue. The size of the batch depends on how much memory you are able to allocate to the JVM.
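One rough way to pick a batch size empirically, assuming the memory retained per product is roughly constant (the helper below is hypothetical and not part of validate):

```java
public class BatchSizer {
    /**
     * Hypothetical helper: estimates how many products fit in one JVM run by
     * measuring the heap retained after validating a single product.
     * oneProduct stands in for whatever call your project makes per product.
     */
    static long estimateBatchSize(Runnable oneProduct) {
        Runtime rt = Runtime.getRuntime();
        long before = rt.totalMemory() - rt.freeMemory();  // used heap before one product
        oneProduct.run();
        long after = rt.totalMemory() - rt.freeMemory();   // used heap after (retained growth)
        long perProduct = Math.max(1, after - before);
        long headroom = rt.maxMemory() - after;
        return Math.max(1, headroom / perProduct / 2);     // keep a 2x safety margin
    }
}
```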
Checked for duplicates
Yes - I've already checked
🐛 Describe the bug
When NASA validate v3.5.2 is executed through a JAR library for a batch of products in a Java v17 project, I noticed that the heap memory is not released on each product validation iteration, so after the library has validated a number of products an OutOfMemoryError exception is raised.
I've seen a similar issue: https://github.com/NASA-PDS/validate/issues/826 but it seems that this is a bit different.
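For context, the shape of the loop involved is roughly the sketch below; the launcher type here is a stand-in (not our actual code), and only the processMain call mirrors the snippet in the To Reproduce section:

```java
import java.util.List;
import java.util.function.Supplier;

public class ValidationLoopSketch {
    // Stand-in for however the project wraps validate; only processMain(...)
    // mirrors the call quoted under "To Reproduce" below.
    interface ProductValidator {
        void processMain(String[] arguments);
    }

    static void validateAll(List<String[]> perProductArguments,
                            Supplier<ProductValidator> newLauncher) {
        for (String[] arguments : perProductArguments) {
            ProductValidator launcher = newLauncher.get(); // fresh launcher each iteration
            launcher.processMain(arguments);               // heap usage grows here
            launcher = null;                               // released, but memory is not reclaimed
        }
    }
}
```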
🕵️ Expected behavior
I'd expect the memory to be released on each product validation iteration, or at least to be released before the heap limit is reached, so as to avoid the out-of-memory exception.
📜 To Reproduce
Attached test data: pds4-validate-catalog.zip, local_context_products.json, cam_raw_sc_cam3_20210812t041402_61_f__t0005.zip
launcher.processMain(arguments);
and release the object:
🖥 Environment Info
📚 Version of Software Used
Version 3.5.2
🩺 Test Data / Additional context
VisualVM v1.4.3 was used to analyze the heap memory and, as can be seen, the heap memory is not released:
Finally, an out-of-memory exception happens.
🦄 Related requirements
🦄 #xyz
⚙️ Engineering Details
No response
🎉 Integration & Test
No response