NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

OutOfMemoryError when NASA validate v3.5.2 is executed through a library for a batch of products #979

Closed: daniel-rincon-garcia closed this issue 1 month ago

daniel-rincon-garcia commented 2 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When NASA validate v3.5.2 is executed through a JAR library for a batch of products in a Java 17 project, the heap memory is not released after each product validation iteration, so once the library has validated a number of products an OutOfMemoryError is raised.

java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
    at org.apache.catalina.core.StandardServer.startPeriodicLifecycleEvent(StandardServer.java:937)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.OutOfMemoryError: Java heap space

I've seen a similar issue (https://github.com/NASA-PDS/validate/issues/826), but this one seems a bit different.

🕵️ Expected behavior

I'd expect the memory to be released after each product validation iteration, or at least before the heap limit is reached, so that the out-of-memory exception is avoided.

📜 To Reproduce

  1. Create a Java 17 project that uses NASA validate v3.5.2 as a JAR library.
  2. Create a ValidateLauncher object and a String array to be used as the arguments for the ValidateLauncher object:
    final ValidateLauncher launcher = new ValidateLauncher();
    String[] arguments = {
                    "-C", "pds4-validate-catalog.xml",
                    "-e", "xml",
                    "-E", String.valueOf(Long.MAX_VALUE),
                    "-s", "json",
                    "--add-context-products", "local_context_products.json",
                    "-t", "cam_raw_sc_cam3_image_20210812t041402_61_f__t0005.xml",
                    "-r", "/tmp/PDS4_VALIDATION_REPORTS/pds_validate_report_bc_mtm_mcam_1724407870771_b296d73d-9536-4ac3-baee-9b40faf4fab3.json"
            };

    Attachments: pds4-validate-catalog.zip, local_context_products.json, cam_raw_sc_cam3_20210812t041402_61_f__t0005.zip

  3. Call launcher.processMain(arguments); then release the object:
            launcher.flushValidators();
            CrossLabelFileAreaReferenceChecker.reset();
  4. Repeat steps 2 and 3 several times and check the heap memory usage (you can use the VisualVM tool to analyze it); a complete loop sketch follows this list.
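
For reference, below is a minimal, self-contained sketch of the reproduction loop. The validate calls (processMain, flushValidators, CrossLabelFileAreaReferenceChecker.reset) are the ones listed in the steps above; the import package names, the iteration count, and the per-iteration report path are assumptions added for illustration.

    // Reproduction sketch; package names assumed from the validate source tree.
    import gov.nasa.pds.tools.validate.CrossLabelFileAreaReferenceChecker;
    import gov.nasa.pds.validate.ValidateLauncher;

    public class ValidateLeakRepro {
        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 1000; i++) { // arbitrary iteration count
                final ValidateLauncher launcher = new ValidateLauncher();
                String[] arguments = {
                        "-C", "pds4-validate-catalog.xml",
                        "-e", "xml",
                        "-E", String.valueOf(Long.MAX_VALUE),
                        "-s", "json",
                        "--add-context-products", "local_context_products.json",
                        "-t", "cam_raw_sc_cam3_image_20210812t041402_61_f__t0005.xml",
                        "-r", "/tmp/PDS4_VALIDATION_REPORTS/report_" + i + ".json" // hypothetical path
                };
                launcher.processMain(arguments);
                // Clean-up calls from step 3; as discussed below, they are not
                // sufficient to release validate's internal caches.
                launcher.flushValidators();
                CrossLabelFileAreaReferenceChecker.reset();
            }
            // Watch the heap in VisualVM while this runs: usage climbs across
            // iterations until an OutOfMemoryError is thrown.
        }
    }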

🖥 Environment Info

📚 Version of Software Used

Version 3.5.2

🩺 Test Data / Additional context

VisualVM v1.4.3 was used to analyze the heap memory and, as the graph shows, the heap memory is not released:

[heapMemory: VisualVM screenshot of heap usage]

Eventually, an out-of-memory exception happens.

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

jordanpadams commented 2 months ago

@daniel-rincon-garcia thanks for profiling this for us! we will take a look. there must be some objects being left open.

jordanpadams commented 2 months ago

@daniel-rincon-garcia in the interim, I would look to increase your heap size to support your validation needs. Will let you know when we have a fix ready.
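
For reference, heap headroom can be checked from the embedding application between validation iterations with standard java.lang.Runtime calls; a minimal sketch (nothing here is validate-specific):

    // Log heap usage between validation iterations to see how much headroom
    // remains before the -Xmx limit is hit.
    Runtime rt = Runtime.getRuntime();
    long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    long maxMb = rt.maxMemory() / (1024 * 1024);
    System.out.printf("heap used: %d MB of %d MB max%n", usedMb, maxMb);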

jordanpadams commented 2 months ago

@al-niessner Registry / API work takes precedence, but this is top priority for validate

al-niessner commented 2 months ago

@jordanpadams

Yes, validate's memory leak is well known, and the rest of this reply is not pleasant to read either. validate is meant (in its current design and implementation) to be restarted for each batch of items; meaning, it leaks memory, so exiting the process is the best cleanup. One user has 2+ million labels and had to increase the heap size to 64 GB or more to get the job done. Making the heap large enough to accommodate the full job, or spawning validate as a new Java process, is unsatisfying but the shortest and most robust path to a solution.

The regression/unit test makes a minor attempt to clean up behind itself (the clean-up calls you list), but it is only good enough to run through the suite of tests; meaning, those calls are not perfect and are no substitute for shutting validate down and restarting. Part of the issue is that some third-party units, like Schematron, have their own permanent memory (it leaks across labels because it is never released) that is outside the control of validate. Given the number of labels that some users process, any of these leaks will eventually become dominant/noticeable.

The predominant memory leak comes from validate's philosophy. It assumes that all labels in a launch belong to a cohort that needs to be cross-analyzed. There are lots of caches, such as the Schematron cache, that exist to prevent long operations (like downloading the Schematron) from dominating the run time. validate itself does the same with content from the label: the LIDVID and the URL to fetch it are retained. These then get stored in forever-lists that are never cleaned up, because if more labels are found those cross-links are thought to be necessary. Therefore, validate does not reclaim them until all labels are found, which is defined by validate exiting. Not a good design for you, but it is the one that exists, and it makes it clear, without ambiguity, when to reclaim all memory.
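
To illustrate the restart-per-run approach described above, here is a hypothetical sketch that launches each validate run in a fresh JVM so every cache is reclaimed when the process exits. The launch command, jar location, and -Xmx value are assumptions; substitute however you normally invoke validate:

    import java.util.ArrayList;
    import java.util.List;

    public class ValidateRunner {
        // Runs one validate invocation as a separate process. When the child
        // JVM exits, all of validate's caches and forever-lists are freed.
        public static int runOnce(List<String> validateArgs) throws Exception {
            List<String> cmd = new ArrayList<>();
            cmd.add("java");
            cmd.add("-Xmx4g"); // assumed heap size; tune per the advice above
            cmd.add("-jar");
            cmd.add("/opt/validate/validate-3.5.2.jar"); // assumed install path
            cmd.addAll(validateArgs);
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            return p.waitFor();
        }
    }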

daniel-rincon-garcia commented 2 months ago

Thank you both very much for the detailed explanation of the problem and for the suggested idea to increase the heap size. For the moment, we restart the process each time to avoid the memory leak.

msbentley commented 2 months ago

I wonder if some combination of switches could help in our case? In particular I would note that due to the "operational" way in which we are archiving we:

It sounds like our use case could in principle avoid many of the issues above, if we use the right switches, and if those switches indeed avoid those issues at the software level?

Edit: The only thing I can see right now is --skip-context-reference-check since by default the rule is already pds4.label, so perhaps there is not much to gain here?

al-niessner commented 2 months ago

Restart is the more robust approach. It is also preferred. If you are trying to clear all caches for memory performance, then restarting adds no appreciable startup overhead compared to clearing the memory: you are going to reload everything from network or disk either way.

Whether you download from the network or read from disk, the third-party libraries always cache. Reading from disk will always save you time because the bus is faster, but it makes no difference in memory usage.

The skip-reference-checking switch does exactly that and only that. It does not prevent the collection of the information used for reference checking, because that information is also needed for reporting and aggregate labels. Again, it may save you some time but will have little if any impact on memory usage.

jordanpadams commented 2 months ago

@msbentley @daniel-rincon-garcia we will work to improve our documentation here as well, to note that it is recommended to batch validate executions to avoid this out-of-memory issue. The size of each batch depends on how much memory you are able to allocate to the JVM.
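
As an illustration of that advice, a sketch that partitions the labels into fixed-size batches and runs one fresh validate process per batch (the batch size and flags are arbitrary assumptions, and runOnce is the hypothetical runner sketched earlier in this thread):

    import java.util.ArrayList;
    import java.util.List;

    public class BatchValidate {
        static final int BATCH_SIZE = 500; // tune to the heap you can give the JVM

        public static void validateAll(List<String> labelPaths) throws Exception {
            for (int i = 0; i < labelPaths.size(); i += BATCH_SIZE) {
                List<String> chunk =
                        labelPaths.subList(i, Math.min(i + BATCH_SIZE, labelPaths.size()));
                List<String> args = new ArrayList<>(List.of("-s", "json")); // your usual flags
                args.addAll(chunk); // labels passed as targets
                // Each batch runs in its own JVM (see the ValidateRunner sketch
                // above), so memory accumulated during the batch dies with it.
                ValidateRunner.runOnce(args);
            }
        }
    }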