NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

Slow performance with all content and product validation turned off #969

Closed by jordanpadams 1 month ago

jordanpadams commented 1 month ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I ran validate on a ~2M-product bundle with everything disabled except referential integrity checks, it still took 2 weeks.

Products: 2,013,873. Completed in 13.5 days (324.6 hours).

🕵️ Expected behavior

I expected the run to take less time.

📜 To Reproduce

  1. Get a large bundle.

Here's the command:

```
/path/validate-3.5.1/bin/validate \
    -target /path/msam2/annex_ehlmann_caltech_msl_msam2 \
    --report-file bundle.valrpt \
    -R pds4.bundle \
    --skip-content-validation --skip-product-validation
```

🖥 Environment Info

Linux OS m7l.2xlarge EC2 instance

📚 Version of Software Used

v3.5.1

🩺 Test Data / Additional context

No response

🦄 Related requirements

Tightly coupled with #931

⚙️ Engineering Details

Per initial analysis by @al-niessner, code block here:

  1. outer loop running over all targets
  2. inner loop running over all targets again, but over a different list:
     a. uses an `if` to check whether the variable from (1) equals the one from (2)
     b. cannot do a hash lookup, because of the misuse of the overly generic `Collection<>` instead of `Set<>`, `Map<>`, or something better suited to the task

This part of the job will take a while, and I have no way to estimate it for you. It does not seem to be stuck, though.
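For a sense of the shape of that problem, here is a minimal sketch of the quadratic scan described above and the usual hash-based fix. The class and method names are hypothetical, not the actual validate source:

```java
// Sketch only: illustrates the O(n*m) pattern from the analysis above and
// the Set-based fix; names here are illustrative, not validate's code.
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UnreferencedScan {

  // The reported shape: for every target, linearly scan another collection,
  // giving O(n * m) equality checks in total.
  static List<String> quadratic(Collection<String> targets, Collection<String> referenced) {
    List<String> unreferenced = new ArrayList<>();
    for (String target : targets) {
      boolean found = false;
      for (String ref : referenced) { // linear search repeated per target
        if (target.equals(ref)) {
          found = true;
          break;
        }
      }
      if (!found) {
        unreferenced.add(target);
      }
    }
    return unreferenced;
  }

  // The fix: copy the references into a Set once, then each membership
  // test is an O(1) hash probe, for O(n + m) overall.
  static List<String> linear(Collection<String> targets, Collection<String> referenced) {
    Set<String> lookup = new HashSet<>(referenced);
    List<String> unreferenced = new ArrayList<>();
    for (String target : targets) {
      if (!lookup.contains(target)) {
        unreferenced.add(target);
      }
    }
    return unreferenced;
  }
}
```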

🎉 Integration & Test

No response

al-niessner commented 1 month ago

Using the harvest 143 data set, findUnreferencedIdentifiers took 72953 ms to process 11139 products, or 6.55 ms per product.

The entire job took 1472223 ms, which means only about 5% of the time went to reference checking. Um, so what was validate doing for the other 95% of the time? With product and content validation skipped, what was validate doing for the other twenty-something minutes of a twenty-something-minute run? The point is, the reference checking may not be the problem.

Things to do to move forward:

  1. See what validate is doing that takes so much time when content and product validation are being skipped.
  2. Get a data set twice the size of the one currently being used (or even bigger) and see how the overall time changes along with the reference-checking time.
  3. Update the reference checking with an eye on performance improvement.

al-niessner commented 1 month ago

Loading and reading all of the XML takes about 951 seconds, and the reference checking 73 seconds, for 1024 seconds total. That leaves roughly 450 seconds (7.5 minutes), or 30% of the overall time, unaccounted for.

Looking at it from a per-product perspective, there are some outliers that are orders of magnitude away from the norm. The outliers look like downloads of schema and schematron (not verified, because that would take too much effort right now). Without removing the outliers, the median is 80 ms per product while the mean is 100 ms: a handful of 10000 ms downloads among thousands of 80 ms products pulls the mean up while leaving the median untouched. So the median is the one to think about.

Given 80 ms per product, expanding to 2 million products works out to about two days, give or take (arithmetic below). Allowing for machine performance differences: your earlier email said the preamble, which I estimate at 48 hours, took 30 hours. Completely believable. The 325 hours of reference checking that followed suggests a non-linear relationship in the reference checking. Moving on to improving that bit of code.
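As a quick sanity check on that extrapolation, using the 2,013,873-product count from the original report:

$$0.080\,\text{s} \times 2{,}013{,}873 \approx 1.6 \times 10^{5}\,\text{s} \approx 44.8\,\text{h} \approx 1.9\,\text{days}$$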

al-niessner commented 1 month ago

@jordanpadams

Okay, now I am way off into the weeds. This class is all messed up: https://github.com/NASA-PDS/validate/blob/main/src/main/java/gov/nasa/pds/tools/validate/Identifier.java

Not necessarily the fault of the coder. The bad concept that a lid maps to the latest lidvid is the beginning of the problem, and how Java uses hashCode() finishes the job. No Java Map or Set keyed on Identifier will work correctly: one has to guarantee that equals() and hashCode() are always in sync, and these two methods are not in sync here, which basically breaks all of the foundation blocks of Java collections. If an explicit lidvid had to be used all the time, this could be corrected; having a lid mean the latest lidvid makes that impossible.

That week of work you were telling us about: we probably need to fix this class first, then go in and clean up all the dependent pieces correctly from there.

Make hashCode() and equals() be based on toString(), then make nearEnough() carry the current equals() semantics. That should do the trick, after fixing all the current equals() call sites that actually need nearEnough().
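A minimal sketch of that suggestion, assuming Identifier holds a lid and an optional vid (the real class's fields and separator handling may differ):

```java
// Sketch of the suggested Identifier fix; field names are assumptions.
import java.util.Objects;

final class Identifier {
  private final String lid;
  private final String vid; // null when only a lid was given

  Identifier(String lid, String vid) {
    this.lid = Objects.requireNonNull(lid);
    this.vid = vid;
  }

  @Override
  public String toString() {
    return vid == null ? lid : lid + "::" + vid;
  }

  // equals() and hashCode() are both derived from toString(), so they can
  // never fall out of sync, and Identifier is safe as a HashMap/HashSet key.
  @Override
  public boolean equals(Object other) {
    return other instanceof Identifier && toString().equals(other.toString());
  }

  @Override
  public int hashCode() {
    return toString().hashCode();
  }

  // The old, looser equality (a bare lid matches any lidvid sharing that
  // lid) moves here, where callers that want it must ask for it by name.
  public boolean nearEnough(Identifier other) {
    if (vid == null || other.vid == null) {
      return lid.equals(other.lid);
    }
    return equals(other);
  }
}
```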

al-niessner commented 1 month ago

@jordanpadams @rgdeen

Okay, the time waster is not at all what I thought. This eats up 99% of the time (73 seconds vs. 100 ms for everything else). Tell me what you want done with the offending code; basically, can I delete it?

https://github.com/NASA-PDS/validate/blob/0d6f7cebffa03197bfefd54fcfbb74c9888ae6c9/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/FindUnreferencedIdentifiers.java#L75-L80

Yup, that is right: reporting the INFO message takes all of the time. I have not spent much time with the reporting code, but I have the vague notion that it uses lists to store things for collation and processing later. I am not sure why it takes so long, but if there are 12000 messages already and it is sorting or doing something odd on insert, I can see how the time per insert would really grow as the list grows long. The funniest part: I see the log.info calls but not these INFO messages in the output. So not only do they take forever, they are never seen.
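I have not dug into the real report code, so this is purely a hypothetical demonstration of that suspicion: keeping messages ordered in an ArrayList makes each insert O(n), so n messages cost O(n^2) overall, which would explain why cost balloons as the message list grows.

```java
// Hypothetical micro-demo of ordered inserts into a growing ArrayList;
// not validate's reporting code, just the suspected cost pattern.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedInsertCost {
  public static void main(String[] args) {
    List<Integer> messages = new ArrayList<>();
    long start = System.nanoTime();
    for (int i = 0; i < 100_000; i++) {
      Integer key = Integer.reverse(i);          // pseudo-random insert order
      int at = Collections.binarySearch(messages, key);
      messages.add(at < 0 ? -at - 1 : at, key);  // O(n) element shift each time
    }
    System.out.printf("100k ordered inserts took %d ms%n",
        (System.nanoTime() - start) / 1_000_000);
  }
}
```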

I am going to keep digging because this is just bad, but let me know what you want done with this.

al-niessner commented 1 month ago

Made progress on this one: sped up the unreferenced check by 10x. Since the problem was the ValidateTarget cache not being used, the fix requires more memory (hundreds of bytes per product) but saves the time spent re-creating all of the metadata associated with each target, i.e. repeatedly re-reading a product's XML file just to get its product type.
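A minimal sketch of that caching idea, assuming a hypothetical parse step keyed by a target's location (the real ValidateTarget and its cache plumbing differ):

```java
// Sketch only: parse each label at most once and reuse the result.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TargetCache {
  private final Map<String, ValidateTarget> cache = new ConcurrentHashMap<>();

  // Later lookups cost a hash probe (plus a few hundred bytes of retained
  // memory per product) instead of re-reading and re-parsing the XML.
  ValidateTarget get(String location) {
    return cache.computeIfAbsent(location, ValidateTarget::parseLabel);
  }
}

class ValidateTarget {
  final String location;
  final String productType; // metadata that previously forced an XML re-read

  private ValidateTarget(String location, String productType) {
    this.location = location;
    this.productType = productType;
  }

  static ValidateTarget parseLabel(String location) {
    // Stand-in for the real XML parse of the product label.
    return new ValidateTarget(location, "Product_Observational");
  }
}
```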

The next step in the performance clean-up is to fix Identifier and handle all the fallout from those changes.

al-niessner commented 1 month ago

@jordanpadams @rgdeen

Completed the second and third steps. What took 73 seconds at the beginning of this task now takes 700 ms, a 100x improvement. The overall time has dropped from 1500 seconds to 1200 seconds, which is a bigger drop than the 73 seconds from the unreferenced list alone. About 6 minutes are still locked up in ReferenceType. That matches the log-file analysis, which says each label has dropped to 65 ms and that roughly 400 seconds are missing from the log file, which seems to be the 6 minutes in ReferenceType. Diving in...

jordanpadams commented 1 month ago

@al-niessner

> Okay, the time waster is not at all what I thought. This eats up 99% of the time (73 seconds vs. 100 ms for everything else). Tell me what you want done with the offending code; basically, can I delete it?
>
> https://github.com/NASA-PDS/validate/blob/0d6f7cebffa03197bfefd54fcfbb74c9888ae6c9/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/FindUnreferencedIdentifiers.java#L75-L80
>
> Yup, that is right: reporting the INFO message takes all of the time. I have not spent much time with the reporting code, but I have the vague notion that it uses lists to store things for collation and processing later. I am not sure why it takes so long, but if there are 12000 messages already and it is sorting or doing something odd on insert, I can see how the time per insert would really grow as the list grows long. The funniest part: I see the log.info calls but not these INFO messages in the output. So not only do they take forever, they are never seen.
>
> I am going to keep digging because this is just bad, but let me know what you want done with this.

You can remove this info message.

al-niessner commented 1 month ago

It seems to be all cleaned up for this usage. Overall it is now more than twice as fast (down from 1500 seconds to 700 seconds), and the reference checking itself is about 100x faster. Give it a try when you can.

rgdeen commented 1 month ago

Wow, that's great! I suspect some of those speedups scale super-linearly too, so if you are twice as fast on a 25-minute run, you might be proportionally much faster than that on a 25-hour or 25-day run.

It takes a fair push to make that amount of data available for a larger test. Is it worth testing in the near term, or should we wait until the next time we need to use it (a couple of months, probably)? How would I even get the updated code?

jordanpadams commented 2 weeks ago

Skipping I&T. This is very hard to test, and developers have done rigorous regression testing here.