NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

Referential integrity check takes much longer than it seems it should #931

Open rgdeen opened 5 months ago

rgdeen commented 5 months ago

Checked for duplicates

No - I haven't checked

🐛 Describe the bug

"Bug" is a strong word, but it's the closest category.

I have a bundle (MSAM2) with ~2 million products in it. I can do product-level validation in parallel using KDP or Nucleus or other technology to farm it out to a bunch of nodes. However, referential integrity (verifying the inventory files are correct and match the files present) has to be done on the bundle as a whole - I'm not aware of any way to split that up. (maybe by collection, but there are only 2 relevant collections here so that doesn't help much).

In order to do this, I'm running with product and content validation turned off. But it is still taking an inordinate amount of time. As of this writing, it's been running 6 days and per the log has gotten through 1,047,476 out of 2,013,873 products - about halfway. That's a rate of about 2 per second. Seems like it should be able to do better in this case.

🕵️ Expected behavior

Well I expected what I got ;-) but I would hope the RI checks could be faster.

📜 To Reproduce

Here's the command line:

/path/to/msam2/validate-3.5.1/bin/validate -target /path/to/msam2/annex_ehlmann_caltech_msl_msam2 --report-file bundle.valrpt -R pds4.bundle --skip-content-validation --skip-product-validation

🖥 Environment Info

$ uname -a
Linux machine-name 3.10.0-1160.76.1.el7.x86_64 #1 SMP Tue Jul 26 14:15:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ java -version
java version "17.0.11" 2024-04-16 LTS
Java(TM) SE Runtime Environment (build 17.0.11+7-LTS-207)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.11+7-LTS-207, mixed mode, sharing)

📚 Version of Software Used

$ /mnt/pdsdata/scratch/rgd/msam2/validate-3.5.1/bin/validate -version

gov.nasa.pds:validate
Version 3.5.1
Release Date: 2024-05-25 17:45:47

Copyright 2019, by the California Institute of Technology ("Caltech").
All rights reserved.

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

🎉 Integration & Test

No response

al-niessner commented 5 months ago

@rgdeen Can you tell me if it is slowing over time? We previously fixed slowness for #824, which should be part of the validate you are using, and knowing if it is slowing may help shorten the fix time. Could be a red herring too.

al-niessner commented 5 months ago

@jordanpadams

Running using my 12000 label bundle from pixl and each label takes some amount of time - varies wildly depending on bundle - but after 500 labels in 1.5 minutes I paused it for laptop battery. I can run it until the end if we care when plugged into the wall. Point is 90 seconds for 500 is 5ish every second. I need to log it and see if it is slowing down.

However, food for thought. Even though product and content validation is being ignored, it is still opening and processing the label at some level because that is how it builds the reference list. In other words, some amount of LabelValidator is taking place to make sure that we can safely parse the label for the lidvid. I can verify how much but it probably includes schematron since that indicates XML sanity in a nutshell. If the labels are complicated and take some amount of time for minimal processing - even if just SAX parsing for the XPath lookups - then that could be the slowness. Watching my 500, some took a second and others flew by at tens per second. Have no idea what the differences are.

@rgdeen Can you attach or send me the log file as it is now? I can look in it to see if there is some helpful timing information. I get INFO messages that your log file may not have, but if it does I can see what is happening with label validation with respect to timing.

al-niessner commented 5 months ago

Okay, took 1617 seconds on my laptop to do 11140 labels. 7ish labels per second. It would still be 3.5 days for the annex_ehlmann_caltech_msl_msam2 bundle. The log file does not show a slowing trend over the 11000ish labels.

@jordanpadams Where do you want to go from here? I would need that whole bundle to see where the time is being spent. Given the Linux kernel is 3 series (see first comment in this ticket) it is quite possible the computer is quite old and the 3.5x difference is age of computer but I doubt it. More likely NFS or something like that if it is environmental at all. Probably just the complexity of the labels themselves is the 3.5x difference or, at the least, a significant contributor.

jordanpadams commented 5 months ago

@al-niessner now that you mention it, it very well may be NFS.

@rgdeen any way you can try running validate from the NFS mount versus from your home directory to see if that improves performance at all?

rgdeen commented 5 months ago

It's pdsimg-int1 which is an old machine and there are other things running on it, so that might explain some of it. Have you explored multithreading for this case? Could help significantly.

The label files themselves are huge and I don't think they'd fit on the home dir.

I don't know that any explicit action is needed here (unless you wanted to explore multithreading ;-) ), I just wanted to raise it as a concern.

jordanpadams commented 5 months ago

~@rgdeen we cannot multi-thread with the current Validate implementation because JAXB is not thread-safe~

@rgdeen it would be very costly to refactor and multi-thread Validate because Saxon is not thread-safe

rgdeen commented 5 months ago

wow in this day and age? :-O Prob not worth the trouble then to read all the xml's in one thread and do processing in another. I'm surprised there's not a thread-safe implementation.
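
For illustration only, the "read all the XMLs in one thread and do processing in another" idea mentioned above can be sketched as a bounded producer/consumer queue. This is not how Validate works (Validate is Java, and nothing here touches Saxon or JAXB); it is a minimal Python sketch using only the standard library. The function names and the `max_pending` parameter are hypothetical, and the PDS4 namespace and `Identification_Area` element paths are assumptions about a typical label.

```python
import queue
import threading
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumption: labels use the common PDS4 namespace.
PDS_NS = {"pds": "http://pds.nasa.gov/pds4/pds/v1"}
_DONE = object()  # sentinel telling the consumer there is no more work


def read_labels(paths, out_q):
    """Producer: do the file I/O and hand raw XML text to the queue."""
    try:
        for path in paths:
            out_q.put((str(path), Path(path).read_text(encoding="utf-8")))
    finally:
        # Always signal completion so the consumer cannot wait forever.
        out_q.put(_DONE)


def extract_lidvids(in_q, results):
    """Consumer: parse each label and record its LIDVID (None on failure)."""
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        name, xml_text = item
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            results[name] = None
            continue
        lid = root.findtext(
            "pds:Identification_Area/pds:logical_identifier", namespaces=PDS_NS)
        vid = root.findtext(
            "pds:Identification_Area/pds:version_id", namespaces=PDS_NS)
        results[name] = f"{lid}::{vid}" if lid and vid else None


def collect_lidvids(paths, max_pending=100):
    """Run one reader thread and one parser thread over the given labels."""
    q = queue.Queue(maxsize=max_pending)  # bounded, so reads cannot run far ahead
    results = {}
    reader = threading.Thread(target=read_labels, args=(paths, q))
    worker = threading.Thread(target=extract_lidvids, args=(q, results))
    reader.start()
    worker.start()
    reader.join()
    worker.join()
    return results
```

The bounded queue is the design point: it lets slow I/O (for example NFS reads) overlap with parsing without holding every label in memory at once. Whether this would help Validate in practice depends on where the time actually goes, which this thread does not establish.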

rgdeen commented 5 months ago

FYI, my job died with an out-of-memory error after 11 days and 1.9M out of 2M products. Separate issue. But anyway, I started it again, but this time on a "m7i.large" ec2 instance. It's running at a rate of about 1100/min (18/sec), which means a rerun time of about 30 hours. That's not entirely unreasonable.

So I guess we can chalk this up to underpowered on-prem machine.

Threaded execution sure would be nice ;-) but sounds like that's an issue for another day.
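
For reference, the throughput figures quoted in this thread can be put side by side with a few lines of arithmetic. The sketch below uses only numbers stated in the comments above (product counts, elapsed times, and the 2,013,873-product bundle size); the dictionary labels and helper names are made up for illustration, and the projection assumes each rate stays constant over the whole run.

```python
TOTAL_PRODUCTS = 2_013_873  # bundle size reported in the log
SECONDS_PER_DAY = 86_400


def rate_per_second(products, seconds):
    """Average throughput in products per second."""
    return products / seconds


def days_for_bundle(rate, total=TOTAL_PRODUCTS):
    """Projected wall-clock days for the full bundle at a constant rate."""
    return total / rate / SECONDS_PER_DAY


runs = {
    # 1,047,476 products after 6 days on the on-prem machine
    "on-prem": rate_per_second(1_047_476, 6 * SECONDS_PER_DAY),
    # 11,140 labels in 1617 seconds on the laptop test bundle
    "laptop": rate_per_second(11_140, 1617),
    # about 1100 products per minute on the m7i.large instance
    "m7i.large": rate_per_second(1100, 60),
}

for name, rate in runs.items():
    print(f"{name}: {rate:.1f}/s, "
          f"{days_for_bundle(rate):.1f} days for the full bundle")
```

The laptop and on-prem numbers come from different bundles, so the comparison is only a rough guide to how much of the gap is hardware and how much is label complexity.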

al-niessner commented 5 months ago

@rgdeen

Unless you edited the validate script, it will still fail within 30 hours. You need to change the -Xmx4096m to a bigger number.

rgdeen commented 5 months ago

I did... ;-)