NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

validate fails to process large data file #327

Closed msbentley closed 3 years ago

msbentley commented 3 years ago

🐛 Describe the bug

When trying to validate a large (4.8 GB) data file, validate does not crash, but appears to produce no output: after 15 minutes, no "dots" or other progress indicators are displayed, yet CPU usage remains high.

📜 To Reproduce

Steps to reproduce the behavior:

  1. Try to validate (with content validation) a large file

🕵️ Expected behavior

Some indication of progress should be given (this is especially important for large files, and the chunk size must be appropriate to the expected duration of the validation)

📚 Version of Software Used

validate 2.0.2

🩺 Test Data / Additional context

Test data is proprietary and too large to upload here, but can be provided offline if required.

🖥 System Info


Most likely related to https://github.com/NASA-PDS/pds4-jparser/pull/31/files

qchaupds commented 3 years ago

@msbentley Mark, can you tell us the file extension of the file in question, so we can target our development? The issue was fixed for some products, but not all, back in November 2020. The one we are currently creating a test file for is .csv, and it will take some time to create, as there isn't one that large in our test artifacts. The one created yesterday is not a true .csv file: only the first 5 records had a carriage return/line feed, the rest were random data, and to perform CSV validation we need all records to comply.

msbentley commented 3 years ago

Hi @qchaupds - this is actually a rather complicated file, which we are trying to discourage, but at the moment it is what we have: the label describes 26322 (!) Table_Character tables, each containing 120 records, each with a group field repetition (160 times) of a single field. The rationale was to encode a cube in ASCII, I believe. It is only a secondary data format, since the primary data are in an array, but the team wanted an ASCII alternative as well (and since arrays can only use binary types...). So we have 26k frames from a 120x160 detector, encoded as ASCII_Integer.

I could hazard a guess that it is the number of Table_Character instances that is throwing validate, rather than the sheer size of the file...?
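
For orientation, the layout described above corresponds roughly to a PDS4 label fragment like the following. This is a hand-written sketch, not the actual label: element values come from the description in this comment, and offsets, field names, and other required attributes are omitted.

```xml
<!-- Sketch: one of the 26322 Table_Character instances described above.
     Offsets, field names, and namespaces omitted; values hypothetical
     except where stated in the comment. -->
<Table_Character>
  <records>120</records>
  <Record_Character>
    <fields>0</fields>
    <groups>1</groups>
    <Group_Field_Character>
      <repetitions>160</repetitions>
      <fields>1</fields>
      <groups>0</groups>
      <Field_Character>
        <data_type>ASCII_Integer</data_type>
      </Field_Character>
    </Group_Field_Character>
  </Record_Character>
</Table_Character>
```

With 26322 such tables in one label, the label itself becomes very large, which may matter as much as the size of the data file.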

qchaupds commented 3 years ago

@msbentley Hi Mark, can you tell us the file extension you are using? Is it ".csv" or ".tab"? Validate uses the file extension to select the correct validator. The one we are trying out now is a 4 GB .csv file, and it is having good success so far.

qchaupds commented 3 years ago

Also @msbentley, in order for us to test the file you have, we may need a copy of it on our system.

We can try to test with files we create, but we won't know for sure until we try the actual file that is failing for you.

qchaupds commented 3 years ago

@msbentley Since I don't have access to your test file, I would suggest doing the following in the meantime:

Reduce the number of records to just a few, and run validate against that. If that works, then we know it is not the format of the file that is failing. Once we have a build you can use, you can run validate against the larger file.
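
That truncation step could look like the following minimal sketch. File names and sizes are hypothetical stand-ins for the real 4.8 GB product, and note that the label's record count (and any file_size/checksum) must be edited to match the cut-down file before running validate.

```shell
# Stand-in for the real large table file (names and sizes hypothetical).
seq 1 1000 > full_table.csv

# Keep only the first 120 records for a quick validate run.
head -n 120 full_table.csv > small_table.csv

# Reminder: the label's <records> value (and any file_size/checksum)
# must be updated to match the truncated file before validating.
wc -l < small_table.csv    # 120 records remain
```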

In the next few days, we will try to create a more realistic .csv file with more valid records for validate to read.

qchaupds commented 3 years ago

Here's a recent successful run on a 3 GB .csv file:

```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 166 % validate -R pds4.label --skip-context-validation -r report_github326_label_valid_without_skip.json -s json -t /data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
[main] INFO gov.nasa.pds.tools.label.LocationValidator - location file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
[main] INFO gov.nasa.pds.tools.label.LocationValidator - Using validation style 'PDS4 Label' for location file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
[main] INFO gov.nasa.pds.tools.validate.task.ValidationTask - Starting validation task for location 'file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml'
[main] INFO gov.nasa.pds.tools.validate.rule.pds4.TableFieldDefinitionRule - Label file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml does not contain any fields pertaining to Product_Observational/File_Area_Observational/Table_Character or Product_Observational/File_Area_Observational/Table_Character/Record_Character to valid ASCII field formats on
............................................................................................................................................................................................................................................................................................................................................................................................
[main] INFO gov.nasa.pds.tools.validate.task.ValidationTask - Validation complete for location 'file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml'
LabelUtil:reduceInformationModelVersions
Completed execution in 335148 ms
```

```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 167 % egrep "status|label" report_github326_label_valid_without_skip.json
    "ruleType": "pds4.label",
    "status": "PASS",
    "label": "file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml",
```

```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 168 % grep file_name /data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
<file_name>very_large_file.csv</file_name>
```

```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 171 % grep records /data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
<records>38047299</records>
```

```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 169 % ls -l /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
-rw-r--r-- 1 qchau pds 3119878518 Apr 22 16:49 /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 170 %
```

The file has 38 million records:

```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 170 % wc -l /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
38047299 /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
```

The run took about 15 minutes.

qchaupds commented 3 years ago

@msbentley Hi Mark, please drop me an email at Qui.T.Chau@jpl.nasa.gov and I'll send you the two files: one zip and one jar file. They are too big to include here.

If the email system does not work, we may have to come up with another method.

qchaupds commented 3 years ago

As of this afternoon, we have determined that because of the size of the label, everything takes much longer. Validating the file_size and checksum, plus the Schematron validation, took about an hour.

The label from Mark had 869040 lines, which affected everything that involves reading in the XML: parsing it, building a tree, searching for tags, etc.
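
As a rough illustration, the label's line count and the number of Table_Character instances are quick first-order gauges of how much XML work validate faces. A sketch using a tiny stand-in label (real paths and counts will differ):

```shell
# Build a tiny stand-in label (the real one had 869040 lines).
printf '%s\n' \
  '<Product_Observational>' \
  '  <Table_Character>...</Table_Character>' \
  '  <Table_Character>...</Table_Character>' \
  '</Product_Observational>' > stand_in_label.xml

# Line count: a first-order predictor of XML parse/tree-build cost.
wc -l < stand_in_label.xml

# Number of Table_Character instances the validator must walk.
grep -c '<Table_Character' stand_in_label.xml
```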

All issues with calculating the file_size, checksum, and record counts for very large files have been resolved now.

tloubrieu-jpl commented 3 years ago

@jordanpadams @qchaupds need to check whether this ticket can be closed

msbentley commented 3 years ago

I tested my large file with validate 2.0.3 and it works fine!

jordanpadams commented 3 years ago

Great news. Thanks @msbentley