Closed msbentley closed 3 years ago
@msbentley Mark, can you tell us the file extension of the file in question so we can target our development? The issue was fixed for some products but not all back in November of 2020. The one we are currently creating a test file for is .csv and it will take some time to create as there isn't one that large in our test artifacts. The one created yesterday is not a true .csv file as only the first 5 records had a carriage/line feed, the rest were random data and to perform a validation for CSV, we need to have all records comply.
Hi @qchaupds - this is actually a rather complicated file, of a kind we are trying to discourage, but at the moment it is what we have: the label describes 26322 (!) Table_Character tables, each containing 120 records, each record with a group field repetition (160 times) of a single field. The rationale was to encode a cube in ASCII, I believe. It is only a secondary data format, since the primary data are in an array, but the team wanted an ASCII alternative as well (and since arrays can only use binary types...). So we have 26k frames from a 120x160 detector, encoded as ASCII_Integer.
I could hazard a guess that it is the number of Table_Character instances that is throwing validate, rather than the sheer size of the file...?
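One quick way to check that guess would be to count the Table_Character instances in the label directly. Here is a minimal sketch (the label content below is a hypothetical stand-in, with namespaces and most PDS4 structure omitted) using a streaming parser, which avoids building the full tree for very large labels:

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in label (hypothetical content; a real PDS4 label carries
# namespaces and far more structure, all ignored here for brevity).
label = io.BytesIO(b"""<Product_Observational>
  <File_Area_Observational>
    <Table_Character/><Table_Character/><Table_Character/>
  </File_Area_Observational>
</Product_Observational>""")

# iterparse streams the document instead of building the whole tree,
# which matters for labels with hundreds of thousands of lines.
count = 0
for _, elem in ET.iterparse(label, events=("end",)):
    if elem.tag.endswith("Table_Character"):
        count += 1
    elem.clear()  # release each element once counted
print(count)  # 3
```

Run against the real label, a count in the tens of thousands versus a handful would show whether instance count or file size is the variable to test.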
@msbentley Hi Mark, can you tell us the file extension you are using? Is it ".csv" or ".tab"? Validate uses the file extension to select the correct validator. The one we are trying out now is a .csv file (4GB) and is having good success so far.
Also @msbentley, in order for us to test the file you have, we may need a copy of it on our system.
We can try to test with files we create but won't know until we try on the actual file it is failing for you.
@msbentley Since I don't have access to your test file, I would suggest the following in the meantime:
Reduce the number of records to just a few, and run validate against that. If that works, then we know that it is not the format of the file that is failing. When we have the software you can use, you can run validate against the larger file.
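As a sketch of that first step (file names are hypothetical, and a small synthetic file stands in for the real multi-GB one), keeping only the first few records is enough for a format check:

```python
import itertools

# Build a small stand-in for the real multi-GB CSV (hypothetical name
# and contents, for illustration only).
with open("very_large_file.csv", "w") as f:
    for i in range(1000):
        f.write(f"frame,{i}\n")

# Keep only the first 100 records; validate can then check the record
# format quickly without reading the whole file.
with open("very_large_file.csv") as src, open("small_sample.csv", "w") as dst:
    dst.writelines(itertools.islice(src, 100))
```

The label's record count would of course need to be adjusted to match the truncated file before running validate against it.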
In the next few days, we will try to create more realistic .csv files with more valid records for validate to read.
Here's a recent successful run of a 3GB .csv file:
```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 166 % validate -R pds4.label --skip-context-validation -r report_github326_label_valid_without_skip.json -s json -t /data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
[main] INFO gov.nasa.pds.tools.label.LocationValidator - location file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
[main] INFO gov.nasa.pds.tools.label.LocationValidator - Using validation style 'PDS4 Label' for location file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
[main] INFO gov.nasa.pds.tools.validate.task.ValidationTask - Starting validation task for location 'file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml'
[main] INFO gov.nasa.pds.tools.validate.rule.pds4.TableFieldDefinitionRule - Label file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml does not contain any fields pertaining to Product_Observational/File_Area_Observational/Table_Character or Product_Observational/File_Area_Observational/Table_Character/Record_Character to valid ASCII field formats on
............................................................................................................................................................................................................................................................................................................................................................................................
[main] INFO gov.nasa.pds.tools.validate.task.ValidationTask - Validation complete for location 'file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml'
LabelUtil:reduceInformationModelVersions
Completed execution in 335148 ms
```
```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 167 % egrep "status|label" report_github326_label_valid_without_skip.json
"ruleType": "pds4.label",
"status": "PASS",
"label": "file:/data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml",
```
```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 168 % grep file_name /data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 171 % grep records /data/home/pds4/qchau/test_artifacts/github326/spectra_data_collection_inventory.xml
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 169 % ls -l /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
-rw-r--r-- 1 qchau pds 3119878518 Apr 22 16:49 /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
```
The file has 38 million records
```
{pds-dev3.jpl.nasa.gov}/home/qchau/sandbox/validate 170 % wc -l /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
38047299 /data/home/pds4/qchau/test_artifacts/github326/very_large_file.csv
```
The run took about 15 minutes.
@msbentley Hi Mark, please drop me an email at Qui.T.Chau@jpl.nasa.gov and I'll send you the 2 files: 1 zip and 1 jar file. They are too big to include here.
If the email system does not work, we may have to come up with another method.
As of this afternoon, we have determined that because of the size of the label, everything takes much longer to do. Validating the file_size and checksum, plus running the Schematron validation, took about an hour.
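For reference, the checksum part of that work scales linearly with file size, and reading in fixed-size chunks keeps memory use constant regardless of how large the file is. A minimal sketch (function and file names are hypothetical; MD5 is assumed here since PDS4 labels record an md5_checksum):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Compute an MD5 checksum by streaming the file in 1 MiB chunks,
    so memory stays constant even for multi-GB files (sketch only)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage on a small stand-in file (hypothetical name):
with open("sample.dat", "wb") as f:
    f.write(b"x" * 3_000_000)
print(file_md5("sample.dat"))
```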
The label from Mark had 869040 lines, and its size affected everything that has to do with reading in the XML, parsing it, building a tree from the XML, searching for tags, etc.
All issues with calculating the file_size, checksum, and record counts for very large files have been resolved now.
@jordanpadams @qchaupds need to check if this ticket is closed
I tested my large file with validate 2.0.3 and it works fine!
Great news. Thanks @msbentley
🐛 Describe the bug
When trying to validate a large (4.8 GB) data file, validate does not crash, but appears to show no output - after 15 minutes, no "dots" or other output is displayed, but CPU usage is still high.
📜 To Reproduce
Steps to reproduce the behavior:
🕵️ Expected behavior
Some indication of progress is given (this is especially important for large files, and the chunk size must be appropriate to the expected duration of the validation)
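One way the requested behavior could be sketched (names and chunk size here are hypothetical, not validate's actual implementation) is emitting one progress mark per fixed-size chunk of records:

```python
import sys

def validate_records(records, chunk_size=10):
    """Run a per-record check, printing one dot per chunk_size records
    so long validations show visible progress (sketch only)."""
    for i, rec in enumerate(records, 1):
        # ... per-record validation would happen here ...
        if i % chunk_size == 0:
            sys.stdout.write(".")
            sys.stdout.flush()

validate_records(range(25))  # prints ".."
```

In practice the chunk size would be tied to the expected total record count, so that even a multi-hour run emits marks at a useful rate.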
📚 Version of Software Used
validate 2.0.2
🩺 Test Data / Additional context
Test data is proprietary and too large to upload here, but can be provided offline if required.
🖥 System Info
Most likely related to https://github.com/NASA-PDS/pds4-jparser/pull/31/files