NASA-PDS / validate

Validates PDS4 product labels, data and PDS3 Volumes
https://nasa-pds.github.io/validate/
Apache License 2.0

Validate error reading tables > 2GiB #189

Closed sslavney closed 3 years ago

sslavney commented 4 years ago

Describe the bug

Validate 1.18.2 gives the error message "ERROR [error.table.bad_file_read] table 4: Error occurred while trying to read table: null" when running content validation on a large data file (2.5 GB) containing multiple binary tables.

To Reproduce

Steps to reproduce the behavior:

  1. Download the data file and its label from https://pds-geosciences.wustl.edu/messenger/mess-h-rss_mla-5-sdp-v1/messrs_1001/data/shbdr/jgmess_160av01_shb.dat and https://pds-geosciences.wustl.edu/messenger/mess-h-rss_mla-5-sdp-v1/messrs_1001/data/shbdr/jgmess_160av01_shb.xml.

  2. Run Validate version 1.18.2 with this command: validate jgmess_160av01_shb.xml -R pds4.label -v2 -r validate_output.txt

  3. This is what appears on the screen: "Feb 20, 2020 10:54:06 AM com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector INFO: The optimized code generation is disabled"

  4. This is the error message in the output file: "ERROR [error.table.bad_file_read] table 4: Error occurred while trying to read table: null" The complete output file is attached (validate182_shbdr_error.txt).

Expected behavior

I expected the contents of the file to be valid because they appear to be correct in the PDS4 Viewer, and because they appear to be correct in NASAView when read via a PDS3 label. (This is a migrated MESSENGER product that has both a PDS3 and a PDS4 label.)

Version of Software Used

Version 1.18.2

Additional context

When run without content validation, Validate reports no errors for this product.

Related to NASA-PDS-Incubator/transform#2

jordanpadams commented 4 years ago

@sslavney apologies here. Validate has a known issue reading files larger than 2 GB. We will work to update the software to better handle this error and hopefully come up with a solution.

josinde commented 4 years ago

Same issue for us using an older version (1.16.0-20190718-e5b39a1) and we confirm this is an issue with very large files. We are now integrating release 1.20.0. Do you want us to check this issue against this other version?

2020-02-05 04:05:19 INFO > Executing job with groupId = Packaging and jobId = ProductPackagerJob_1577bbfd-68f0-4d6f-b419-c69de6fa854c
2020-02-05 04:05:19 INFO Package temporary folder created: /tmp/em16_packager_12473380658639804638/em16psa-pds4-pi-01-em16_tgo_acs-20200205T030519516
2020-02-05 04:05:32 INFO Using validation style 'PDS4 Label' for location file:/tmp/em16_packager_12473380658639804638/em16psa-pds4-pi-01-em16_tgo_acs-20200205T030519516/em16_tgo_acs/Commissioning_and_Verification/acs_raw_sc_tir_20180319T115959-20180319T155944-1425-1.xml
2020-02-05 04:05:32 INFO Starting validation task for location 'file:/tmp/em16_packager_12473380658639804638/em16psa-pds4-pi-01-em16_tgo_acs-20200205T030519516/em16_tgo_acs/Commissioning_and_Verification/acs_raw_sc_tir_20180319T115959-20180319T155944-1425-1.xml'
2020-02-05 04:07:31 INFO Validation complete for location 'file:/tmp/em16_packager_12473380658639804638/em16psa-pds4-pi-01-em16_tgo_acs-20200205T030519516/em16_tgo_acs/Commissioning_and_Verification/acs_raw_sc_tir_20180319T115959-20180319T155944-1425-1.xml'
2020-02-05 04:07:31 INFO PDS Validation Tool 1.16.0-20190718-e5b39a1 Report:

FAIL: file:/tmp/em16_packager_12473380658639804638/em16psa-pds4-pi-01-em16_tgo_acs-20200205T030519516/em16_tgo_acs/Commissioning_and_Verification/acs_raw_sc_tir_20180319T115959-20180319T155944-1425-1.xml

ERROR: error.table.bad_file_read - Error occurred while trying to read table: capacity < 0: (-1395781509 < 0). file:/tmp/em16_packager_12473380658639804638/em16psa-pds4-pi-01-em16_tgo_acs-20200205T030519516/em16_tgo_acs/Commissioning_and_Verification/acs_raw_sc_tir_20180319T115959-20180319T155944-1425-1-BBBB.tab (line = 0, column = 0)
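
The negative capacity in this message is consistent with a byte count larger than `Integer.MAX_VALUE` having been narrowed to a 32-bit `int`. As a rough sketch, and assuming the value wrapped around exactly once (the thread does not state the size of this file), the original byte count can be recovered by reinterpreting the `int` as unsigned. `CapacityDecoder` is a hypothetical helper for illustration, not part of Validate or PDS4-JParser.

```java
// Hypothetical helper for illustration; not part of Validate or PDS4-JParser.
final class CapacityDecoder {

    /**
     * Reinterprets a wrapped 32-bit capacity as the unsigned byte count it
     * most plausibly came from (assumes a single wraparound).
     */
    static long originalSize(int wrappedCapacity) {
        return Integer.toUnsignedLong(wrappedCapacity);
    }

    public static void main(String[] args) {
        // Value taken from the error message above.
        long size = originalSize(-1395781509);
        System.out.println(size + " bytes"); // prints "2899185787 bytes"
    }
}
```

Under that assumption the table file would be about 2.7 GiB, which fits the "very large files" description.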

josinde commented 4 years ago

Confirmed also as an issue for PDS Validation Tool 1.20.0. Cheers

jordanpadams commented 4 years ago

@sslavney @josinde as a note, this directly relates to NASA-PDS-Incubator/transform#2 which also uses the underlying PDS4-JParser library. We will add this to our release plan for next build as this may require some significant overhaul of the underlying library

msbentley commented 4 years ago

Is there a "known" limit to the size of files that validate can handle? i.e. is it exactly 2GB? Just to be aware and programmatically skip such files if needs be (or disable content validation)

Also, does it affect all data types (i.e. Table_Character and Table_Binary as well etc.)?
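
For the workaround described here, a pre-check on file size might look like the sketch below. The thread never confirms an exact limit, so the threshold is an assumption: the stack trace posted later in this thread points at `java.nio.ByteBuffer.allocate`, whose capacity parameter is an `int`, which suggests `Integer.MAX_VALUE` (2,147,483,647 bytes) as a plausible cutoff. `LargeFileGuard` and its method names are hypothetical and not part of Validate.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pre-check for illustration; not part of Validate.
final class LargeFileGuard {

    // Assumed threshold: the largest capacity an int-sized buffer can express.
    static final long ASSUMED_LIMIT_BYTES = Integer.MAX_VALUE;

    /** Returns true when the byte count is above the assumed limit. */
    static boolean exceedsAssumedLimit(long sizeInBytes) {
        return sizeInBytes > ASSUMED_LIMIT_BYTES;
    }

    /** Returns true when content validation should be skipped for this data file. */
    static boolean shouldSkipContentValidation(Path dataFile) throws IOException {
        return exceedsAssumedLimit(Files.size(dataFile));
    }
}
```

A pipeline could call `shouldSkipContentValidation` on each data file referenced by a label and run label-only validation for the ones that return true. If the real limit turns out to depend on table size rather than file size, the check would need to use the table's byte count instead.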

jordanpadams commented 4 years ago

@mcayanan do you remember the details for this 2GB cap? or is it more ~2GB?

mcayanan commented 4 years ago

@jordanpadams Unfortunately no. I would recommend printing a stack trace in the code at

https://github.com/NASA-PDS-Incubator/validate/blob/5f3b28c76a8f87787d6a502e94223294d83a6802/src/main/java/gov/nasa/pds/tools/validate/rule/pds4/TableDataContentValidationRule.java#L204

to see where the error is coming from. That might jog my brain cells. :)

jordanpadams commented 4 years ago

@mcayanan

$ /Users/jpadams/Documents/proj/pds/pdsen/workspace/validate/validate-1.21.0-SNAPSHOT/bin/validate -t jgmess_160av01_shb.xml
...................................................java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at gov.nasa.pds.objectAccess.ByteWiseFileAccessor.<init>(ByteWiseFileAccessor.java:126)
    at gov.nasa.pds.objectAccess.TableReader.<init>(TableReader.java:126)
    at gov.nasa.pds.tools.validate.content.table.RawTableReader.<init>(RawTableReader.java:61)
    at gov.nasa.pds.tools.validate.rule.pds4.TableDataContentValidationRule.validateTableDataContents(TableDataContentValidationRule.java:189)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at gov.nasa.pds.tools.validate.rule.AbstractValidationRule.execute(AbstractValidationRule.java:63)
    at org.apache.commons.chain.impl.ChainBase.execute(ChainBase.java:191)
    at gov.nasa.pds.tools.validate.task.ValidationTask.execute(ValidationTask.java:134)
    at gov.nasa.pds.tools.validate.task.BlockingTaskManager.submit(BlockingTaskManager.java:27)
    at gov.nasa.pds.tools.label.LocationValidator.validate(LocationValidator.java:163)
    at gov.nasa.pds.validate.ValidateLauncher.doValidation(ValidateLauncher.java:1226)
    at gov.nasa.pds.validate.ValidateLauncher.processMain(ValidateLauncher.java:1423)
    at gov.nasa.pds.validate.ValidateLauncher.main(ValidateLauncher.java:1466)
PDS Validate Tool Report

jordanpadams commented 4 years ago

Looks like ByteBuffer.allocate tries to allocate a buffer for the entire file at once, when the read probably needs to be chunked
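
To illustrate what "chunked" could mean here, the following is a minimal sketch that streams a file through one small, reusable buffer instead of sizing the buffer to the file. It is not the fix that was eventually implemented in PDS4-JParser; `ChunkedReader` and `countBytes` are hypothetical names, and a real table reader would parse records where this sketch only counts bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch for illustration; not the actual PDS4-JParser fix.
final class ChunkedReader {

    /** Streams the file through a fixed-size buffer and returns the bytes seen. */
    static long countBytes(Path file, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // The buffer size is bounded by chunkSize, never by the file size.
        ByteBuffer buffer = ByteBuffer.allocate(chunkSize);
        long total = 0; // long, so totals above 2 GiB do not overflow
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();               // switch the buffer to reading mode
                total += buffer.remaining(); // a real reader would parse records here
                buffer.clear();              // reuse the same buffer for the next chunk
            }
        }
        return total;
    }
}
```

Memory use stays at `chunkSize` bytes regardless of file size, and the running total is kept in a `long`.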

mcayanan commented 4 years ago

@jordanpadams Ya that looks to be the issue. Specifically, it's trying to allocate a total size of 2,687,074,568 bytes, which is greater than the max int value (2,147,483,647). I forgot exactly how large arrays (greater than 2GB) are currently handled, but I would imagine the tool should be updated similarly on the large table end.
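
The arithmetic behind this can be sketched as follows. The example assumes the byte count reaches `ByteBuffer.allocate` through a narrowing cast from `long` to `int`; the thread does not show that line of `ByteWiseFileAccessor`, so the cast is an assumption, and `OverflowDemo` is a hypothetical class for illustration only.

```java
import java.nio.ByteBuffer;

// Hypothetical demonstration; the narrowing cast is an assumption about the library code.
final class OverflowDemo {

    /** Narrows a long byte count to int the way a plain (int) cast does. */
    static int narrow(long byteCount) {
        return (int) byteCount;
    }

    /**
     * Returns true if allocating a buffer of the narrowed size is rejected.
     * Only call this with counts that wrap negative; a positive result
     * would attempt a real allocation of that many bytes.
     */
    static boolean allocationFails(long byteCount) {
        try {
            ByteBuffer.allocate(narrow(byteCount));
            return false;
        } catch (IllegalArgumentException e) {
            return true; // ByteBuffer.allocate rejects negative capacities
        }
    }

    public static void main(String[] args) {
        long requested = 2_687_074_568L;                  // size quoted above
        System.out.println(narrow(requested));            // prints -1607892728
        System.out.println(allocationFails(requested));   // prints true
    }
}
```

Where a value must fit in an `int`, `Math.toIntExact` throws an `ArithmeticException` on overflow instead of wrapping silently, which would surface the problem at the conversion rather than at the allocation.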

jordanpadams commented 4 years ago

Thanks @mcayanan . I knew I saw us buffering somewhere else in the code so this just needs to be updated to do the same. Thanks for the tip!

jordanpadams commented 4 years ago

Duplicate of NASA-PDS/pds4-jparser#21

jordanpadams commented 3 years ago

@hhlee445 see comments above: https://github.com/NASA-PDS/validate/issues/189#issuecomment-595465083 and https://github.com/NASA-PDS/validate/issues/189#issuecomment-595847112

msbentley commented 3 years ago

> @mcayanan do you remember the details for this 2GB cap? or is it more ~2GB?

Hi @jordanpadams would it be possible to clarify this? We need to work around by programmatically skipping validation for products over this threshold, and it would be good to confirm the exact value!

jordanpadams commented 3 years ago

@msbentley unfortunately this is a tough thing to test for exactness. @hhlee445 is in the process of implementing a fix as we speak, so if you can wait another week or 2, we may be able to use that version of the software.

msbentley commented 3 years ago

OK, thanks @jordanpadams and @hhlee445 - I can wait :+1:

jordanpadams commented 3 years ago

@msbentley @sslavney this should now be fixed. feel free to try out the latest snapshot version of validate here:

https://github.com/NASA-PDS/validate/releases/tag/1.25.0-SNAPSHOT

thanks to @hhlee445 for the excellent work here!

Note: this update did not make it into Build 11.0 I&T, and since it is a pretty significant change to how we read in data files, I would prefer it be rigorously tested prior to an official release in the Spring.