NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

m2020 PIXL bundle is not properly loading all collections/products #143

Closed jordanpadams closed 7 months ago

jordanpadams commented 10 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I try to load the m2020 PIXL bundle, it doesn't load the urn:nasa:pds:mars2020_pixl:data_oxides_pmc collection.

🕵️ Expected behavior

All products within this collection should be loaded.

📜 To Reproduce

2023-11-28 12:33:03,517 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_oxides_pmc\collection_data_oxides_pmc.xml
2023-11-28 12:33:04,486 [INFO] Wrote 1 collection inventory document(s)

Most other collections have more documents like:

2023-11-28 12:33:07,252 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_ancillary\collection_data_raw_ancillary.xml
2023-11-28 12:33:09,596 [INFO] Wrote 20 collection inventory document(s)

Running registry-mgr on the bundle gives:

[INFO] Setting product status. LIDVID = urn:nasa:pds:mars2020_pixl::2.0, status = archived
[WARN] Collection urn:nasa:pds:mars2020_pixl:data_imaging::8.0 doesn't have primary products.
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0558_0716511093_000rqa__02800001941181450000___j01::1.0]: document missing
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0558_0716511093_000rqb__02800001941181450000___j01::1.0]: document missing
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0558_0716511093_000rqc__02800001941181450000___j01::1.0]: document missing
[ERROR] [_doc][urn:nasa:pds:mars2020_pixl:data_oxides_pmc:ps__0560_0716654103_000rqa__02800001947735050000___j01::1.0]: document missing
...

🖥 Environment Info

Windows Enterprise Server

📚 Version of Software Used

Harvest version: 3.9.0-SNAPSHOT Build time: 2023-10-31T16:02:29Z

🩺 Test Data / Additional context

https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/

🦄 Related requirements

All

⚙️ Engineering Details

No response

al-niessner commented 10 months ago

@jordanpadams

The whole bundle (complete test) requires about 4 GB of disk space. Are we going to want this as a test via Postman, etc.?

jordanpadams commented 10 months ago

@al-niessner Not the entire data set. Hopefully we can pick out a representative test case to include, if possible.

al-niessner commented 10 months ago

@jordanpadams

I was not expecting problems, so I did not save the log, but validate does not pass:

Summary:

  2 error(s)
  829 warning(s)

  Product Validation Summary:
    11137      product(s) passed
    1          product(s) failed
    0          product(s) skipped

  Referential Integrity Check Summary:
    11138      check(s) passed
    0          check(s) failed
    0          check(s) skipped

  Message Types:
    2            error.pdf.file.not_pdfa_compliant
    829          warning.integrity.reference_not_found

End of Report
Completed execution in 12324484 ms

I will rerun and detail the problems to make sure they are not the cause. Will report back after it has run again.

al-niessner commented 10 months ago

@jordanpadams @scholes-ds

Sorry, but I cannot reproduce this problem. I used wget to grab the bundle and bring it local to my machine for testing.
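For reference, a recursive wget along these lines would mirror the bundle locally (the exact flags are an assumption about how the fetch could be done, not a record of the actual command used):

```shell
# Mirror the PIXL bundle tree locally; --no-parent keeps wget from ascending
# above the bundle root, --no-host-directories/--cut-dirs drop the host and
# m2020 prefix so the local tree starts at urn-nasa-pds-mars2020_pixl.
wget --recursive --no-parent --no-host-directories --cut-dirs=1 \
     --reject "index.html*" \
     https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-mars2020_pixl/
```

Per the comment above, the full tree is about 4 GB, so this takes a while.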

While validate finds a failure:

  FAIL: file:/home/niessner/Projects/PDS/validate/src/test/resources/harvest141/document/pixl_edr_sis.xml
      ERROR  [error.pdf.file.not_pdfa_compliant]   Validation failed for flavour PDF/A-1b in file mars2020_pixl_labels_sort_pds.pdf.
      ERROR  [error.pdf.file.not_pdfa_compliant]   Validation failed for flavour PDF/A-1b in file mars2020_pixl_labels_sort_vicar.pdf.

harvest simply does not care about it, nor is it in this ticket's purview to care about PDFs. The point is, harvest and validate agree on the total number of files (see the comment above for the validate numbers):

[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 11138
[SUMMARY]   Product_Bundle: 1
[SUMMARY]   Product_Collection: 6
[SUMMARY]   Product_Document: 5
[SUMMARY]   Product_Observational: 11126
[SUMMARY] Failed files: 0
[SUMMARY] Package ID: 4976b2fd-e5bf-4a7d-a685-6cf43f3adb80

No matter how many times I wipe and fiddle with the configuration file, all files load. I do not see the missing references when changing the state from staged to archived. I even dug into the code to watch it through the debugger, and there is absolutely nothing that would make it not descend into the data oxides sol directories.

What was reported in the user's harvest log:

2023-11-28 12:53:06,196 [SUMMARY] Summary:
2023-11-28 12:53:06,196 [SUMMARY] Skipped files: 0
2023-11-28 12:53:06,196 [SUMMARY] Loaded files: 11036
2023-11-28 12:53:06,196 [SUMMARY]   Product_Bundle: 1
2023-11-28 12:53:06,196 [SUMMARY]   Product_Collection: 6
2023-11-28 12:53:06,196 [SUMMARY]   Product_Document: 5
2023-11-28 12:53:06,196 [SUMMARY]   Product_Observational: 11024
2023-11-28 12:53:06,196 [SUMMARY] Failed files: 1
2023-11-28 12:53:06,196 [SUMMARY] Package ID: c4a137a0-88cc-49be-96f9-17f4222a0a50

I should note that the failed files may be a batch of them, not a single file, but I am not sure why it says "read" when it writes in batches. Also, the difference in files loaded, 11138 - 11036 = 102, is the missing data oxides sol files. Same with Product_Observational.

To be clear, let's cover what actually happens. harvest first searches the given directory and no other (it does not descend) for the bundle. It finds it, reads it, and pushes it into the DB. Once the bundle is done, it then descends through all directories below the one given for collections. Both harvest log files show the same collections being processed:

2023-11-28 12:32:59,517 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_imaging\collection_data_imaging.xml
2023-11-28 12:33:03,471 [INFO] Wrote 26 collection inventory document(s)
2023-11-28 12:33:03,517 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_oxides_pmc\collection_data_oxides_pmc.xml
2023-11-28 12:33:04,486 [INFO] Wrote 1 collection inventory document(s)
2023-11-28 12:33:05,127 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_processed\collection_data_processed.xml
2023-11-28 12:33:05,486 [INFO] Wrote 1 collection inventory document(s)
2023-11-28 12:33:07,252 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_ancillary\collection_data_raw_ancillary.xml
2023-11-28 12:33:09,596 [INFO] Wrote 20 collection inventory document(s)
2023-11-28 12:33:49,847 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_spectroscopy\collection_data_raw_spectroscopy.xml
2023-11-28 12:33:50,644 [INFO] Wrote 2 collection inventory document(s)
2023-11-28 12:33:54,019 [INFO] Processing collection \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\document\collection_document.xml

The numbers that it writes are array blocks of 500 references or less, which is why they vary. It then goes on to descend through all directories again and processes all products. It uses incredibly generic Java code provided by the JDK to find all the files, so it is incredibly unlikely to be influenced by this specific use case. Since neither harvest configuration file (the one used for testing and the one provided by the user) has include/exclude filters, all non-bundle and non-collection items are processed. In my log file it looks like:

[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqb__00417120483005510000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqc__00417120483005510000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00125/ps__0125_0678032243_000rqc__00417120483005510000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqa__00518120528225320000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqa__00518120528225320000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqb__00518120528225320000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqb__00518120528225320000___j01.xml
/home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqc__00518120528225320000___j01.xml
[INFO] Processing product /home/niessner/Projects/PDS/harvest/src/test/resources/github143/data_oxides_pmc/sol_00138/ps__0138_0679216551_000rqc__00518120528225320000___j01.xml
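The generic JDK traversal referred to above is essentially `Files.walk`; a minimal sketch of that style of product discovery (the class and method names here are hypothetical, not harvest's actual code):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WalkSketch {
    // Hypothetical helper: find all .xml labels under a root the way a
    // generic JDK traversal would — it descends into every subdirectory,
    // so data_oxides_pmc/sol_* directories cannot be silently skipped.
    static List<Path> findLabels(Path root) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".xml"))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree mimicking a data_oxides_pmc/sol_* layout.
        Path root = Files.createTempDirectory("github143");
        Path sol = Files.createDirectories(root.resolve("data_oxides_pmc/sol_00125"));
        Files.createFile(sol.resolve("ps__0125_a.xml"));
        Files.createFile(sol.resolve("ps__0125_a.csv")); // data file, not a label
        System.out.println(findLabels(root).size()); // prints 1
    }
}
```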

Processing of these files is obviously missing from the log given to us via email. The products are then batched into the DB. If the batch write is not successful, we should see error messages like:

2023-11-28 12:42:16,507 [INFO] Processing product \\isilon-pri-data\pds-san\data\m2020\urn-nasa-pds-mars2020_pixl\data_raw_ancillary\sol_00489\PE__0489_0710387107_000E08__02610041706562580025___J04.xml
2023-11-28 12:42:21,944 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (5 retries remaining)
2023-11-28 12:42:27,616 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (4 retries remaining)
2023-11-28 12:42:33,335 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (3 retries remaining)
2023-11-28 12:42:39,053 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (2 retries remaining)
2023-11-28 12:42:44,788 [WARN] DataLoader.loadBatch() request failed due to "Read timed out" (1 retries remaining)
2023-11-28 12:42:50,600 [ERROR] Read timed out
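The retry behavior in those messages can be sketched as follows; `withRetries` and the messages it prints are an illustration of what the log shows, not harvest's actual DataLoader implementation:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {
    // Attempt op once, then up to maxRetries more times on I/O failure,
    // warning with the remaining-retry count like the harvest log does.
    static <T> T withRetries(Callable<T> op, int maxRetries) throws Exception {
        Exception last = null;
        for (int remaining = maxRetries; remaining >= 0; remaining--) {
            try {
                return op.call();
            } catch (IOException ex) {
                last = ex;
                if (remaining > 0) {
                    System.out.println("[WARN] request failed due to \""
                            + ex.getMessage() + "\" (" + remaining + " retries remaining)");
                }
            }
        }
        throw last; // surfaces as the final [ERROR] line
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Succeeds on the third attempt, producing two WARN lines first.
        String result = withRetries(() -> {
            if (++attempts[0] < 3) throw new IOException("Read timed out");
            return "ok";
        }, 5);
        System.out.println(result);
    }
}
```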

However, the log shows that it descends into the directory even when it fails with a timeout. It is quite possible that this batch of data did not make it into the DB, but it would be part of data_raw_ancillary, not data_oxides_pmc. I am using a local DB for testing and have never experienced this failure. The other interesting part is that I get a lot of messages that are not seen in the user-supplied harvest log:

[INFO] Updating 'mars2020' LDD. Schema location: https://pds.nasa.gov/pds4/mission/mars2020/v1/PDS4_MARS2020_1G00_1000.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/mission/mars2020/v1/PDS4_MARS2020_1G00_1000.JSON to /tmp/LDD-12555520411772030214.JSON
Dec 01, 2023 12:54:09 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=YVxwgger9zQer3z+a4LGBwjyIYQem/b+KtkIIiMSNjGAJjmhuZKKFwu1CKXV6FRWKcVxMr9hfwFGqpFZTKKucMW4Tbu+z+2fUizHlF/jGpvQl9UHIoPJtqj6P31i; Expires=Fri, 08 Dec 2023 18:30:58 GMT; Path=/". Invalid 'expires' attribute: Fri, 08 Dec 2023 18:30:58 GMT
Dec 01, 2023 12:54:09 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=YVxwgger9zQer3z+a4LGBwjyIYQem/b+KtkIIiMSNjGAJjmhuZKKFwu1CKXV6FRWKcVxMr9hfwFGqpFZTKKucMW4Tbu+z+2fUizHlF/jGpvQl9UHIoPJtqj6P31i; Expires=Fri, 08 Dec 2023 18:30:58 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Fri, 08 Dec 2023 18:30:58 GMT
[INFO] Creating temporary ES data file /tmp/es-6622149530635502458.json
[INFO] Loading ES data file: /tmp/es-6622149530635502458.json
[INFO] Loaded 321 document(s)

I do not know if this is an environment difference or a harvest version difference. Given the code, I doubt it adds or subtracts from the reported problem. However, it might if the harvest used by the user is sufficiently old and that code is not such generic Java.

That leaves just one other option: the data oxides sol directories were not present when harvest was run. Not present could simply mean not readable by the user running harvest. Since I grabbed them from the net, the server could be storing them in a completely different file tree and the wget recombined them into one tree. I could invent plenty more similar stories for "they looked to be there but were not when harvest ran", but like the ones here, they are just stories.
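One way to test the "not readable" story would be a quick traversability check over the collection root, something like the following (the path and walk depth are placeholders, and on Windows the execute-bit semantics differ, so treat this as a sketch):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class ReadableCheck {
    public static void main(String[] args) throws IOException {
        // Placeholder path; point it at the data_oxides_pmc root in question.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> dirs = Files.walk(root, 4)) {
            // A directory the harvest user cannot read or traverse would
            // make its contents invisible to Files.walk-style discovery.
            dirs.filter(Files::isDirectory)
                .filter(d -> !Files.isReadable(d) || !Files.isExecutable(d))
                .forEach(d -> System.out.println("NOT TRAVERSABLE: " + d));
        }
        System.out.println("done");
    }
}
```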

jordanpadams commented 9 months ago

@scholes-ds if you are still encountering this issue with the latest snapshot of harvest, let us know and we will move this data set over to a Windows VM to see if we can reproduce over there.

jordanpadams commented 7 months ago

Closing as invalid for the time being, but will re-open if this is still an issue.

jordanpadams commented 7 months ago

Confirmed with the user that this is no longer an issue.