NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

Run the synchronization of LDD as needed only #159

Closed tloubrieu-jpl closed 3 months ago

tloubrieu-jpl commented 4 months ago

💡 Description

I believe that the previous version of harvest kept track of when the LDD files were last updated (in OpenSearch?) and loaded them only as needed.

It sounds like now they are always loaded, which takes a lot of time.


al-niessner commented 4 months ago

@tloubrieu-jpl

I do not understand the problem statement. What are you running, and what evidence do you see to support the claims?

tloubrieu-jpl commented 3 months ago

Hi @al-niessner ,

Each time harvest runs, the log in stdout shows:

[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-10726032647501130848.JSON
Jul 29, 2024 12:40:21 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
Jul 29, 2024 12:40:21 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
[INFO] 404 - Not Found
[INFO] Will retry in 5 seconds
[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-10726032647501130848.JSON
Jul 29, 2024 12:40:27 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
Jul 29, 2024 12:40:27 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
[INFO] 404 - Not Found
[INFO] Will retry in 5 seconds
[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-10726032647501130848.JSON
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=Bu4xtmZPA7wx/G/FFD6Kpb2x4QbdUgjcA+x6P8iHf65PDsGxNok1WaI2reeCG3RwGt0MKZS42nOMcCuHtz6zRnahb0aGfUZhcZrKc3SX0RyX3AsbarfcSjLVF5Nb; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=Bu4xtmZPA7wx/G/FFD6Kpb2x4QbdUgjcA+x6P8iHf65PDsGxNok1WaI2reeCG3RwGt0MKZS42nOMcCuHtz6zRnahb0aGfUZhcZrKc3SX0RyX3AsbarfcSjLVF5Nb; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
[INFO] 404 - Not Found
[ERROR] Could not download https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON
[WARN] Will use 'keyword' data type.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-3957812975366445923.JSON
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
...
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-13819887072218788606.JSON
Jul 29, 2024 12:42:47 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
Jul 29, 2024 12:42:47 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
[INFO] Creating temporary ES data file /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-5263073179721369010.json
[INFO] Loading ES data file: /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-5263073179721369010.json
[ERROR] failed to upload all documents
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 49 fields
[INFO] Processing /Users/loubrieu/git/registry-ref-data/custom-datasets/urn-nasa-pds-insight_rad/data_calibrated/collection_data_rad_calibrated.xml
[INFO] Updating LDDs.
...

Before harvesting anything, it takes minutes to go through these LDD URLs.

I am really not sure about my initial statement in this ticket, that the issue is caused by cache management that does not work. But something needs to be done to shorten this step.

al-niessner commented 3 months ago

@tloubrieu-jpl

Yes, I have seen those. The 404 messages look like it is trying to load an LDD, fails, and then maybe just keeps retrying. I will dive in to understand this better.

al-niessner commented 3 months ago

@tloubrieu-jpl

Got out the shovel and have a nice deep hole. Is this error because of the data contained in registry-ref-data? The code that is giving you the error has not changed in 2 years according to blame.

https://github.com/NASA-PDS/registry-common/blame/01ffab0674cdf5f5fe7bc6b3802dbb44d7ec6e1b/src/main/java/gov/nasa/pds/registry/common/es/service/SchemaUpdater.java#L103

The code takes a valid XSD URL from a product and tries to load the OpenSearch schema from a JSON file in the same location. I assume that a more modern XSD has been translated to JSON and would work. I took the XSD URL (https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.xsd) and put it into my browser, and it loads. Change the extension to json or JSON and it returns 404. So the error is real. For the last two years the code has expected this XSD-to-JSON relationship to exist and work. Hence, there does not seem to be anything to fix here, but there may be a problem with registry-ref-data.
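
To make the relationship described above concrete, here is a minimal sketch of deriving the JSON URL from the XSD URL. It is not the actual SchemaUpdater code: the class and method names are hypothetical, and the only behavior assumed is what the log lines show, namely that a schema location ending in .xsd is requested again with a .JSON extension.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of harvest or registry-common. */
final class LddUrlSketch {
    private static final String XSD_SUFFIX = ".xsd";

    private LddUrlSketch() {
    }

    /**
     * Swaps a trailing ".xsd" for ".JSON", mirroring the pair of URLs
     * that appear in the harvest log ("Schema location" vs. "Downloading").
     */
    static String jsonUrlFor(String xsdUrl) {
        if (!xsdUrl.toLowerCase(Locale.ROOT).endsWith(XSD_SUFFIX)) {
            throw new IllegalArgumentException("Not an XSD URL: " + xsdUrl);
        }
        String base = xsdUrl.substring(0, xsdUrl.length() - XSD_SUFFIX.length());
        return base + ".JSON";
    }
}
```

Calling LddUrlSketch.jsonUrlFor with the InSight XSD URL above yields the PDS4_INSIGHT_1A10_1830.JSON URL that returns 404 in the log, which is why the download fails even though the XSD itself is reachable.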

tloubrieu-jpl commented 3 months ago

It sounds like the errors we are getting come from: 1) an inaccessible LDD

[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-35954800810348187.JSON
[INFO] 404 - Not Found
[ERROR] Could not download https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON
[WARN] Will use 'keyword' data type.

2) an unparsable LDD

[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-889613878230952996.JSON
[INFO] Creating temporary ES data file /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-9477954201964317048.json
[INFO] Loading ES data file: /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-9477954201964317048.json
[ERROR] failed to upload all documents
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 49 fields

All of them are retried each time harvest runs, and harvest goes through them again for every impacted product. Maybe the dataset I am using for testing is obsolete and references too many wrong LDDs. See https://github.com/NASA-PDS/registry-ref-data/tree/main/custom-datasets

tloubrieu-jpl commented 3 months ago

This is not a bug. I will close this ticket and create a new one to check the LDD references in the reference test dataset.

tloubrieu-jpl commented 3 months ago

New ticket is https://github.com/NASA-PDS/registry-ref-data/issues/7

tloubrieu-jpl commented 1 month ago

I am re-opening because users complain about it.

@al-niessner I believe we should investigate how harvest can cache the LDDs that have already been retrieved, so that we don't try to download them again for each product of a collection.

This could also apply to unreachable LDDs. We can try and retry to get them when they are first seen in the harvest job, but if they are encountered again later in the same job, we should not try to fetch them again.
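
The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions, not harvest's implementation: LddFetchCache and its fetcher parameter are hypothetical names, the fetcher stands in for whatever download-with-retries logic harvest already has, and the cache is assumed to live for a single harvest job. Failed downloads are remembered as an empty Optional, so an unreachable LDD is attempted only the first time it is seen.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical per-job cache of LDD download outcomes; for illustration only. */
final class LddFetchCache {
    // URL -> outcome. Optional.empty() records a failed download (negative cache entry).
    private final Map<String, Optional<String>> outcomes = new ConcurrentHashMap<>();

    // Assumed to download the LDD (with its own retries) and return the local file path,
    // or Optional.empty() if the LDD could not be retrieved. Must never return null.
    private final Function<String, Optional<String>> fetcher;

    LddFetchCache(Function<String, Optional<String>> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the remembered outcome, invoking the fetcher only the first time a URL is seen. */
    Optional<String> get(String lddUrl) {
        return outcomes.computeIfAbsent(lddUrl, fetcher);
    }
}
```

With a cache like this shared across all products of a collection, each distinct LDD URL would be downloaded (or fail) at most once per job. Whether outcomes should also persist between jobs, for example in OpenSearch as suggested at the top of this ticket, is a separate design question that this sketch does not address.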