Closed tloubrieu-jpl closed 3 months ago
@tloubrieu-jpl
I do not understand the problem statement. What are you running, and what evidence do you see to support the claims?
Hi @al-niessner,
Each time harvest runs, the log in stdout shows:
[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-10726032647501130848.JSON
Jul 29, 2024 12:40:21 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
Jul 29, 2024 12:40:21 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
[INFO] 404 - Not Found
[INFO] Will retry in 5 seconds
[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-10726032647501130848.JSON
Jul 29, 2024 12:40:27 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
Jul 29, 2024 12:40:27 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=lzCc2r04FyNKpzTJ9dSEQ0HNaXiP1phG9Pt2b1PUWq2r0++CCAeHq9aBDGXj7RqmHYwT0jfxasvS66G+t5SCiXpuMA5uZNaop+lk4l23Wz33keUuz7shSY2YPlu3; Expires=Mon, 05 Aug 2024 16:40:21 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:21 GMT
[INFO] 404 - Not Found
[INFO] Will retry in 5 seconds
[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-10726032647501130848.JSON
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=Bu4xtmZPA7wx/G/FFD6Kpb2x4QbdUgjcA+x6P8iHf65PDsGxNok1WaI2reeCG3RwGt0MKZS42nOMcCuHtz6zRnahb0aGfUZhcZrKc3SX0RyX3AsbarfcSjLVF5Nb; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=Bu4xtmZPA7wx/G/FFD6Kpb2x4QbdUgjcA+x6P8iHf65PDsGxNok1WaI2reeCG3RwGt0MKZS42nOMcCuHtz6zRnahb0aGfUZhcZrKc3SX0RyX3AsbarfcSjLVF5Nb; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
[INFO] 404 - Not Found
[ERROR] Could not download https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON
[WARN] Will use 'keyword' data type.
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-3957812975366445923.JSON
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
Jul 29, 2024 12:40:32 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
...
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-13819887072218788606.JSON
Jul 29, 2024 12:42:47 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALB=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
Jul 29, 2024 12:42:47 PM org.apache.http.client.protocol.ResponseProcessCookies processCookies
WARNING: Invalid cookie header: "Set-Cookie: AWSALBCORS=O9ab4l1XKNoZhh9mtNm7hDzzcOMInRHk61Xsqpt0V6tQ76E6pLxGacPqiuH4NXbKC8zvcwXstdfMRVQKWHRQVSr+98U7ynPXj9JWNTwpz/yDRNs2ksmUaY07s82h; Expires=Mon, 05 Aug 2024 16:40:32 GMT; Path=/; SameSite=None; Secure". Invalid 'expires' attribute: Mon, 05 Aug 2024 16:40:32 GMT
[INFO] Creating temporary ES data file /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-5263073179721369010.json
[INFO] Loading ES data file: /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-5263073179721369010.json
[ERROR] failed to upload all documents
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 49 fields
[INFO] Processing /Users/loubrieu/git/registry-ref-data/custom-datasets/urn-nasa-pds-insight_rad/data_calibrated/collection_data_rad_calibrated.xml
[INFO] Updating LDDs.
...
Before anything is harvested, it takes minutes to go through these LDD URLs.
I am no longer sure about my initial statement in this ticket, that the issue is cache management that does not work. But something needs to be done to shorten this step.
@tloubrieu-jpl
Yes, I have seen those. The 404 messages look like it is trying to load an LDD, fails, and then maybe just keeps retrying. I will dive in to understand this better.
@tloubrieu-jpl
Got out the shovel and have a nice deep hole. Is this error because of the data contained in registry-ref-data? The code that is giving you the error has not changed in 2 years according to blame.
The code takes a valid XSD URL from a product and tries to load the OpenSearch schema from a JSON file in the same location. I assume that a more modern XSD has been translated this way and would work. I took the XSD URL (https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.xsd) and put it into my browser, and it loads. Change the extension to .json and/or .JSON and it returns a 404. So the error is real. For the last two years the code has expected this relationship to exist and work. Hence, there does not seem to be anything to fix here, but there may be a problem with registry-ref-data.
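To illustrate the XSD-to-JSON relationship described above, here is a minimal Python sketch. It is not harvest's actual code. The function name `ldd_json_url` is hypothetical, and the rule of swapping the `.xsd` extension for `.JSON` is only inferred from the URLs shown in the log.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def ldd_json_url(xsd_url: str) -> str:
    """Derive the LDD JSON URL expected to sit next to an XSD URL.

    Hypothetical helper: the '.xsd' -> '.JSON' rule is an assumption
    based on the URL pairs visible in the harvest log.
    """
    parts = urlsplit(xsd_url)
    root, ext = posixpath.splitext(parts.path)
    if ext.lower() != ".xsd":
        raise ValueError(f"not an XSD URL: {xsd_url}")
    # Keep scheme, host and directory; only the file extension changes.
    return urlunsplit(parts._replace(path=root + ".JSON"))


if __name__ == "__main__":
    print(ldd_json_url(
        "https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.xsd"
    ))
```

For the InSight schema, this yields the same `.JSON` URL that the log reports as 404, so the failure comes from the file missing on the server and not from how the URL is built.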
It sounds like the errors we are getting come from: 1) an inaccessible LDD
[INFO] Downloading https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-35954800810348187.JSON
[INFO] 404 - Not Found
[ERROR] Could not download https://pds.nasa.gov/pds4/mission/insight/v1/PDS4_INSIGHT_1A10_1830.JSON
[WARN] Will use 'keyword' data type.
2) an unparsable LDD
[INFO] Updating 'pds' LDD. Schema location: https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.xsd
[INFO] Downloading https://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1A10.JSON to /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/LDD-889613878230952996.JSON
[INFO] Creating temporary ES data file /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-9477954201964317048.json
[INFO] Loading ES data file: /var/folders/mw/1y5l8kz55h94xvjy67f2v8rh0000gp/T/es-9477954201964317048.json
[ERROR] failed to upload all documents
[WARN] Will use field definitions from [PDS4_PDS_1500.JSON, PDS4_PDS_1F00.JSON]
[INFO] Updating Elasticsearch schema.
[INFO] Updated 49 fields
All of them are retried each time harvest runs, for every impacted product. Maybe the dataset I am using for testing is obsolete and references too many wrong LDDs. See https://github.com/NASA-PDS/registry-ref-data/tree/main/custom-datasets
This is not a bug. I will close this ticket and create a new one to check the LDD references in the reference test dataset.
New ticket is https://github.com/NASA-PDS/registry-ref-data/issues/7
I am re-opening this because users are complaining about it.
@al-niessner I believe we should investigate how harvest can cache the LDDs that have already been retrieved, so that we don't try to download them again for each product of a collection.
This could also apply to unreachable LDDs. We can try and retry to fetch them when they are first seen in the harvest job, but if they come up again later, we should not try to fetch them again.
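The proposal above could be sketched as follows, in Python for brevity. This is not harvest's implementation. The class `LddCache` and the injected `fetch` callable are hypothetical. The defaults of three attempts and a 5-second delay are assumptions taken from the retry pattern visible in the log.

```python
import time
from typing import Callable, Dict, Optional


class LddCache:
    """Per-job cache of LDD downloads, including failed ones.

    Hypothetical sketch: `fetch` returns the file content, or None
    when the download fails (for example on a 404).
    """

    def __init__(
        self,
        fetch: Callable[[str], Optional[bytes]],
        max_attempts: int = 3,
        retry_delay_s: float = 5.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._max_attempts = max_attempts
        self._retry_delay_s = retry_delay_s
        self._sleep = sleep
        # URL -> content, or None for an LDD known to be unreachable.
        self._results: Dict[str, Optional[bytes]] = {}

    def get(self, url: str) -> Optional[bytes]:
        # Seen before in this job: reuse the outcome, success or failure.
        if url in self._results:
            return self._results[url]
        content: Optional[bytes] = None
        for attempt in range(self._max_attempts):
            content = self._fetch(url)
            if content is not None:
                break
            if attempt + 1 < self._max_attempts:
                self._sleep(self._retry_delay_s)
        # Remember failures too, so later products skip the retries.
        self._results[url] = content
        return content
```

With one `LddCache` shared across all products of a harvest job, each LDD URL costs at most `max_attempts` requests per job, whether or not it is reachable. A cache that survives between runs would additionally need persistent storage, which this sketch does not cover.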
💡 Description
I believe that with the previous version of harvest, we kept track of when the LDD files were last updated (in OpenSearch?) and loaded them only as needed.
It sounds like they are now always loaded, which takes a lot of time.