Investigate if when multiple harvest run at the same time, there can be an issue with downloading the same LDD in parallel

NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).

https://nasa-pds.github.io/registry


Investigate if when multiple harvest run at the same time, there can be an issue with downloading the same LDD in parallel #198

Open tloubrieu-jpl opened 1 month ago

tloubrieu-jpl commented 1 month ago

💡 Description

I am thinking of a possible write conflict on the temporary LDD file created locally.

⚔️ Parent Epic / Related Tickets

No response

al-niessner commented 1 month ago

@tloubrieu-jpl

Besides some strange log messages, not a problem.

The synchronization takes place inside OpenSearch, not outside of it. The various harvests will write the same fields and types in a batch. The first will succeed, while the others will get a message back that causes the "Updated N fields" count to be smaller. There might be another message about the fields already being there, but maybe not, depending on various factors. The point is that harvest will just press forward with doing its job.

jordanpadams commented 1 month ago

@tloubrieu-jpl thoughts?

tloubrieu-jpl commented 1 month ago

@al-niessner, aren't the JSON files downloaded before their content is processed? If they are downloaded, is there a risk that two harvest processes download the LDD files to the same location, so that one process tries to write an LDD file while another reads it?

That might not be a problem but I want to confirm that before we can close this ticket.
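
To make the concern concrete, here is a minimal Python sketch, not Harvest's actual Java code, of one common way to keep a reader from seeing a half-written local file: each writer uses its own temporary file and then renames it over the target. The function name `write_atomically` and the file name `ldd.json` are hypothetical, and whether Harvest writes LDD files to a shared local path at all is exactly what is being asked here.

```python
import os
import tempfile


def write_atomically(target_path: str, content: bytes) -> None:
    """Write content to target_path so readers never observe a partial file.

    Hypothetical helper for illustration; not part of Harvest.
    """
    directory = os.path.dirname(os.path.abspath(target_path))
    # Each writer gets a uniquely named temporary file in the target
    # directory, so two processes never write to the same path.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(content)
        # os.replace swaps the file in atomically when the source and the
        # destination are on the same filesystem.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "ldd.json")  # hypothetical file name
        write_atomically(path, b'{"example": true}')
        with open(path, "rb") as handle:
            print(handle.read())
```

With this pattern, a process that reads `ldd.json` sees either the previous complete file or the new complete file, never a mix of the two.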

al-niessner commented 1 month ago

Harvest downloads the mapping contents at some point. It compares them with the LDD to find the fields it sees as new, then batches those back up to the mappings. It counts only the fields that succeed in the batch as uploaded. That is why it does not matter: the batch uses OpenSearch batch mappings, so only one of multiple writes takes effect. We were testing this earlier, before it was fixed, because harvest was not reading the mappings correctly and kept sending the whole LDD, but then reported 0 updates. Hence, we already know it is not a problem.
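
The behavior described above can be modeled with a short Python sketch. It is an in-memory simulation under stated assumptions, not Harvest code and not the OpenSearch API: the mapping is a plain dictionary, and `fields_to_add`, `apply_batch`, and the field names are all hypothetical.

```python
from typing import Dict


def fields_to_add(ldd_fields: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Return the LDD fields that are not yet present in the mapping."""
    return {name: kind for name, kind in ldd_fields.items() if name not in mapping}


def apply_batch(batch: Dict[str, str], mapping: Dict[str, str]) -> int:
    """Apply a batch to the mapping and count only fields actually added.

    Stands in for the server-side update: a field that is already present
    is left alone and does not count as an update.
    """
    updated = 0
    for name, kind in batch.items():
        if name not in mapping:
            mapping[name] = kind
            updated += 1
    return updated


if __name__ == "__main__":
    mapping = {"lid": "keyword"}  # hypothetical existing mapping
    ldd = {"lid": "keyword", "ex:field_a": "keyword", "ex:field_b": "double"}

    # Two harvests read the mapping before either one writes,
    # so both compute the same batch of "new" fields.
    batch_one = fields_to_add(ldd, mapping)
    batch_two = fields_to_add(ldd, mapping)

    print("Updated", apply_batch(batch_one, mapping), "fields")  # Updated 2 fields
    print("Updated", apply_batch(batch_two, mapping), "fields")  # Updated 0 fields
```

In this model the second harvest does no damage: its count is simply smaller, and the final mapping is the same whichever process writes first.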

tloubrieu-jpl commented 3 weeks ago

The log message found when jobs run in parallel is: [ERROR] Request failed: [resource_already_exists_exception] Update to the indices [geo-registry] failed due to either concurrent update or deletion of the indices

tloubrieu-jpl commented 3 weeks ago

Hi @al-niessner ,

Harvest needs to support running in parallel. Can you investigate the error that Dan received (see previous comment) when he ran multiple harvests in parallel, and fix it?

Thanks,

Thomas

al-niessner commented 3 weeks ago

@tloubrieu-jpl

I need more of the log before that error. The error message above looks like the Java SDK V2 throwing an exception, and more of the log might help me understand where. It is clearly refusing to create something that is already there, but that something may not be the mapping.

tloubrieu-jpl commented 3 weeks ago

Hi @scholes-ds, could you attach or paste a longer section of the logs for this case so that @al-niessner can understand where the message comes from?

Thanks

tloubrieu-jpl commented 3 weeks ago

Hi @al-niessner ,

Actually, @scholes-ds gave me some log context earlier. Here it is:

[INFO] Updated 43 fields
[INFO] Processing product \\isilon-pri-data\pds-san\data\lunar\urn-nasa-pds-pioneer89cdd\calibration\p8_p9_cdd_calib_notebook.xml
[INFO] Updating LDDs.
[INFO] Updating 'pds' LDD. Schema location: http://pds.nasa.gov/pds4/pds/v1/PDS4_PDS_1700.xsd
[INFO] This LDD already loaded.
[INFO] Updating Elasticsearch schema.
[ERROR] Request failed: [resource_already_exists_exception] Update to the indices [geo-registry] failed due to either concurrent update or deletion of the indices

Let me know if you need more.

Thanks

al-niessner commented 3 weeks ago

@tloubrieu-jpl

This is enough. It looks like it happened during the LDD update, not the push of a document, which is why there is an overwrite flag. I will see if I can reproduce it and then ignore the error, assuming nobody deleted the index.
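
As a rough sketch of what "ignore the error, assuming nobody deleted the index" could look like, here is a small Python model. It is not the Java SDK and not Harvest code: `RequestFailed`, `update_schema`, and the `update` and `index_exists` callables are hypothetical stand-ins. Only the error type string is taken from the log above.

```python
from typing import Callable


class RequestFailed(Exception):
    """Hypothetical stand-in for the exception raised by the client library."""

    def __init__(self, error_type: str, message: str) -> None:
        super().__init__("[{}] {}".format(error_type, message))
        self.error_type = error_type


def update_schema(update: Callable[[], None], index_exists: Callable[[], bool]) -> str:
    """Run a schema update, tolerating a concurrent update by another process.

    Returns "updated" on success and "skipped" when the update lost a race
    but the index is still there. Any other failure is re-raised.
    """
    try:
        update()
        return "updated"
    except RequestFailed as err:
        concurrent = err.error_type == "resource_already_exists_exception"
        # The message covers both a concurrent update and a deletion,
        # so confirm the index still exists before ignoring the error.
        if concurrent and index_exists():
            return "skipped"
        raise


if __name__ == "__main__":
    def losing_update() -> None:
        raise RequestFailed(
            "resource_already_exists_exception",
            "Update to the indices [geo-registry] failed",
        )

    print(update_schema(lambda: None, lambda: True))    # updated
    print(update_schema(losing_update, lambda: True))   # skipped
```

The existence check matters because the same message is reported for a deleted index, and that case should still fail loudly.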

tloubrieu-jpl commented 4 days ago

@al-niessner is having difficulty reproducing this error.

tloubrieu-jpl commented 4 days ago

Until we have a better understanding of this issue, we advise discipline nodes not to run harvest processes in parallel.

tloubrieu-jpl commented 4 days ago

Could we create a ticket with AWS to get a better understanding, @sjoshi-jpl?