OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

Integrate METS Server #966 with Web API architecture #1035

Open kba opened 1 year ago

kba commented 1 year ago

[architecture diagram]

@bertsky:

METS server implementation is still ongoing, but if you follow my advice, and agree with my interpretation…

I believe the METS server is an internal detail of the Workers which allows a multitude of threads/processes to work on the same workspace. Of course, we could manage central METS servers from the Processing Server, too. (That would make it an overt component between those two, but independent of the Message Queue.)

…then it would indeed make sense to place the METS Server on the diagram.

Since we would have the Processing Server manage (instantiate, start up, pass along in queue messages together with the workspace ID/path, and tear down) an open swarm of METS Servers, which the Workers then call and which mediate between workspace and file system, I suggest adding the METS Server node-set to the right of the Worker node-set: with incoming arrows from the queue and the workers, and an outgoing arrow to the file system (denoting METS synchronisation). Perhaps there could even be an arrow to the MongoDB (if we want to map METS Server instances in the Workspace model of the database).
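
For illustration, here is a minimal sketch of such a workspace-dedicated METS Server mediating between workers and the file system. FastAPI/uvicorn, the endpoint names, and the exact `Workspace` calls are assumptions for illustration, not the interface of #966:

```python
# Minimal sketch: one process owns the METS document and serializes all
# modifications, so any number of worker threads/processes can act on the
# same workspace without racing on mets.xml.
from threading import Lock

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

from ocrd import Resolver


class AddFileRequest(BaseModel):
    file_grp: str
    file_id: str
    page_id: str
    mimetype: str
    local_filename: str


def create_mets_server(workspace_dir: str) -> FastAPI:
    app = FastAPI(title="METS Server (sketch)")
    workspace = Resolver().workspace_from_url(f"{workspace_dir}/mets.xml")
    lock = Lock()  # serialize all METS mutations within this single process

    @app.post("/file")
    def add_file(req: AddFileRequest):
        with lock:
            # keyword names of Workspace.add_file vary between core versions;
            # they are only meant to illustrate the call
            workspace.add_file(
                req.file_grp,
                file_id=req.file_id,
                page_id=req.page_id,
                mimetype=req.mimetype,
                local_filename=req.local_filename,
            )
        return {"status": "ok"}

    @app.post("/save")
    def save():
        with lock:
            workspace.save_mets()  # the only place mets.xml gets written
        return {"status": "saved"}

    return app


if __name__ == "__main__":
    uvicorn.run(create_mets_server("/data/ws1"), host="127.0.0.1", port=8123)
```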

@tdoan2010:

Robert's idea is the same as mine, so I completely agree.

A METS Server is basically the same as a Processing Worker. It listens to a specific queue (e.g., mets-operations queue), receives messages from the queue and updates the METS file accordingly.
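
A rough sketch of that queue-based variant (the queue name, the message schema, and the pika usage are assumptions for illustration only):

```python
# Sketch: a METS Server behaving like a Processing Worker, consuming a
# workspace-specific queue and applying each message to the METS file.
import json

import pika

from ocrd import Resolver

WORKSPACE_DIR = "/data/ws1"    # assumed workspace location
QUEUE = "mets-operations.ws1"  # assumed per-workspace queue name

workspace = Resolver().workspace_from_url(f"{WORKSPACE_DIR}/mets.xml")

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)


def on_message(ch, method, properties, body):
    msg = json.loads(body)
    if msg["op"] == "add_file":
        # keyword names of Workspace.add_file vary between core versions
        workspace.add_file(msg["file_grp"], **msg["kwargs"])
    elif msg["op"] == "save":
        workspace.save_mets()
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
channel.start_consuming()
```

Note that this implies one such queue (and consumer) per workspace, which is exactly the management overhead questioned below.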

@bertsky:

Yes, that's a good way to put it. A workspace-dedicated, specialised worker. But its lifetime is different from that of a Processing Worker – it gets set up and torn down per /processing or /workflow request, not for the whole server uptime.

@MehmedGIT:

It listens to a specific queue (e.g., mets-operations queue), receives messages from the queue and updates the METS file accordingly.

  1. I think this should be thought through properly. A separate message queue will be needed for each workspace instance. Is that ideal?
  2. This would also mean that the METS Server has to use RabbitMQ Publisher internally. I don't think this would be ideal since parts of ocrd_network will be introduced inside ocrd.

Of course, unless the METS Server is implemented in ocrd_network as an extension of ocrd.Workspace, instead of complicating ocrd.Workspace, as #966 currently feels like doing. But the main reason for #966 was also to cover the HPC cases, where a RabbitMQ is no longer a good fit.

@bertsky:

  1. A separate message queue will be needed for each workspace instance. Is that ideal?
  2. This would also mean that the METS Server has to use RabbitMQ Publisher internally

Definitely not ideal. Plus the fact that we would need to manage these queues (set up + tear down) with the workflow lifetime.

I'd rather have this separate and independent, as started in #966. Still, we should anticipate the changes required here (in the Processing Server, after this is merged and the METS Server gets finished): The METS Server should be another parameter of the job (next to the workspace id/path), and (unless the Workflow Server can do this by itself) there would have to be an endpoint for starting them up and tearing them down.
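
In API terms, that could look roughly as follows; all field names, endpoint paths, and the in-memory registry are hypothetical, purely to make the anticipated changes concrete:

```python
# Sketch: the METS Server becomes an optional job parameter next to the
# workspace id/path, and the Processing Server gains endpoints to start and
# tear down a METS Server per workspace.
from typing import Dict, List, Optional

from fastapi import FastAPI
from pydantic import BaseModel


class ProcessingJob(BaseModel):
    processor_name: str
    workspace_id: Optional[str] = None
    path_to_mets: Optional[str] = None
    mets_server_url: Optional[str] = None  # new: METS Server as a job parameter
    input_file_grps: List[str]
    output_file_grps: List[str]
    parameters: Dict = {}


app = FastAPI()

# hypothetical registry of running METS Servers: workspace_id -> URL
RUNNING_METS_SERVERS: Dict[str, str] = {}


@app.post("/mets_server/{workspace_id}")
def start_mets_server(workspace_id: str):
    # a real implementation would deploy a METS Server process here
    url = RUNNING_METS_SERVERS.setdefault(
        workspace_id, f"http://localhost:8123/{workspace_id}")
    return {"workspace_id": workspace_id, "mets_server_url": url}


@app.delete("/mets_server/{workspace_id}")
def stop_mets_server(workspace_id: str):
    # a real implementation would tear the METS Server process down here
    RUNNING_METS_SERVERS.pop(workspace_id, None)
    return {"workspace_id": workspace_id, "stopped": True}
```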

@MehmedGIT:

Definitely not ideal. Plus the fact that we would need to manage these queues (set up + tear down) with the workflow lifetime.

Exactly. Simply pushing/pulling everything through a queue is not good.

I'd rather have this separate and independent, as started in #966.

Agree.

Still, we should anticipate the changes required here (in the Processing Server, after this is merged and METS Server gets finished)

Sure. ocrd_network will continue to be developed. The reference WebAPI implementation will be transferred to ocrd_network ASAP once this PR is merged.

The METS Server should be another parameter of the job (next to the workspace id/path), and (unless the Workflow Server can do this by itself) there would have to be an endpoint for starting them up and tearing them down.

I think the Deployer agent should be separated from the Processing Server in the OCR-D System Architecture. On an implementation level, it is already separated. Makes the code easier to follow. The Workflow Server can then start/stop the METS Servers for each workspace through the Deployer.
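
A sketch of what such a Deployer-managed METS Server lifecycle could look like (the class, method names, and launch command are hypothetical):

```python
# Sketch: a Deployer agent that keeps one METS Server process per workspace,
# so a Workflow/Processing Server can start it before a workflow runs and
# stop it afterwards.
import signal
import subprocess
from typing import Dict


class Deployer:
    def __init__(self):
        self._mets_servers: Dict[str, subprocess.Popen] = {}

    def start_mets_server(self, workspace_dir: str, socket_path: str) -> None:
        if workspace_dir in self._mets_servers:
            return  # already running for this workspace
        # hypothetical command line; the real entry point is up to #966
        self._mets_servers[workspace_dir] = subprocess.Popen(
            ["python", "-m", "mets_server",
             "--workspace", workspace_dir, "--socket", socket_path])

    def stop_mets_server(self, workspace_dir: str) -> None:
        proc = self._mets_servers.pop(workspace_dir, None)
        if proc is not None:
            proc.send_signal(signal.SIGTERM)
            proc.wait(timeout=10)
```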

For the HPC environment, the batch script used to trigger the Nextflow workflow can be used to start the METS server before triggering the workflow, and stop it after the workflow finishes.

I think the Deployer agent should be separated from the Processing Server in the OCR-D System Architecture. On an implementation level, it is already separated. Makes the code easier to follow. The Workflow Server can then start/stop the METS Servers for each workspace through the Deployer.

Yes, makes sense!

bertsky commented 1 year ago

I think the Deployer agent should be separated from the Processing Server in the OCR-D System Architecture. On an implementation level, it is already separated. Makes the code easier to follow.

The deployer can now skip deployment of the database and queue, but not of the workers themselves, and we do not have deployment endpoints yet.

The Workflow Server can then start/stop the METS Servers for each workspace through the Deployer.

For the HPC environment, the batch script used to trigger the Nextflow workflow can be used to start the METS Server before triggering the workflow, and stop it after the workflow finishes.

Both use-cases are valid for external control of the METS Server lifetime.

But also for the Processing Server itself, internally, if processing requests are split up into single pages or subranges of pages, spawning a METS Server temporarily (if not already running) would make sense.
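
A sketch of that internal use, reusing the hypothetical Deployer above (the page-range splitting and the `queue_sub_job` callable are illustrative only):

```python
# Sketch: split a processing request into page sub-ranges that share one
# temporarily spawned METS Server for the workspace.
def split_into_page_ranges(page_ids, chunk_size):
    """Yield consecutive chunks of page ids."""
    for i in range(0, len(page_ids), chunk_size):
        yield page_ids[i:i + chunk_size]


def submit_split_job(deployer, queue_sub_job, workspace_dir, page_ids, chunk_size=8):
    # spawn the METS Server for this workspace only if it is not already running
    deployer.start_mets_server(workspace_dir, socket_path=f"{workspace_dir}/mets.sock")
    sub_jobs = []
    for chunk in split_into_page_ranges(page_ids, chunk_size):
        # each sub-job carries the workspace path plus the shared METS Server address
        sub_jobs.append(queue_sub_job(workspace_dir, page_range=chunk))
    return sub_jobs
```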

Using the METS Server as a synchronization mechanism could also be an option to implement #1046 – at least, these goals are related.