OCR-D / zenhub

Repo for developing zenhub integration
Apache License 2.0

Message structure #139

Open tdoan2010 opened 1 year ago

tdoan2010 commented 1 year ago

We need to come up with a schema for the messages that are submitted to the queue. There might be different types of messages, e.g. processing, result, error, logging, etc.

Use ocrd_tool.schema.yml as a reference when writing this schema.

MehmedGIT commented 1 year ago

Processing request message example:

# Generated fields
processor-name: "ocrd-.*" # e.g., "ocrd-cis-ocropy-binarize", not necessary for processing, but might be useful for human
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd", generated by the Processing Broker

# Mandatory fields - provided by the User/Workflow Server to the Processing Broker
input_file_grp: "OCR-D-.*" # e.g., "OCR-D-DEFAULT"
workspace: # either of the two fields must be available
  id: "uuid" # e.g., "4a5795d6-136c-40b7-8eed-eeca8ff1c249", generated by the Workspace Server
  path_to_mets: "absolute path" # e.g., "/workspaces/workspace1/mets.xml", locally available on the Processing Worker

# Optional fields - provided by the User/Workflow Server to the Processing Broker
output_file_grp: "OCR-D-.*" # e.g., "OCR-D-BIN"
page_id: "id" # e.g., "PHYS_0005..PHYS_0010" will process only pages between 5-10
overwrite: boolean
parameter:
  - key_1: value
  - ..
  - key_N: value
log_level: "level" # e.g., "INFO"
# "save the resource usage of the processor to a file in the root directory"
save_profiling_to: "path" # e.g., "./logs/resource-usage.txt"

# If this field is not provided, do not push results back to the message queue
result_queue: "name" # e.g., "ocrd-cis-ocropy-binarize-result"

Result response message example:

# Mandatory fields - reused from the processing request message or generated by the Processing Worker
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"
status: "value" # e.g., completed/failed

# Optional fields - available only if provided in the processing request message
path_to_mets: "absolute path"

EDIT: This post gets updated and contains the latest version of the message drafts.
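
For illustration, a minimal sketch of how a Processing Worker might consume such a processing message and, when result_queue is set, push a result message back. It assumes RabbitMQ accessed through the pika library and YAML message bodies; the queue name and the run_processor() helper are hypothetical.

import pika
import yaml

def run_processor(message):
    # Hypothetical stand-in for invoking the actual OCR-D processor
    return "completed"

def on_message(channel, method, properties, body):
    message = yaml.safe_load(body)    # parse the processing request
    status = run_processor(message)   # "completed" or "failed"
    result = {"job_id": message["job_id"], "status": status}
    # Push a result back only if the request named a result queue
    if "result_queue" in message:
        channel.basic_publish(exchange="",
                              routing_key=message["result_queue"],
                              body=yaml.safe_dump(result))
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="ocrd-cis-ocropy-binarize")
channel.basic_consume(queue="ocrd-cis-ocropy-binarize", on_message_callback=on_message)
channel.start_consuming()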

tdoan2010 commented 1 year ago

Sorry, I find it a bit difficult to read your draft. What syntax is it? Could you just provide one or a few message examples?

My first comment is: why do we need to include message_type in the message? Different types will be pushed to different queues. So for me, message_type is not necessary.

MehmedGIT commented 1 year ago

why do we need to include message_type in the message? Different types will be pushed to different queues. So for me, message_type is not necessary.

It just labels the two message types; it's not a field. Okay, let me edit that and replace the regex patterns with examples.

MehmedGIT commented 1 year ago

processing message type:

workspace_ref: &reference
ocrd_processor_name: ocrd-cis-ocropy-binarize
parameters:
  input_filegrp: OCR-D-MIN
  output_filegrp: OCR-D-BIN
  page_id: PHYS0010-PHYS0015
  other_parameters: {}
push_b: true
MQ_res: &queueName
retries: 3
job_id: 1024 # auto generated
timestamp: 20221118_153107 # auto generated
message_hash: {} # auto generated based on the fields above

result message type:

job_id: 1024
job_status: completed
return_code: 0
err_ref: &reference
log_ref: &reference

tdoan2010 commented 1 year ago

I would reuse the structure of the POST request to the processing endpoint of the Web API as much as possible. So, I suggest a message structure like this:

# This is not necessary for processing, but might be useful for human
processor_name: ocrd-cis-ocropy-binarize

path: /path/to/mets.xml
description: some text description here
input_file_grps: 
  - OCR-D-INPUT
output_file_grps:
  - OCR-D-OUTPUT
page_id: PHYS_001,PHYS_002
parameters:
  params_1: 1
  params_2: 2

# If this field is not provided, do not push results back to the message queue
result_queue_name: ocrd-cis-ocropy-binarize-result

# The time this message is created in Unix time
created_time: 1668782988590

The job_id should be the ID of the document in MongoDB. When a processing server gets a message, it should first create an entry in MongoDB (with status running) and use the ID of that entry as the job_id in the result message (if a result message is required).
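
For illustration, a minimal sketch of that flow, assuming pymongo; the connection URL and the "ocrd"/"jobs" database and collection names are hypothetical.

from pymongo import MongoClient

jobs = MongoClient("mongodb://localhost:27017")["ocrd"]["jobs"]

# On receiving a processing message: create the job entry with status
# "running"; its MongoDB ID becomes the job_id
job_id = jobs.insert_one({"processor_name": "ocrd-cis-ocropy-binarize",
                          "status": "running"}).inserted_id

# ... run the processor ...

# On completion: update the entry and reuse its ID in the result message
jobs.update_one({"_id": job_id}, {"$set": {"status": "completed"}})
result_message = {"job_id": str(job_id), "status": "completed"}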

MehmedGIT commented 1 year ago

  1. About the job_id: yes, that is what I meant by "(auto generated)". It will be a generated UUID, the same way we currently generate IDs for workspace and workflow uploads.
  2. Created time: yes, this is a more technical approach. I just wanted to produce a human-readable timestamp.
MehmedGIT commented 1 year ago

After rechecking the specs and reconsidering some things in more depth, here is an update that, hopefully, covers all possible use cases. I will edit my first post and pin the examples to the top if there are no current objections. @tdoan2010

Processing request message example:

# Generated fields - automatically generated by the Processing Broker
# This is not necessary for processing, but might be useful for human
processor-name: "ocrd-.*" # e.g., "ocrd-cis-ocropy-binarize"
job-id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"

# Mandatory fields - provided by the User/Workflow Server to the Processing Broker
workspace: 
  # Generated by the Workspace Server when a workspace is posted
  id: "uuid" # e.g., "4a5795d6-136c-40b7-8eed-eeca8ff1c249"
input-file-grp: "OCR-D-.*" # e.g., "OCR-D-DEFAULT"

# Optional fields - provided by the User/Workflow Server to the Processing Broker
output-file-grp: "OCR-D-.*" # e.g., "OCR-D-BIN"
page-id: "id" # e.g., "PHYS_0005..PHYS_0010" will process only pages between 5-10
overwrite: boolean
parameter:
  - key-1: value
  - ..
  - key-N: value
mets: "path" # e.g., "uuid/mets/mets.xml"
log-level: "level" # e.g., "INFO"
# "save the resource usage of the processor to a file in the root directory"
show-resource: "path" # e.g., "uuid/logs/resource-usage.txt"

# If this field is not provided, do not push results back to the message queue
result-queue: "name" # e.g., "ocrd-cis-ocropy-binarize-result"

Result response message example:

# Generated previously by the Processing Broker
job-id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"
  # Generated by the Processing Worker
  status: "value" # e.g., completed/failed
  return: "value" # e.g., 0/-1
  log-output: "output text"

Notes:

  1. The naming convention of the fields follows the OCR-D CLI spec. I would prefer the shortened forms where available, but for readability I used the long forms here. I left some CLI options out because they are related to the discovery endpoint of the WebAPI, i.e., -J, -L, and -h.
  2. I think it's better to use an id instead of a path for the workspace. The workspace id is unique and can also carry useful information. Currently, we generate a random UUID (type uuid4), so no information can be extracted from it. We could consider using another type, e.g., uuid1 (check the Python uuid documentation for details; a small sketch follows after this list). In that case, we would not need the created_time field, since the creation time can be extracted from the uuid1 itself.
  3. I would not include the description field of the workspace. If needed, this information could be extracted by the Processing Worker from the workspace (when imported from the NFS). Let's keep the message size as small/compact as possible.
  4. The root directory of each workspace inside the Processing Worker will be its id (a UUID). The mets path must point somewhere inside that root directory.
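
As a small sketch of point 2: with uuid1, the creation time can be recovered from the ID itself, so a separate created_time field becomes redundant. This assumes Python's standard uuid module.

import uuid
from datetime import datetime, timezone

workspace_id = uuid.uuid1()
# uuid1 embeds its creation time as 100-ns ticks since 1582-10-15;
# 0x01B21DD213814000 is the tick offset between that epoch and 1970-01-01
seconds = (workspace_id.time - 0x01B21DD213814000) / 1e7
print(datetime.fromtimestamp(seconds, tz=timezone.utc))
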
tdoan2010 commented 1 year ago

Some comments from my side:

  1. Please use underscores to separate words instead of dashes, e.g. job_id instead of job-id.
  2. Why do you have id nested under workspace? Can we just have one field called workspace_id?
  3. show-resource should be named save_profiling_to.
  4. I don't see the benefit of using workspace_id over path_to_mets. Imagine the case where users already have a lot of data stored on their hard drives. If workspace_id is required, then they must somehow register those workspaces to obtain their IDs. So maybe a new feature, register workspace, of the Workspace Server is needed. Plus, by using workspace_id, the Processing Server has to first resolve those IDs, which leads to additional requests.
  5. If you still want to use workspace_id, I would suggest supporting both: either
workspace:
  id: 1234
  path_to_mets: relative/path/to/mets.xml

or only path_to_mets with absolute path.

  6. I would not include log_output in the result message because it is too much. Simply write the log directly to the database.
  7. In the result message, I would include only job_id, path_to_mets, and status. And they should all be at the same level, no nesting.
MehmedGIT commented 1 year ago

  1. Please use underscores to separate words instead of dashes, e.g. job_id instead of job-id.

As long as we use the same convention everywhere, i.e. the dash, I am okay with that.

  2. Why do you have id nested under workspace? Can we just have one field called workspace_id?

I left it nested after removing the description field and assuming that we may put other fields there. workspace_id: "uuid" is also fine.

  3. show-resource should be named save_profiling_to.

Agree

  4. I don't see the benefit of using workspace_id over path_to_mets. Imagine the case where users already have a lot of data stored on their hard drives. If workspace_id is required, then they must somehow register those workspaces to obtain their IDs. So maybe a new feature, register workspace, of the Workspace Server is needed. Plus, by using workspace_id, the Processing Server has to first resolve those IDs, which leads to additional requests.

Good point. I had assumed that the workspaces would be stored on external storage. My conceptual understanding was that each Processing Worker would be a separate OCR-D processor. In that case, it did not make sense to think about the local-storage case, especially when the diagram points from the Processing Workers to a Network File System. After checking #137, I see that different kinds of OCR-D processors may be deployed on the same host.

  5. If you still want to use workspace_id, I would suggest supporting both: either workspace_id with a relative path_to_mets, or only path_to_mets with an absolute path.

I think supporting both and making just one of the fields mandatory is the right approach to staying flexible. By the way, there is a mets field as well; this field could be used for providing local paths. We would just have to make the workspace_id field optional as well. Let me know if that approach is better.

I think utilizing the workspace_id field makes more sense for better scaling when using external storage and providing the OCR-D processors as a service. I imagine a group of users who will just upload their workspaces and workflows and gather the results. In that case, we stay flexible about how and where to store the data without being tied to a single path. What is your opinion on that?

  6. I would not include log_output in the result message because it is too much. Simply write the log directly to the database.

Agree

  7. In the result message, I would include only job_id, path_to_mets, and status. And they should all be at the same level, no nesting.

Okay

I will edit my first post at the top with the suggested changes so far, so that it remains the most up-to-date version.

tdoan2010 commented 1 year ago

Good point. I had assumed that the workspaces would be stored on external storage. My conceptual understanding was that each Processing Worker would be a separate OCR-D processor. In that case, it did not make sense to think about the local-storage case, especially when the diagram points from the Processing Workers to a Network File System. After checking https://github.com/OCR-D/zenhub/issues/137, I see that different kinds of OCR-D processors may be deployed on the same host.

External storage can just be mounted. That's why there is NFS in the picture. The idea is: all processing servers share the same storage, either via local mount or network mount.

I think supporting both and making just one of the fields mandatory is the right approach to staying flexible with our options. Btw, there is a mets field as well. This field could be used for providing local paths. We just have to make the workspace_id field optional as well. Let me know if that approach is better?

Yes, either workspace or mets must be present, not both, not none. This approach will also work when we have OCR-as-a-Service, in case users need to upload (or pull) data into the system.
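
For illustration, the "not both, not none" rule as a minimal Python check (a hypothetical helper; in the eventual JSON schema this corresponds to a oneOf):

def validate_workspace_reference(message: dict) -> None:
    has_id = "workspace_id" in message
    has_path = "path_to_mets" in message
    if has_id == has_path:  # both present, or both missing
        raise ValueError("provide exactly one of workspace_id / path_to_mets")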

PS: in the future, please do not completely edit your messages as you did, but instead post new ones. It makes it difficult for people (myself included) to follow the discussion when old messages are edited.

MehmedGIT commented 1 year ago

The idea is: all processing servers share the same storage, either via local mount or network mount.

Okay, now I clearly see why providing the path is a cleaner approach when the storage is shared among the processing workers.

PS: in the future, please do not completely edit your messages as you did, but instead post new ones. It makes it difficult for people (myself included) to follow the discussion when old messages are edited.

The first draft was compact but super complicated for others to understand. I preferred to get rid of that.

tdoan2010 commented 1 year ago

I do not agree with your current suggestion yet. Some points from my side:

  1. Nesting id under workspace is cumbersome. We just need a field called workspace_id, and it's mutually exclusive with path_to_mets.
  2. log_level is defined when we start a Processing Server, not for every run, right? Can we change the log level of a running processor?
  3. Same with save_profiling_to. The profiling option must be provided at the time we start a processor, and we cannot change it while the processor is running, right?
  4. input-file-grp and output-file-grp must be lists, not single values.
  5. We cannot change the overwrite value of a running processor, right?

So, my suggested example:

job_id: uuid
processor_name: ocrd-cis-ocropy-binarize

path_to_mets: /path/to/mets.xml
input_file_grps:
  - OCR-D-INPUT
output_file_grps:
  - OCR-D-OUTPUT
page_id: PHYS_001,PHYS_002
parameters:
  params_1: 1
  params_2: 2

# If this field is not provided, do not push results back to the message queue
result_queue_name: ocrd-cis-ocropy-binarize-result

# The time this message is created in Unix time
created_time: 1668782988590

Except for job_id, processor_name, and created_time, the rest comes from the body of the POST /processor/{executable} request.
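
For reference, the created_time value above is Unix time in milliseconds, which in Python would simply be:

import time

created_time = int(time.time() * 1000)  # milliseconds, e.g. 1668782988590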

MehmedGIT commented 1 year ago

  1. Nesting id under workspace is cumbersome. We just need a field called workspace_id, and it's mutually exclusive with path_to_mets.

Agree.

  2. log_level is defined when we start a Processing Server, not for every run, right? Can we change the log level of a running processor?

Having more flexibility may help here: we could then change the log level without having to restart the processor. Currently, I think it's not possible to change the log level of a running processor. However, this may be achievable if the log level is defined via an environment variable. We can drop that field for now.

  3. Same with save_profiling_to.

Yeah, the same as in 2. We could drop that too.

  4. input-file-grp and output-file-grp must be lists, not single values.

I went with a single value to avoid having to parse the input/output. I could change that to a list, but then extra work has to be done to adapt the input/output to the processor parameters for each processing request. EDIT: extra work = extra processing time.

  5. We cannot change the overwrite value of a running processor, right?

Yes, same as in 2 and 3.

tdoan2010 commented 1 year ago

For input-file-grp and output-file-grp, please keep them as lists. "Extra work" is just one line of Python code to convert a list into a comma-separated string :)
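
For illustration, that one-liner (using the list field name from the examples above):

input_file_grps = ["OCR-D-INPUT"]           # the list form from the message
input_file_grp = ",".join(input_file_grps)  # -> "OCR-D-INPUT", CLI-ready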

MehmedGIT commented 1 year ago

Okay, here is the newer version with the proposed changes.

Processing request message example:

# Generated fields
processor-name: "ocrd-.*" # e.g., "ocrd-cis-ocropy-binarize", not necessary for processing, but might be useful for human
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd", generated by the Processing Broker

# Mandatory fields - provided by the User/Workflow Server to the Processing Broker
input_file_grp: 
  - "OCR-D-.*" # e.g., "OCR-D-DEFAULT"

# either of the two fields below must be available
workspace_id: "uuid" # e.g., "4a5795d6-136c-40b7-8eed-eeca8ff1c249", generated by the Workspace Server
path_to_mets: "absolute path" # e.g., "/workspaces/workspace1/mets.xml", locally available on the Processing Worker

# Optional fields - provided by the User/Workflow Server to the Processing Broker
output_file_grp: 
  - "OCR-D-.*" # e.g., "OCR-D-BIN"
page_id: "id" # e.g., "PHYS_0005..PHYS_0010" will process only pages between 5-10
parameters:
  - params_1: "value"
  - ..
  - params_N: "value"

# If this field is not provided, do not push results back to the message queue
result_queue_name: "name" # e.g., "ocrd-cis-ocropy-binarize-result"

Result response message example:

# Mandatory fields - reused from the processing request message or generated by the Processing Worker
job_id: "uuid" # e.g., "fe869a65-ea1c-4be5-8053-a3b8e01beecd"
status: "value" # e.g., completed/failed

# Optional fields - available only if provided in the processing request message
path_to_mets: "absolute path"

MehmedGIT commented 1 year ago

For input-file-grp and output-file-grp, please keep them as lists. "Extra work" is just one line of Python code to convert a list into a comma-separated string :)

By "extra work" I meant the conversion that will potentially happen millions of times (depending on how many messages are there) when parsing the messages, not that it's extra work for us to implement.

tdoan2010 commented 1 year ago

@MehmedGIT I created two JSON schemas, one for processing messages and one for result messages. You can currently find them in https://github.com/OCR-D/spec/pull/222. Please check them and try using them in your program to validate processing messages (a small validation sketch follows after the notes below). Some notes from me:

  1. Only processing messages need to be validated. Result messages are created by us, so validation is redundant.
  2. The schemas reflect the examples you posted above, with some additional points:
    1. created_time is required in processing messages.
    2. Either path_to_mets or workspace_id must be present in result messages.
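
For illustration, a minimal validation sketch for note 1, assuming the jsonschema and PyYAML libraries; the schema file name is hypothetical:

import yaml
from jsonschema import ValidationError, validate

with open("message_processing.schema.yml") as f:  # hypothetical file name
    schema = yaml.safe_load(f)

def validate_processing_message(body):
    """Parse a raw message body and validate it against the schema."""
    message = yaml.safe_load(body)
    try:
        validate(instance=message, schema=schema)
    except ValidationError as err:
        raise ValueError(f"invalid processing message: {err.message}")
    return message
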
MehmedGIT commented 1 year ago

I will have a look and implement them soon.