OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

nextflow spec questions #241

Open bertsky opened 1 year ago

bertsky commented 1 year ago

When we had the original discussions about a new workflow format to replace the de-facto standard ocrd process syntax in the core implementation, there was a general understanding that the spec must not fall short of the following features (met by the implementation):

  • declarative – workflows contain no program code (thus are easy to understand and maintain)
  • universal – workflows can be formulated independent of the installation details (e.g. paths)
  • pure – workflows can be formulated independent of the data (e.g. paths)
  • well-defined – workflows can be validated without actually running them

However, as it stands, the Workflow Format spec does not seem to meet these criteria. It raises questions …

  1. Why is it necessary to specify the venv (and even de/activate it) and the workspace/mets path in NF?
  2. Why are reads and outs formulated as absolute paths in NF (instead of just fileGrp names)?
  3. Where is the actual METS perspective (instead of just filesystem artifacts like output directories, which can be empty or incomplete or simply not reflected in the METS at all)? Shouldn't it be possible to formulate Channels and Processes in a way that progress gets reflected by the actual METS/PAGE results?
  4. Why do output fileGrps have to be explicitly (manually) named (instead of using NF's pipe operator)?
  5. How do you continue processing a workflow on a workspace after it has failed earlier or ran another workflow with some shared steps earlier (i.e. incremental processing)?
  6. How can you monitor job status and access job logs? (Is the NF call meant to be combined with -with-tower or -with-report arguments? If so, how does the caller get to know which job is which during runtime?)

I understand that you tried to apply Nextflow to the OCR-D CLI directly. But currently I don't see a benefit over running the shell scripts directly (from a custom Workflow executor in core).

MehmedGIT commented 1 year ago

@bertsky

  • declarative – workflows contain no program code (thus are easy to understand and maintain)

The Nextflow scripts uploaded by the user will not contain program code. Things that have to be solved with program code should be implemented separately and provided by core as modules so the user can just call them in the Nextflow script.
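To illustrate (just a sketch - the module file and process names here are made up, not an existing core API):

```nextflow
// main.nf -- a sketch only: the module file and process names are
// made up here, not an existing core API
include { binarize; recognize } from './ocrd_modules.nf'

workflow {
    // nothing but processor calls -- no program code
    bin = binarize('OCR-D-IMG', 'OCR-D-BIN')
    recognize(bin, 'OCR-D-OCR')
}
```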

  • universal – workflows can be formulated independent of the installation details (e.g. paths)

The Nextflow scripts will not have any path details inside. Required paths (i.e., mets path) will be passed as parameters to the Nextflow scripts. The user should just place a parameter placeholder in the script. The venv path can be omitted completely assuming that the path is available under PATH.
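For example (a sketch; the --mets parameter name is merely a convention here):

```nextflow
// sketch: invoked e.g. as
//   nextflow run workflow.nf --mets /data/workspace/mets.xml
params.mets = null   // placeholder only -- no path is hard-coded

workflow {
    // every processor call receives params.mets at runtime
    println "processing workspace METS: ${params.mets}"
}
```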

  • pure – workflows can be formulated independent of the data (e.g. paths)

Same as for universal.

  • well-defined – workflows can be validated without actually running them

The validation should be done based on what is called or used. I think there isn't a clear definition of what we should and should not allow inside Nextflow scripts, is there? The Nextflow script validator should be written based on what is allowed.

the Workflow Format spec does not seem to meet these criteria.

To which spec exactly are you referring?

  1. Why is it necessary to specify the venv (and even de/activate it) and the workspace/mets path in NF?

It is not - as clarified above.

  2. Why are reads and outs formulated as absolute paths in NF (instead of just fileGrp names)?

It is enough to use strings, i.e., val inside the input and output blocks of the Nextflow process.
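For example (a sketch; the processor call uses the usual OCR-D CLI flags):

```nextflow
params.mets = null   // workspace METS, passed at runtime

process binarize {
    input:
        val input_group    // e.g. 'OCR-D-IMG' -- just the fileGrp name
        val output_group   // e.g. 'OCR-D-BIN'
    output:
        val output_group   // hand the fileGrp name on to the next step
    script:
        """
        ocrd-cis-ocropy-binarize -m ${params.mets} -I ${input_group} -O ${output_group}
        """
}

workflow {
    binarize('OCR-D-IMG', 'OCR-D-BIN')
}
```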

  3. Where is the actual METS perspective (instead of just filesystem artifacts like output directories, which can be empty or incomplete or simply not reflected in the METS at all)? Shouldn't it be possible to formulate Channels and Processes in a way that progress gets reflected by the actual METS/PAGE results?

The handling of the METS is done by the OCR-D processors called inside the Nextflow script. The Nextflow scripts written so far perform no explicit checks based on the produced output files.

  4. Why do output fileGrps have to be explicitly (manually) named (instead of using NF's pipe operator)?

I think you are referring to a specific Nextflow script - could you share which? Then I can provide a better answer. EDIT: You were referring to the examples in the spec. We should update these examples because they are outdated.

  5. How do you continue processing a workflow on a workspace after it has failed earlier

With the -resume option, which is explained very well here.
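E.g.:

```sh
# re-running the same command with -resume reuses the cached results
# of all steps whose inputs have not changed
nextflow run workflow.nf --mets /data/workspace/mets.xml -resume
```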

... or ran another workflow with some shared steps earlier (i.e. incremental processing)?

Sharing previous steps should be handled in the Nextflow script itself. The main workflow can contain sub-workflows and decide which sub-workflow to execute based on some parameter.
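Sketched roughly (params.entry is a made-up switch; binarize/recognize are assumed to come from a core module):

```nextflow
// sketch: binarize/recognize are assumed to come from a (hypothetical)
// core module; params.entry is a made-up switch
include { binarize; recognize } from './ocrd_modules.nf'

workflow preprocessing {
    take: input_group
    main: binarize(input_group)
    emit: binarize.out
}

workflow recognition {
    take: input_group
    main: recognize(input_group)
}

workflow {
    if (params.entry == 'preprocessing')
        preprocessing('OCR-D-IMG')
    else
        recognition(preprocessing('OCR-D-IMG'))
}
```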

  6. How can you monitor job status and access job logs? (Is the NF call meant to be combined with -with-tower

I have not used Tower before because its free plan is limited. There is no other way to monitor job status than querying the logs. Every time a Nextflow script is executed, a cache folder, process logs, process folders, etc. are created.

or -with-report arguments?

This produces the final report, which contains useful data about the workflow run. It is generated from the artifacts mentioned in my answer above.
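For reference (the file names are just examples):

```sh
# execution report, per-task trace and timeline are built-in CLI switches
nextflow run workflow.nf --mets /data/workspace/mets.xml \
    -with-report report.html -with-trace trace.txt -with-timeline timeline.html
```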

If so, how does the caller get to know which job is which during runtime?)

Every process inside the workflow, and the workflow itself, has a unique ID. This is how Nextflow knows what and where to resume when the workflow has failed. To be more precise, it restarts the last failed process.

I understand that you tried to apply Nextflow to the OCR-D CLI directly. But currently I don't see a benefit over running the shell scripts directly (from a custom Workflow executor in core).

If you find it easier in shell scripts than in Nextflow to cache workflow steps, keep separate logs of everything, produce execution reports, program in the shell rather than in Groovy, limit resource usage for specific processes, handle multitasking, and manage the HPC-related interactions - then I don't see any other benefits either. You can still run your available shell scripts as separate processes in Nextflow, and use Nextflow as a higher-level entry point.

bertsky commented 1 year ago
  • declarative – workflows contain no program code (thus are easy to understand and maintain)

The Nextflow scripts uploaded by the user will not contain program code. Things that have to be solved with program code should be implemented separately and provided by core as modules so the user can just call them in the Nextflow script.

Understood. Nice!

(This should at least be mentioned in the current spec!)

  • universal – workflows can be formulated independent of the installation details (e.g. paths)

The Nextflow scripts will not have any path details inside. Required paths (i.e., mets path) will be passed as parameters to the Nextflow scripts. The user should just place a parameter placeholder in the script.

Ideally this parameter is passed by default in the included module scripts. The workflow definition should not require dealing with the METS path (even as a placeholder / variable).

The venv path can be omitted completely assuming that the path is available under PATH.

Ok, great.

  • well-defined – workflows can be validated without actually running them

The validation should be done based on what is called or used. I think there isn't a clear definition of what we should and should not allow inside Nextflow scripts, is there? The Nextflow script validator should be written based on what is allowed.

I don't know what's possible within NF, but we used to have task_sequence.validate_tasks, which checks not only OCR-D CLI syntax but also processor parameters and the chain logic (output fileGrps not already existing, input fileGrps already existing or created in an earlier step).

So if you understand the Workflow Format as "anything Nextflow allows", then indeed there is not much you can validate beforehand. But the original idea was to have a strict format (only processor calls in a "chain") that can be checked.

the Workflow Format spec does not seem to meet these criteria.

To which spec exactly are you referring?

To the current state of https://ocr-d.de/en/spec/nextflow

3. Where is the actual METS perspective (instead of just filesystem artifacts like output directories, which can be empty or incomplete or simply not reflected in the METS at all)? Shouldn't it be possible to formulate Channels and Processes in a way that progress gets reflected by the actual METS/PAGE results?

The handling of the METS is done by the OCR-D processors called inside the Nextflow script. The Nextflow scripts written so far perform no explicit checks based on the produced output files.

IIUC, at least for the CLI-based workflows currently implemented and in the spec, NF checks only the processor's exit status and its file artifacts. The file artifacts in this case are merely the existence of the fileGrp directories. So if the processor failed, or after a partial run on some pages, NF "thinks" the result for that step is already complete (because it has no notion of the actual METS fileGrps).

4. Why do output fileGrps have to be explicitly (manually) named (instead of using NF's pipe operator)?

I think you are referring to a specific Nextflow script, could you share which? Then I can provide a better answer.

I'm talking of the example given in the spec, and also used in the Quiver workflows. It's tedious to give each step a new name.

In contrast, examples in the NF documentation often use the | (pipe) operator.
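For example (the first line is straight from the docs' style; the commented line shows how processor steps could chain, assuming each process had exactly one input and one output channel):

```nextflow
workflow {
    // operator piping, as used in the NF docs:
    Channel.of(1, 2, 3) | map { it * 2 } | view

    // with single-input/single-output processes, steps could chain the
    // same way, e.g.: Channel.of('OCR-D-IMG') | binarize | segment | recognize
}
```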

5. How do you continue processing a workflow on a workspace after it has failed earlier

With the -resume option, which is explained very well here.

Ah, understood. Sounds great, but see – this is why it's problematic that NF does not get to know anything about the METS.

... or ran another workflow with some shared steps earlier (i.e. incremental processing)?

Sharing previous steps should be handled in the Nextflow script itself. The main workflow can contain sub-workflows and decide which sub-workflow to execute based on some parameter.

I know this is possible, it's nice to have, but in general I disagree:

Look at it from the user's perspective: They can see workspaces (with some fileGrps) and workflows (with some fileGrps). Naturally, when they send processing jobs, they assume that existing fileGrps are skipped (and expect some error handling in case there's an actual conflict, i.e. same names but different processors/parameters).

There is no other way to monitor job status than querying the logs. Every time a Nextflow script is executed, a cache folder, process logs, process folders, etc. are created.

I see.

If so, how does the caller get to know which job is which during runtime?)

Every process inside the workflow, and the workflow itself, has a unique ID. This is how Nextflow knows what and where to resume when the workflow has failed. To be more precise, it restarts the last failed process.

By caller I did not mean NF, but the caller of NF. Sounds like everything is a black box.

MehmedGIT commented 1 year ago

(This should at least be mentioned in the current spec!)

The spec and the examples inside are greatly outdated - so I understand now why the questions above (numbered 1 to 6) were raised.

Ideally this parameter is passed by default in the included module scripts. The workflow definition should not require dealing with the METS path (even as a placeholder / variable).

Understood.

So if you understand the Workflow Format as "anything Nextflow allows", then indeed there is not much you can validate beforehand. But the original idea was to have a strict format (only processor calls in a "chain") that can be checked.

Yes, that was the original idea - to have a strict format that is then translated into a Nextflow script. That's the reason the workflow format description was started in #208. If you follow the discussions there, that idea was dropped because we agreed on Nextflow scripts (.nf) without a strict format.

I'm talking of the example given in the spec, and also used in the Quiver workflows. It's tedious to give each step a new name.

I agree that process names should be simplified to just step# instead of trying to find names for them.

In contrast, examples in the NF documentation often use the | (pipe) operator.

Yes, but the examples in the NF documentation are rather limited and deal just with strings, integers, and single channels... When the result of a process has to be passed to the next process, it's hardly possible to use pipes without making the script hard to read and understand. Also, the input/output dependency of processes defines their execution order. In order to chain processes successfully, something - even if it's just a dummy variable - has to be passed from the output of the previous process to the input of the next one, to prevent them from running in parallel.
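I.e., something like this (a sketch: only the fileGrp name is passed between processes, while the actual data flows through the workspace on the side):

```nextflow
params.mets = null   // workspace METS, passed at runtime

process binarize {
    input:  val input_group
    output: val output_group
    script:
        output_group = 'OCR-D-BIN'
        """
        ocrd-cis-ocropy-binarize -m ${params.mets} -I ${input_group} -O ${output_group}
        """
}

process segment {
    input:  val input_group
    output: val output_group
    script:
        output_group = 'OCR-D-SEG'
        """
        ocrd-tesserocr-segment -m ${params.mets} -I ${input_group} -O ${output_group}
        """
}

workflow {
    // segment consumes binarize's output value, so it cannot start before
    // binarize has finished -- the real data flow happens through the
    // workspace on the side
    binarize('OCR-D-IMG') | segment
}
```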

Ah, understood. Sounds great, but see – this is why it's problematic that NF does not get to know anything about the METS.

Yes, I see that it's problematic that NF knows nothing about the content of the METS. However, searching inside the METS file or checking paths for existing files feels too low-level to be dealt with in the NF script itself. Ideally, inside the workflow description, there should be API calls for doing the mentioned steps. NF will then manage those steps.

Look at it from the user's perspective: They can see workspaces (with some fileGrps) and workflows (with some fileGrps). Naturally, when they send processing jobs, they assume that existing fileGrps are skipped (and expect some error handling in case there's an actual conflict, i.e. same names but different processors/parameters).

This again goes in the direction of my comment just above this one. The problem I have here is that there isn't a clear definition of what the potential errors are, how to handle those errors, and what users expect from error handling, in order to address them appropriately.

By caller I did not mean NF, but the caller of NF. Sounds like everything is a black box.

That's the workflow server then - right. To continue execution of a failed workflow, the workflow server should keep track of the Nextflow script name and potentially the unique ID of that workflow run. Robustness and rerunning of workflows were not considered thoroughly enough.

bertsky commented 1 year ago

BTW, the NF resume docs state that:

Note that you should avoid launching two (or more) Nextflow instances in the same directory concurrently.

IIUC this means that we must effectively manage NF working directories explicitly (via temporary directories created and removed by the workflow server).

Yes, I see that it's problematic that NF knows nothing about the content of the METS. However, searching inside the METS file or checking paths for existing files feels too low-level to be dealt with in the NF script itself. Ideally, inside the workflow description, there should be API calls for doing the mentioned steps. NF will then manage those steps.

Yes, that sounds prudent. NF just needs to know whether a step succeeded (whatever that step is) or should be repeated. No file artifacts, ideally (so no path outputs in process blocks). To NF, everything is a side effect, both the output files and the METS update (which itself may not even be on the filesystem until resynchronization). Nothing on the lower levels should ever need to be rolled back, so for NF it's all or nothing. (And nothing means NF error handling, i.e. your errorStrategy / maxRetries recipe.)
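As process directives, that recipe would look roughly like this (a sketch):

```nextflow
params.mets = null

process binarize {
    // retry a failed processor call a bounded number of times, then fail
    // the whole run -- all or nothing from NF's point of view
    errorStrategy 'retry'
    maxRetries 3

    input:  val input_group
    output: val output_group
    script:
        output_group = 'OCR-D-BIN'
        """
        ocrd-cis-ocropy-binarize -m ${params.mets} -I ${input_group} -O ${output_group}
        """
}

workflow {
    binarize('OCR-D-IMG')
}
```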

The problem I have here is that there isn't a clear definition of what the potential errors are, how to handle those errors, and what users expect from error handling, in order to address them appropriately.

Indeed. And that needs to be discussed for the lower levels first and foremost, i.e. on the Python API: Processor.process calls for now, or Processor.process_page calls soon. If we get fallback/skip/raise on page level right, then we can stack up complementary error handling for groups of pages on the processor level and ultimately on the Web API and workflow level.

By caller I did not mean NF, but the caller of NF. Sounds like everything is a black box.

That's the workflow server then - right. To continue execution of a failed workflow, the workflow server should keep track of the Nextflow script name and potentially the unique ID of that workflow run. Robustness and rerunning of workflows were not considered thoroughly enough.

We could use NF workflow introspection in our module scripts to talk to the outside consumers (i.e. workflow server or processing server or whoever called NF). For example, update the MongoDB with the workDir and scriptId or sessionId. Or even use onComplete and onError to set outside state accordingly.
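Roughly (the notify() helper is hypothetical - a stand-in for a MongoDB update or an HTTP callback to the workflow server):

```nextflow
// the notify() helper is hypothetical -- a stand-in for a MongoDB
// update or an HTTP callback to the workflow server
def notify(state) {
    println "session ${workflow.sessionId} in ${workflow.workDir}: ${state}"
}

workflow {
    println 'running...'   // processor calls as usual
}

workflow.onComplete {
    notify(workflow.success ? 'SUCCESS' : 'FAILED')
}

workflow.onError {
    notify("ERROR: ${workflow.errorMessage}")
}
```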

But somehow it must be possible to manage and monitor NF jobs from outside. See https://github.com/SciDAS/nextflow-api

MehmedGIT commented 1 year ago

Note that you should avoid launching two (or more) Nextflow instances in the same directory concurrently.

IIUC this means that we must effectively manage NF working directories explicitly (via temporary directories created and removed by the workflow server).

Not really. I think the documentation refers to cases where two scripts in the same directory are used to start workflow jobs concurrently with each other.

Unless some unusual parallelization needs to be achieved (I currently cannot think of such a case), this is not a problem from the Workflow Server's perspective. In the current state of the WebAPI implementation, each Nextflow script uploaded through the Workflow Server is stored under a separate directory named after the workflow-id, i.e., the UUID of that workflow. Inside that directory, a new directory is then created for each triggered workflow job of that script, again based on a UUID. So, in the end, two or more workflow jobs can run independently of each other. Note that Nextflow already creates separate working directories for each process, and there is no overlap between them on the filesystem.

NF just needs to know whether a step succeeded (whatever that step is) or should be repeated. No file artifacts, ideally (so no path outputs in process blocks). To NF, everything is a side effect, both the output files and the METS update (which itself may not even be on the filesystem until resynchronization). Nothing on the lower levels should ever need to be rolled back, so for NF it's all or nothing. (And nothing means NF error handling, i.e. your errorStrategy / maxRetries recipe.)

Yes, exactly.

Indeed. And that needs to be discussed for the lower levels first and foremost, i.e. on the Python API: Processor.process calls for now, or Processor.process_page calls soon. If we get fallback/skip/raise on page level right, then we can stack up complementary error handling for groups of pages on the processor level and ultimately on the Web API and workflow level.

Yes. Nextflow itself cannot magically fix things unless it knows what options are coming from the lower levels. So error handling goes from the low level to the high level; I see 5 different levels: OCR-D processor -> Processing Worker -> Processing Server -> Nextflow Manager -> Workflow Server (i.e., the reverse of the call order).

We could use NF workflow introspection in our module scripts to talk to the outside consumers (i.e. workflow server or processing server or whoever called NF).

Or even use onComplete and onError to set outside state accordingly.

Sure, but we should also be aware of potential problems related to the handlers. I remember that at some point I ran into the described issue as well. That's why I would rather use a module and report from inside the process right before it finishes - just to be on the safe side.

For example, update the MongoDB with the workDir and scriptId or sessionId

That's a good idea - to have a look at what useful runtime metadata we could additionally store in the DB. Although the workDir and the sessionId are redundant, since a UUID is already stored for these in the DB under the same job-id. EDIT: sessionId is indeed important for resuming failed jobs - it must be stored in the DB. Although I think that since each workflow job has its own separate folder, resuming jobs shouldn't be a problem when the cwd is that specific workflow job folder.
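E.g. (the UUID below is just an example value):

```sh
# a stored session ID lets the server resume one specific earlier run
# (the UUID below is just an example value)
nextflow run workflow.nf --mets /data/workspace/mets.xml \
    -resume 4dc656d2-c410-44c8-bc32-7dd0ea87bebf
```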