AllenNeuralDynamics / aind-watchdog-service

Data staging service that prepares acquistion data for cloud upload to Amazon S3 and Code Ocean processing.
https://allenneuraldynamics.github.io/aind-watchdog-service/
MIT License
2 stars 1 forks source link

subprocess is not returning error stack for the key copying code #83

Open jeromelecoq opened 2 weeks ago

jeromelecoq commented 2 weeks ago

Is your feature request related to a problem? Please describe. Currently we have silent failures associated with watchdog. This is because the slow and key part of the process is handled by a subprocess which essentially isolate the main program from the potentially failing subprocess. This is visible here : https://github.com/AllenNeuralDynamics/aind-watchdog-service/blob/4020d4406149e7a06cd77ef4cf04f364f341edce/src/aind_watchdog_service/run_job.py#L117

Describe the solution you'd like I propose to forward up the errors so that we know what is going on.

import subprocess

def run_subprocess(self, cmd: list) -> subprocess.CompletedProcess:
    """subprocess run command with error checking"""
    try:
        subproc = subprocess.run(
            cmd, check=True, stderr=subprocess.PIPE, stdout=subprocess.PIPE
        )
        return subproc
    except subprocess.CalledProcessError as e:
        # This will print a Python stack trace if the subprocess fails
        print(f"Subprocess failed with return code {e.returncode}")
        print(f"Error output: {e.stderr.decode()}")
        raise  # Re-raise the exception to show the full stack trace

Describe alternatives you've considered Removing the subprocess entirely is an alternative solution. Do we really need it? It seems like shutil.copytree is actually faster so far in test implemented by Marton Rosza.

arielleleon commented 2 weeks ago

This is a good idea. We will prioritize to get this in

rhytnen commented 2 weeks ago

@jeromelecoq

I don't think Robocopy / subprocessing here is such a huge concern.

For context, when lims scheduler implemented this, it was because shutil used to be incredibly slow. While shutil is not incredibly slow anymore, robocopy is still providing certain features that should not be overlooked.

  1. Robocopy provides retries and this makes it robust in the face of network failures.
  2. Robocopy knows when NOT to copy something or when to only copy part of it.
  3. Finally, subprocess does enable another kind of feature which is allows the service to continue doing other things without using threads.

Robocopy does have certain failing though.

  1. it is windows only
  2. its return codes are opaque and sometimes not very useful
  3. its output is a text based table in the console and so not really useful in logging without massaging the text.