flatironinstitute / dendro-old

Analyze neuroscience data in the cloud
https://flatironinstitute.github.io/dendro-docs/
Apache License 2.0

Log Jobs activity from the Controller node #119

Open luiztauffer opened 10 months ago

luiztauffer commented 10 months ago

It would be useful to log job activity from the controller node. This would help us catch errors that happen before the Dendro processes start running inside the job.

For example, I just had a Batch job fail right before starting because of an authorization issue when pulling the Docker image. The Dendro logs are empty, but I can find the source of the error using boto3:

batch_client.describe_jobs(jobs=['4f9aa92c-966a-48f0-894f-444c9859cdb3'])

'status': 'FAILED',
'attempts': [{'container': {'containerInstanceArn': 'xxxxxxxxxxxxxxxxxxx',
    'taskArn': 'xxxxxxxxxxxxxxxxxxxxxxxxx',
    'reason': 'CannotPullContainerError: Error response from daemon: Head "https://ghcr.io/v2/catalystneuro/dendro_si_kilosort25/manifests/latest": unauthorized',
  },

magland commented 10 months ago

So I think we'll need to store the AWS Batch job ID in the Dendro job record.
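
To make the idea concrete, here is a minimal sketch of how the controller could look up failure reasons once the Batch job ID is stored. It is not Dendro's actual implementation: the record fields `job_id` and `batch_job_id` and both function names are hypothetical. The only AWS call used is `describe_jobs`, as in the example above, and the response fields read here (`status`, `statusReason`, `attempts[].container.reason`) follow the shape of that output.

```python
from typing import Any, Dict, List


def failure_reasons(batch_job: Dict[str, Any]) -> List[str]:
    """Collect failure reasons from one entry of describe_jobs()['jobs'].

    Returns an empty list if the job is not in the FAILED state.
    """
    if batch_job.get("status") != "FAILED":
        return []
    reasons: List[str] = []
    candidates = [batch_job.get("statusReason")]
    for attempt in batch_job.get("attempts", []):
        candidates.append(attempt.get("statusReason"))
        # Image pull errors such as CannotPullContainerError show up here.
        candidates.append(attempt.get("container", {}).get("reason"))
    for reason in candidates:
        if reason and reason not in reasons:
            reasons.append(reason)
    return reasons


def collect_batch_failures(batch_client: Any,
                           dendro_jobs: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map Dendro job IDs to the failure reasons reported by AWS Batch.

    `dendro_jobs` are hypothetical job records: dicts with a 'job_id' and,
    once stored, a 'batch_job_id'. Records without a Batch job ID are skipped.
    Assumption: the list is small enough for a single describe_jobs call.
    """
    by_batch_id = {j["batch_job_id"]: j for j in dendro_jobs if j.get("batch_job_id")}
    if not by_batch_id:
        return {}
    response = batch_client.describe_jobs(jobs=list(by_batch_id))
    failures: Dict[str, List[str]] = {}
    for batch_job in response.get("jobs", []):
        reasons = failure_reasons(batch_job)
        record = by_batch_id.get(batch_job.get("jobId"))
        if reasons and record is not None:
            failures[record["job_id"]] = reasons
    return failures
```

On the controller node, `batch_client` would be `boto3.client("batch")`, and the returned reasons could be appended to the Dendro job's log or error field so they show up even when the container never started.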