AllenNeuralDynamics / dynamic-foraging-task

Bonsai/Harp workflow for Dynamic Foraging with Python GUI for visualization and control
MIT License

Implement aind-watchdog-service in 447 and 446 #477

Closed hanhou closed 3 weeks ago

hanhou commented 4 months ago

Steps

Troubleshooting / Notes

  1. Example watch_dog.yml:

    flag_dir: C:/Users/admin/Documents/aind_watchdog_test/test_manifest_file
    manifest_complete: C:/Users/admin/Documents/aind_watchdog_test/test_manifest_file/manifest_complete
    webhook_url: https://alleninstitute.webhook.office.com/webhookb2/4c3dc391-bf45-40d7-a583-73f4a51a5b7a@32669cd6-737f-4b39-8bdd-d6951120d3fc/IncomingWebhook/5ac123a0b3714c04a7eafc11af1bb110/86c7eec9-38b0-4c80-95af-c32da6835725
  2. Fields in manifest.yml

    • [x] capsule_id?
      • according to this:

        process_capsule_id: Optional Code Ocean capsule or pipeline to run when data is uploaded

      • I'm using null for now. Don't know if it will work.
    • [x] destination: \scratch or \stage? --> \scratch on VAST would be fine
    • [x] project_name --> according to this, I'm using "Behavior Platform"
    • [x] s3_bucket --> Literal["s3", "public", "private", "scratch"] (see code). Some hard-coded s3 buckets in aind-data-transfer-service. If it's Null, it will be private by default (see code).
  3. Error Could not trigger aind-data-transfer-service

    • [x] I dug into the error message of submit_job_response (status_code = 406):
      
        Error parsing {"user_email":null,"email_notification_types":null,"project_name":"Behavior Platform","process_capsule_id":null,"s3_bucket":"private","platform":{"name":"Behavior platform","abbreviation":"behavior"},"modalities":[{"modality":{"name":"Behavior","abbreviation":"behavior"},"source":"\\\\allen\\aind\\scratch\\dynamic_foraging_rig_transfer/behavior_727353_2024-06-03_09-01-34/behavior","compress_raw_data":false,"extra_configs":null,"slurm_settings":null,"output_folder_name":"behavior"},{"modality":{"name":"Behavior videos","abbreviation":"behavior-videos"},"source":"\\\\allen\\aind\\scratch\\dynamic_foraging_rig_transfer/behavior_727353_2024-06-03_09-01-34/behavior-videos","compress_raw_data":false,"extra_configs":null,"slurm_settings":null,"output_folder_name":"behavior-videos"},{"modality":{"name":"Fiber photometry","abbreviation":"fib"},"source":"\\\\allen\\aind\\scratch\\dynamic_foraging_rig_transfer/behavior_727353_2024-06-03_09-01-34/fib","compress_raw_data":false,"extra_configs":null,"slurm_settings":null,"output_folder_name":"fib"}],"subject_id":"727353","acq_datetime":"2024-02-12T09:14:43","metadata_dir":"\\\\allen\\aind\\scratch\\dynamic_foraging_rig_transfer/behavior_727353_2024-06-03_09-01-34","metadata_dir_force":false,"force_cloud_sync":false,"s3_prefix":"behavior_727353_2024-02-12_09-14-43"}: 6 validation errors for BasicUploadJobConfigs
        modalities.0.slurm_settings
          Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
            For further information visit https://errors.pydantic.dev/2.7/v/extra_forbidden
        modalities.0.output_folder_name
          Extra inputs are not permitted [type=extra_forbidden, input_value='behavior', input_type=str]
            For further information visit https://errors.pydantic.dev/2.7/v/extra_forbidden
        modalities.1.slurm_settings
          Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
            For further information visit https://errors.pydantic.dev/2.7/v/extra_forbidden
        modalities.1.output_folder_name
          Extra inputs are not permitted [type=extra_forbidden, input_value='behavior-videos', input_type=str]
            For further information visit https://errors.pydantic.dev/2.7/v/extra_forbidden
        modalities.2.slurm_settings
          Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
            For further information visit https://errors.pydantic.dev/2.7/v/extra_forbidden
        modalities.2.output_folder_name
          Extra inputs are not permitted [type=extra_forbidden, input_value='fib', input_type=str]
            For further information visit https://errors.pydantic.dev/2.7/v/extra_forbidden
    • [x] the bug is fixed here
  4. Upload service job fails

    • See discussion
    • Solution: all slashes in the yml file should be "/"
    • Issue submitted to the watchdog lib: https://github.com/AllenNeuralDynamics/aind-watchdog-service/issues/26
    • Example manifest.yml file that works:
      acquisition_datetime: 2024-06-21 13:54:00+00:00
      name: behavior_724555_2024-06-21_13-54-00
      platform: behavior
      subject_id: 724555
      capsule_id: null
      mount: null
      destination: //allen/aind/scratch/dynamic_foraging_rig_transfer
      s3_bucket: private
      processor_full_name: Han Hou
      modalities:
        behavior:
        - C:/Users/admin/Documents/aind_watchdog_test/test_behavior_file/behavior_724555_2024-06-21_13-54-00/behavior
        behavior-videos:
        - C:/Users/admin/Documents/aind_watchdog_test/test_behavior_file/behavior_724555_2024-06-21_13-54-00/behavior-videos
        fib:
        - C:/Users/admin/Documents/aind_watchdog_test/test_behavior_file/behavior_724555_2024-06-21_13-54-00/fib
      schemas:
        - C:/Users/admin/Documents/aind_watchdog_test/test_behavior_file/behavior_724555_2024-06-21_13-54-00/metadata-dir/session.json
        - C:/Users/admin/Documents/aind_watchdog_test/test_behavior_file/behavior_724555_2024-06-21_13-54-00/metadata-dir/rig.json
      schedule_time: null
      project_name: Behavior Platform
      script: {}
  5. Missing rig.json in the uploaded data asset image

    • Solution: The file name should be exactly rig.json, not rig_447-1-A_xxx.json!
    • After fixing the rig.json name, it appears in the data asset image
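
Items 4 and 5 above (forward slashes only, and a rig schema named exactly rig.json) could be checked before a manifest is dropped into the watch folder. A minimal sketch, assuming the manifest has already been loaded into a dict with the same keys as the example above; `check_manifest` is a hypothetical helper, not part of aind-watchdog-service:

```python
from pathlib import PurePosixPath


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems found in a manifest dict."""
    problems = []

    # Collect every path the manifest refers to.
    paths = [manifest.get("destination") or ""]
    for sources in (manifest.get("modalities") or {}).values():
        paths.extend(sources)
    schemas = manifest.get("schemas") or []
    paths.extend(schemas)

    # Item 4: backslashes in any path made the upload job fail.
    for path in paths:
        if "\\" in path:
            problems.append(f"backslash in path: {path}")

    # Item 5: the rig schema must be named exactly rig.json.
    for schema in schemas:
        name = PurePosixPath(schema.replace("\\", "/")).name
        if name.startswith("rig") and name != "rig.json":
            problems.append(f"rig schema should be named rig.json: {name}")

    return problems
```

An empty list means neither of the two known problems was found; it does not guarantee the upload will succeed.
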
alexpiet commented 4 months ago

Please document how this works once you implement it: https://github.com/AllenNeuralDynamics/dynamic-foraging-task/issues/200

hagikent commented 4 months ago

Also let me know when it's about to take effect. I will turn off the current transfer routine to avoid duplication!

alexpiet commented 4 months ago

I think the plan was to have redundant data transfer for some period of time (a week?) to give us time to check for issues.

hagikent commented 4 months ago

Sounds reasonable!

hanhou commented 3 months ago

The above issue 2 has been fixed by Arielle's latest development branch.

This is what I get so far:

  1. The watchdog now triggers upload service successfully (code). submit_job_response.status_code = 200, no error in the console, and a new job shows up in http://aind-data-transfer-service/jobs

image

  2. However, the job fails silently after a few seconds. (image) There is no error message in the console, and I even received a Job complete message in Teams. (image)

  3. I do see a new folder in aind-private-data-prod-o5171v, but the content is incomplete

image

For comparison, here is what was uploaded to VAST

image

There may still be some issues in our metadata that fail some validation steps during the transfer.

@jtyoung84 is there a way to check the error log of the data transfer jobs? Do you think this is likely due to metadata errors?

@arielleleon it would be great if the WatchDog could also help track the job status after it's submitted to transfer-service, so that the job doesn't silently fail after a "complete" message in Teams. Leveraging user_email in BasicUploadJobConfigs may be enough (but integrating to the Teams message would be ideal).
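
One way the status tracking requested above might look, sketched with a caller-supplied `fetch_status` function. How the transfer service actually reports job state is not described in this thread, so `fetch_status`, the state names, and `wait_for_job` are all hypothetical:

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a final state or the timeout expires.

    fetch_status is a placeholder for whatever call returns the job state;
    the state names "success" and "failed" are assumptions.
    """
    waited = 0.0
    while waited <= timeout_s:
        state = fetch_status()
        if state in ("success", "failed"):
            return state
        sleep(poll_s)
        waited += poll_s
    return "timeout"
```

With something like this in place, the "Job complete" message in Teams could be sent only after a final "success", and a "failed" or "timeout" result could be reported instead of passing silently.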

jtyoung84 commented 3 months ago

Hi Han,

I noticed some recent jobs failed. We're able to monitor the jobs through Airflow. We're working on exposing the logs publicly, but Arielle and I have accounts that allow us to monitor the backend. The issue with the previous jobs was that the source folders had a mixture of forward slashes and backslashes. I'm working on adding a step to verify that the source directories and the file contents exist (we're using symlinks, so this issue isn't currently caught until the data is uploaded). So the first jobs uploaded the metadata files to S3 but failed to upload the source directories because of the issue with the slashes, and then the newer jobs failed because the S3 folder was already created. @Arielle, maybe we can circle up later to see how to resolve this. I can manually upload using a CSV file or Python script if it's available, or I can manually modify the JSON.

Cheers,

Jon Young
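
The verification step described above could look roughly like the sketch below. It is not the check being added to aind-data-transfer-service; `find_bad_sources` is a hypothetical helper that only illustrates confirming that each source directory exists and contains at least one file before anything is uploaded:

```python
from pathlib import Path


def find_bad_sources(sources: list[str]) -> list[str]:
    """Return the source directories that are missing or contain no files."""
    bad = []
    for source in sources:
        directory = Path(source)
        # Path.is_dir() follows symlinks, so a broken link is reported too.
        if not directory.is_dir():
            bad.append(source)
            continue
        # rglob walks subfolders; any() stops at the first regular file.
        if not any(p.is_file() for p in directory.rglob("*")):
            bad.append(source)
    return bad
```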



hanhou commented 3 months ago

Hi @jtyoung84,

Thank you for your prompt reply!

I just tried another session with all forward slashes in the manifest.yml file and it works! All data are uploaded to S3 and the data asset appears in CO.

@arielleleon I will create an issue requesting better handling of slashes in manifest.yml.

And yes, it would be great if the logs were publicly accessible!

Thanks!

jtyoung84 commented 3 months ago

Hey Jon and Han -

It might make sense to have watchdog check and clean the file paths on upload. That appears to be something I should do. @Han, can you send me an example manifest file that watchdog used for upload?

Arielle Leon
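
A sketch of the kind of path cleaning suggested above, assuming that rewriting every backslash as a forward slash is sufficient (this matches the manifests that uploaded successfully in this thread). `clean_path` and `clean_manifest_paths` are hypothetical names, not functions in aind-watchdog-service:

```python
def clean_path(path: str) -> str:
    """Rewrite Windows-style separators as forward slashes."""
    return path.replace("\\", "/")


def clean_manifest_paths(manifest: dict) -> dict:
    """Return a copy of the manifest with every path using forward slashes."""
    cleaned = dict(manifest)
    if cleaned.get("destination"):
        cleaned["destination"] = clean_path(cleaned["destination"])
    cleaned["modalities"] = {
        modality: [clean_path(p) for p in paths]
        for modality, paths in (manifest.get("modalities") or {}).items()
    }
    cleaned["schemas"] = [clean_path(p) for p in manifest.get("schemas") or []]
    return cleaned
```

The input dict is left untouched, so the original manifest can still be logged alongside the cleaned one when debugging a failed upload.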


hanhou commented 3 months ago

@arielleleon I just submitted an issue https://github.com/AllenNeuralDynamics/aind-watchdog-service/issues/26 with example yml files. Thanks!

hanhou commented 3 months ago

After the FIP hackathon, the watchdog can now trigger this NWB packaging CO pipeline, which produces a data asset like this:

Image

The yml file for this data:

acquisition_datetime: 2024-05-31 09:23:24+00:00
name: behavior_713857_2024-05-31_09-23-24
platform: behavior
subject_id: 713857
capsule_id: c089614a-347e-4696-b17e-86980bb782c1
mount: FIP
destination: //allen/aind/scratch/dynamic_foraging_rig_transfer
s3_bucket: scratch
processor_full_name: Han Hou
modalities:
  behavior:
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/behavior
  behavior-videos:
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/behavior-videos
  fib:
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/fib
schemas:
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/metadata-dir/session.json
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/metadata-dir/rig.json
schedule_time: null
project_name: Behavior Platform
script: {}
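
In both working manifests in this thread, `name` is the platform, subject ID, and acquisition time joined as `<platform>_<subject_id>_<YYYY-MM-DD_HH-MM-SS>`. A small sketch of deriving it from the other fields so the three cannot drift apart; the helper name `session_name` is hypothetical, and the pattern is inferred from these two examples only:

```python
from datetime import datetime, timezone


def session_name(platform: str, subject_id: str, acquired: datetime) -> str:
    """Build a name such as behavior_713857_2024-05-31_09-23-24."""
    return f"{platform}_{subject_id}_{acquired:%Y-%m-%d_%H-%M-%S}"


acquired = datetime(2024, 5, 31, 9, 23, 24, tzinfo=timezone.utc)
print(session_name("behavior", "713857", acquired))
# behavior_713857_2024-05-31_09-23-24
```
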
hanhou commented 3 months ago

Remaining issue of the watchdog:

alexpiet commented 2 months ago

@arielleleon pointed out that our current manifests are not formatted correctly: https://github.com/AllenNeuralDynamics/dynamic-foraging-task/issues/622

alexpiet commented 2 months ago

@arielleleon installed the aind-watchdog-service on 8A yesterday. Here are some notes.

  1. Create the folder C:/ProgramData/aind/aind-watchdog-service.

  2. Move these files from ?? to the new folder:

    • aind-watchdog-service.exe
    • aind-watchdog-service.xml (for the task scheduler)
    • watch_config.yml (defines where to look for manifests and where to put completed manifests)

  3. Set up the scheduled task. Use a CLI command that starts the task from the xml file: schtasks /create /tn "GUI Automation\aind-watchdog-service" /XML "<path to .xml file>" /u svc_aind_behavior /s <ip address>

  4. Add an environment variable for the watch_config path.
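
The task-scheduler step above can also be scripted. A sketch that only assembles the `schtasks` arguments quoted in these notes and leaves running them to the caller; `build_schtasks_command` is a hypothetical helper, and the XML path and host are placeholders to fill in:

```python
def build_schtasks_command(xml_path: str, host: str) -> list[str]:
    """Return the schtasks argument list from the install notes."""
    return [
        "schtasks", "/create",
        "/tn", r"GUI Automation\aind-watchdog-service",
        "/XML", xml_path,
        "/u", "svc_aind_behavior",
        "/s", host,
    ]
```

On the rig, the returned list can be passed to `subprocess.run(..., check=True)` from an administrator prompt, so a failed registration raises an error instead of going unnoticed.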

alexpiet commented 2 months ago

@arielleleon built an easy deployment application. "If you want to deploy it on more systems, copy this batch file onto the desktop, right click and run as administrator. I like to double check that it's running in task manager (you will see the two corgi icons next to aind-watchdog-service.exe)"

"\\allen\aind\scratch\ariellel\aind-watchdog-service-deploy.bat"