Closed hanhou closed 3 weeks ago
Please document how this works once you implement it: https://github.com/AllenNeuralDynamics/dynamic-foraging-task/issues/200
Also let me know when it's about to take effect. I will turn off the current transfer routine to avoid duplication!
I think the plan was to have redundant data transfer for some period of time (a week?) to give us time to check for issues.
Sounds reasonable!
The above issue 2 has been fixed by Arielle's latest development branch (https://github.com/AllenNeuralDynamics/aind-watchdog-service/tree/development).
This is what I get so far: submit_job_response.status_code = 200, no error in the console, and a new job shows up in http://aind-data-transfer-service/jobs
However, the job fails silently after a few seconds. There is no error message in the console, and I even received a "Job complete" message in Teams.
I do see a new folder in aind-private-data-prod-o5171v, but the content is incomplete.
For comparison, here is what was uploaded to VAST
There may still be some issues in our metadata (https://github.com/AllenNeuralDynamics/aind-behavior-blog/discussions/408) that fail some validation steps during the transfer.
@jtyoung84 is there a way to check the error log of the data transfer jobs? Do you think this is likely due to metadata errors?
@arielleleon it would be great if the WatchDog could also help track the job status after it's submitted to transfer-service, so that the job doesn't silently fail after a "complete" message in Teams. Leveraging user_email in BasicUploadJobConfigs may be enough (but integrating it into the Teams message would be ideal).
Hi Han,
I noticed some recent jobs failed. We're able to monitor the jobs through Airflow. We're working on exposing the logs publicly, but Arielle and I have accounts that allow us to monitor the backend. The issue with the previous jobs was that the source folders had a mixture of forward slashes and backslashes. I'm working on adding a step to verify that the source directories and the file contents exist (we're using symlinks, so this issue isn't currently caught until the data is uploaded). So the first jobs uploaded the metadata files to S3 but failed to upload the source directories because of the issue with the slashes, and then the newer jobs failed because the S3 folder had already been created. @Arielle, maybe we can circle up later to see how to resolve this. I can manually upload using a CSV file or Python script if it's available, or I can manually modify the JSON.
Cheers,
Jon Young
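For illustration, the kind of pre-upload check described above (verifying that source directories exist and actually contain files, even when symlinks are involved) could look roughly like the sketch below. This is not the transfer service's implementation; the function names `check_source_dir` and `check_sources` are hypothetical, and it only uses the Python standard library.

```python
from pathlib import Path


def check_source_dir(source: str) -> list[str]:
    """Return a list of problems for one source directory (empty if none).

    Hypothetical helper, not part of aind-data-transfer-service.
    """
    path = Path(source)
    # is_dir() follows symlinks, so a broken link or a mistyped path fails here.
    if not path.is_dir():
        return [f"{source}: not an existing directory"]

    problems = []
    file_count = 0
    for entry in path.rglob("*"):
        if entry.is_symlink() and not entry.exists():
            # The link itself is present but its target is missing.
            problems.append(f"{entry}: broken symlink")
        elif entry.is_file():
            file_count += 1
    if file_count == 0:
        problems.append(f"{source}: directory contains no files")
    return problems


def check_sources(sources: list[str]) -> list[str]:
    """Collect problems across all source directories of a job."""
    problems = []
    for source in sources:
        problems.extend(check_source_dir(source))
    return problems
```

A job would only be submitted when `check_sources(...)` returns an empty list; otherwise the problems can be reported before anything is written to S3.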
Hi @jtyoung84,
Thank you for your prompt reply!
I just tried another session with all forward slashes in the manifest.yml file and it works! All data are uploaded to S3 and the data asset appears in CO.
@arielleleon I will create an issue requesting better handling of slashes in manifest.yml.
And yes, it would be great if the logs were publicly accessible!
Thanks!
Hey Jon and Han -
It might make sense to have watchdog check and clean the file paths on upload. That appears to be something I should do. @Han, can you send me an example manifest file that watchdog used for upload?
Arielle Leon
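A rough sketch of what "check and clean the file paths" could mean in practice: convert backslashes to forward slashes, collapse repeated separators, and keep the leading `//` of a network share. The helper names `normalize_path` and `clean_manifest_paths` are hypothetical and this is not watchdog's actual code; the manifest keys (`modalities`, `schemas`, `destination`) are taken from the example manifest in this thread.

```python
def normalize_path(raw: str) -> str:
    """Return the path with forward slashes only and no duplicate separators."""
    cleaned = raw.strip().replace("\\", "/")
    parts = [part for part in cleaned.split("/") if part]
    joined = "/".join(parts)
    if cleaned.startswith("//"):
        return "//" + joined  # network share such as //allen/...
    if cleaned.startswith("/"):
        return "/" + joined
    return joined


def clean_manifest_paths(manifest: dict) -> dict:
    """Return a copy of the manifest with every path field normalized."""
    cleaned = dict(manifest)
    cleaned["modalities"] = {
        modality: [normalize_path(p) for p in paths]
        for modality, paths in manifest.get("modalities", {}).items()
    }
    cleaned["schemas"] = [normalize_path(p) for p in manifest.get("schemas", [])]
    if "destination" in manifest:
        cleaned["destination"] = normalize_path(manifest["destination"])
    return cleaned
```

Normalizing on the watchdog side would mean a manifest written with mixed separators no longer depends on the rig software getting every path right.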
@arielleleon I just submitted an issue https://github.com/AllenNeuralDynamics/aind-watchdog-service/issues/26 with example yml files. Thanks!
After the FIP hackathon, the watchdog can now trigger this NWB packaging CO pipeline and produce a data asset like this:
The yml file for this data:
acquisition_datetime: 2024-05-31 09:23:24+00:00
name: behavior_713857_2024-05-31_09-23-24
platform: behavior
subject_id: 713857
capsule_id: c089614a-347e-4696-b17e-86980bb782c1
mount: FIP
destination: //allen/aind/scratch/dynamic_foraging_rig_transfer
s3_bucket: scratch
processor_full_name: Han Hou
modalities:
  behavior:
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/behavior
  behavior-videos:
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/behavior-videos
  fib:
  - //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/fib
schemas:
- //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/metadata-dir/session.json
- //allen/aind/scratch/svc_aind_behavior_transfer/447-2-D/713857/behavior_713857_2024-05-31_09-23-24/metadata-dir/rig.json
schedule_time: null
project_name: Behavior Platform
script: {}
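In the manifest above, the name field looks like it is built from platform, subject_id, and acquisition_datetime. That is only an inference from this one example, not a documented rule, but a quick consistency check along those lines might look like this. The function names are hypothetical, and the sketch assumes the manifest has already been loaded into a dict with acquisition_datetime as a `datetime` object.

```python
from datetime import datetime, timezone


def expected_name(platform: str, subject_id, acquired: datetime) -> str:
    """Build a name shaped like the example: <platform>_<subject>_<date>_<time>."""
    return f"{platform}_{subject_id}_{acquired:%Y-%m-%d_%H-%M-%S}"


def name_is_consistent(manifest: dict) -> bool:
    """True if the name matches the other fields (assumed convention)."""
    return manifest["name"] == expected_name(
        manifest["platform"],
        manifest["subject_id"],
        manifest["acquisition_datetime"],
    )


# Values copied from the manifest above.
example = {
    "name": "behavior_713857_2024-05-31_09-23-24",
    "platform": "behavior",
    "subject_id": 713857,
    "acquisition_datetime": datetime(2024, 5, 31, 9, 23, 24, tzinfo=timezone.utc),
}
print(name_is_consistent(example))
```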
Remaining issue with the watchdog:
@arielleleon pointed out that our current manifests are not formatted correctly: https://github.com/AllenNeuralDynamics/dynamic-foraging-task/issues/622
@arielleleon installed the aind-watchdog-service on 8A yesterday. Here are some notes.
- Create a folder at C:\ProgramData\aind\aind-watchdog-service
- Move these files from ?? to the new folder:
  - aind-watchdog-service.exe
  - aind-watchdog-service.xml (for the task scheduler)
  - watch_config.yml (defines where to look for manifests, and where to put completed manifests)
- Set up the scheduled task. Use a CLI command that registers the task from the xml file: schtasks /create /tn "GUI Automation\aind-watchdog-service" /XML "<path to .xml file>" /u svc_aind_behavior /s <ip address>
- Add an environment variable for the watch_config path
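If the scheduled-task step needs to be scripted, one option is to assemble the same schtasks call from Python. The sketch below only builds the argument list from the command quoted in the notes above; `build_schtasks_command` and `register_task` are hypothetical helper names, and the host address in the usage line is a placeholder, not a real rig.

```python
import subprocess


def build_schtasks_command(
    xml_path: str,
    host: str,
    user: str = "svc_aind_behavior",
    task_name: str = r"GUI Automation\aind-watchdog-service",
) -> list[str]:
    """Assemble the schtasks arguments used in the notes above."""
    return [
        "schtasks", "/create",
        "/tn", task_name,
        "/XML", xml_path,
        "/u", user,
        "/s", host,
    ]


def register_task(xml_path: str, host: str) -> None:
    """Run the command; Windows only, and it needs sufficient privileges."""
    subprocess.run(build_schtasks_command(xml_path, host), check=True)


# Placeholder values for illustration only.
print(build_schtasks_command(r"C:\tmp\aind-watchdog-service.xml", "192.0.2.10"))
```

Passing the arguments as a list avoids quoting problems with the space in the task name.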
@arielleleon built an easy deployment application: "If you want to deploy it on more systems, copy this batch file onto the desktop, right-click, and run as administrator. I like to double check that it's running in Task Manager (you will see the two corgi icons next to aind-watchdog-service.exe)."
"\\allen\aind\scratch\ariellel\aind-watchdog-service-deploy.bat"
Steps
watch_dog.yml:
manifest.yml file and the watchdog runs successfully

Troubleshooting / Notes
Example watch_dog.yml:

Fields in manifest.yml
- capsule_id? null for now. Don't know if it will work.
- destination: \scratch or \stage? --> \scratch on VAST would be fine
- project_name --> according to this, I'm using "Behavior Platform"
- s3_bucket --> Literal["s3", "public", "private", "scratch"] (see code). Some s3 buckets are hard-coded in aind-data-transfer-service. If it's Null, it will be private by default (see code).

Error
Could not trigger aind-data-transfer-service
submit_job_response (status_code = 406):

Upload service job fails
Missing rig.json in the uploaded data asset
rig.json, not rig_447-1-A_xxx.json!
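Two of the notes above are easy to check before a manifest is handed to the watchdog: the s3_bucket value, and the rig file name. The sketch below is a hypothetical pre-flight check, not code from aind-data-transfer-service. The allowed bucket values and the private default come from the notes above; reading the last note as "the rig file must be named exactly rig.json" is an assumption.

```python
from pathlib import PurePosixPath

# Values listed in the s3_bucket note above.
ALLOWED_S3_BUCKETS = ("s3", "public", "private", "scratch")


def resolve_s3_bucket(value):
    """Return the bucket to use; None falls back to "private" per the note."""
    if value is None:
        return "private"
    if value not in ALLOWED_S3_BUCKETS:
        raise ValueError(f"s3_bucket must be one of {ALLOWED_S3_BUCKETS}, got {value!r}")
    return value


def misnamed_rig_files(schemas: list[str]) -> list[str]:
    """Return schema paths that look like rig files but are not named rig.json.

    Assumption: a file such as rig_447-1-A_xxx.json is not picked up as rig.json.
    """
    flagged = []
    for path in schemas:
        name = PurePosixPath(path.replace("\\", "/")).name
        if name.startswith("rig") and name.endswith(".json") and name != "rig.json":
            flagged.append(path)
    return flagged
```

For example, `misnamed_rig_files` would flag a schemas entry ending in rig_447-1-A_xxx.json so it can be renamed before upload.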