AllenNeuralDynamics / aind-watchdog-service

Data staging service that prepares acquisition data for cloud upload to Amazon S3 and Code Ocean processing.
https://allenneuraldynamics.github.io/aind-watchdog-service/
MIT License

Watchdog uploads at 20MB/s from the Bergamo 2p rig in 422 #82

Open rozmar opened 2 weeks ago

rozmar commented 2 weeks ago

@arielleleon has been working on this for a while. The original watchdog implementation (robocopy) copies to VAST very slowly. Arielle then implemented the VAST API, but that's also very slow.

Meanwhile, copying with Windows to VAST or using shutil.copytree nearly maxes out the connection (90MB/s).
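(For reference, a minimal sketch of how that throughput could be measured on the rig; the paths are placeholders and this is not part of the watchdog code:)

```python
import shutil
import time
from pathlib import Path

def copytree_throughput(src: Path, dst: Path) -> float:
    """Copy src to dst with shutil.copytree and return the observed MB/s."""
    total_bytes = sum(f.stat().st_size for f in src.rglob("*") if f.is_file())
    start = time.perf_counter()
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return total_bytes / (time.perf_counter() - start) / 1e6

# e.g. copytree_throughput(Path(r"D:\acquisition"), Path(r"\\vast\staging\acquisition"))
```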

jeromelecoq commented 2 weeks ago

The copy is done here: https://github.com/AllenNeuralDynamics/aind-watchdog-service/blob/4020d4406149e7a06cd77ef4cf04f364f341edce/src/aind_watchdog_service/run_job.py#L122C1-L161C20

jeromelecoq commented 2 weeks ago

Watchdog has a nice abstraction of copying, so in principle we should be able to make this faster if we figure out what is holding it up.

jeromelecoq commented 2 weeks ago

From ChatGPT:

There are a few potential reasons why the file copy process could be slow, even when using robocopy, which is generally optimized for better performance:

rozmar commented 2 weeks ago

I also wonder if this comes from the fact that Windows Task Scheduler runs a binary version of watchdog. Maybe it would be faster if we just ran it in Python directly?

jeromelecoq commented 2 weeks ago

It is possible the multi-threading is blocked if the Windows Task Scheduler does not allow multiple threads. When you set up the task, you are defining what resources are available to the agent in Windows.

jeromelecoq commented 2 weeks ago

You could also call the Python script directly in the scheduler. That is how we ran QC for ophys for several years, and it worked fine.

arielleleon commented 2 weeks ago

@jeromelecoq Robocopy can be configured to copy faster; this is an exercise in playing with the available parameters in Robocopy as you showed above.
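(As a point of reference, one way those parameters could be passed from Python; the flag choices below are assumptions to benchmark, not the service's current invocation:)

```python
import subprocess

def robocopy_copy(src: str, dst: str, threads: int = 32) -> int:
    """Hedged sketch: invoke robocopy with throughput-oriented switches.
    Robocopy return codes below 8 indicate success."""
    cmd = [
        "robocopy", src, dst,
        "/E",                   # copy subdirectories, including empty ones
        f"/MT:{threads}",       # multithreaded copy (robocopy defaults to 8 threads)
        "/R:3", "/W:5",         # cap retries/waits so a bad file does not stall the job
        "/NFL", "/NDL", "/NP",  # reduce per-file console logging overhead
    ]
    return subprocess.run(cmd).returncode
```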

We are looking into S3 protocol commands to execute transfers quickly. Mike T has done this successfully. There are issues with this on the Bergamo rig, which are being looked into with IT.

If we want to add a configuration to determine which protocol is used (SMB vs S3) and which application is used for each protocol (shutil vs Robocopy), that is definitely possible. I will leave it to SIPE to determine how they want to do things.
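(To make that concrete, one possible shape for such a configuration; every field name here is hypothetical and not part of the current schema:)

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class TransferConfig:
    """Hypothetical per-rig transfer settings (illustrative only)."""
    protocol: Literal["smb", "s3"] = "smb"                # network protocol for staging copies
    smb_tool: Literal["shutil", "robocopy"] = "robocopy"  # copier used when protocol == "smb"
    robocopy_threads: int = 8                             # value passed to robocopy's /MT switch
```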

fwiw, robocopy is the only thing I have run on Bergamo. We have been debugging the S3 protocol on Bergamo for a while, and it has been going slowly because of rig and human (Stuart and me) availability.

jeromelecoq commented 2 weeks ago

I think this is an important issue: as I understand it, the Bergamo team has been building an alternative upload solution as a result of this issue and found that just changing the Python code seems to give faster uploads. Ideally the Bergamo team can help with debugging this issue.

LKINSEY commented 2 weeks ago

Hi all, Bergamo teammate here. I talked to Marton a little about this and thought I'd do some investigation as well, because from what I've read so far it does seem that calling robocopy from a subprocess should be more efficient at copying files to a network drive than shutil.copytree (which is what I put into the PyQt6 GUI to get copying of all the data under 10 minutes). I am no expert, but after interrogating Google and ChatGPT, I suspect that unbuffered I/O might be contributing to some inefficiencies during robocopying. Roughly 50% of the raw data from experiments is <100 MB; from what I've read online, that can make unbuffered I/O less efficient, whereas if all of the data were >100 MB, robocopy /j would be really fast and efficient.

Maybe we can modify the execute_windows_command function to first sort the files by size and then call run_subprocess twice inside execute_windows_command: one subprocess copying large files with /j and the other copying small files without /j.
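A rough sketch of that idea (the 100 MB cutoff and flag set are assumptions to benchmark, and this is not the current execute_windows_command); robocopy's built-in /MIN and /MAX size filters can do the split without sorting files in Python:

```python
import subprocess

SIZE_CUTOFF = 100 * 1024 * 1024  # 100 MB: rough point where unbuffered I/O (/J) starts to pay off

def copy_split_by_size(src: str, dst: str, threads: int = 16) -> None:
    """Hedged sketch: two robocopy passes, unbuffered I/O only for the large files."""
    common = ["/E", f"/MT:{threads}", "/R:3", "/W:5", "/NP"]
    # Pass 1: files at or above the cutoff, copied with unbuffered I/O
    subprocess.run(["robocopy", src, dst, f"/MIN:{SIZE_CUTOFF}", "/J", *common])
    # Pass 2: everything smaller, copied with the default buffered I/O
    subprocess.run(["robocopy", src, dst, f"/MAX:{SIZE_CUTOFF - 1}", *common])
```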

jeromelecoq commented 2 weeks ago

Ok. @arielleleon, can we look into incorporating some of this feedback? It is really sad that Lucas and Marton are building their own GUI for this, and it impacts our ability to consolidate.