NeurodataWithoutBorders / nwb_benchmarks

Benchmarking for NWB-related operations.
https://nwb-benchmarks.readthedocs.io/en/latest/

Write-up summary of Aim 1 conclusion #71

Open CodyCBakerPhD opened 3 months ago

CodyCBakerPhD commented 3 months ago

For NIH report

CodyCBakerPhD commented 3 months ago

Less about code, more about capabilities that were added (and point to documentation)

CodyCBakerPhD commented 2 days ago

In the last year, NeuroConv has developed fully automated processes for building and deploying Docker images of the central package, as well as ancillary data transfer utilities for use in cloud environments. These workflows are triggered through free-to-use GitHub Actions on every official release, as well as daily on development branches. All Dockerfiles can be found in the public open-source repository under the /neuroconv/dockerfiles folder.

Additionally, a number of helper functions have been added to the neuroconv.tools.aws submodule, such as an API function that automatically sets up an entire AWS Batch infrastructure on EC2, including all related details such as compute environments, job queues, and job definitions. This tool is then leveraged to launch containers of the aforementioned images on an on-demand EC2 instance in a two-step process: (i) Rclone transfers data from a remote cloud storage source (such as Google Drive or Dropbox) onto the EC2 instance, where it is then (ii) converted to the NWB format via a YAML specification file and uploaded directly to the DANDI archive. When all tasks are complete, all requested resources are spun down and cleaned up, minimizing costs to the user.
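As a rough sketch of the intended call pattern (the function and argument names below are illustrative placeholders, not the actual neuroconv.tools.aws API; see the documentation linked below for the real signatures):

```python
# Hypothetical sketch: names and arguments are assumptions standing in
# for the actual neuroconv.tools.aws helpers described above.
from neuroconv.tools.aws import deploy_conversion_batch_job  # hypothetical helper

deploy_conversion_batch_job(
    # (i) Rclone pulls the source data from remote cloud storage onto the EC2 instance
    rclone_command="rclone copy gdrive:my_dataset /mnt/data",
    rclone_config_file_path="/path/to/rclone.conf",
    # (ii) the YAML specification drives the conversion to NWB and the upload to DANDI
    yaml_specification_file_path="/path/to/conversion_spec.yml",
    dandiset_id="000000",  # placeholder dandiset ID
    # optional: name of a DynamoDB table that receives status updates (see below)
    status_tracker_table_name="my-status-table",
)
```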

To ensure this pipeline continues to work far into the future, all steps, from the Docker images to the helper functions, are tested via pytest in continuous integration:

While individual batch job statuses can be tracked from the AWS dashboard, our entire workflow also sends status updates to a central DynamoDB table

[image: screenshot of status entries in the DynamoDB tracking table]

with plans to further improve the resolution and provenance of the tracking in the future.
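For programmatic monitoring (as opposed to the AWS dashboard), reading the table back is a standard boto3 call. A minimal sketch, assuming hypothetical table, region, and attribute names rather than the exact schema used by the testing suite:

```python
# Minimal sketch: poll job statuses from a DynamoDB tracking table with boto3.
# The table name, region, and attribute names ("job_id", "status") are
# assumptions, not the exact schema used by the NeuroConv testing suite.
import boto3

table = boto3.resource("dynamodb", region_name="us-east-2").Table("neuroconv-status-tracking")

# Scan the table and report the latest status of each submitted job.
for item in table.scan()["Items"]:
    print(f"{item['job_id']}: {item['status']}")
```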

All usage instructions may be found in the official NeuroConv documentation, in particular:

oruebel commented 1 day ago

Thanks @CodyCBakerPhD for the helpful summary. A couple of quick questions:

  1. The last two links, on the data transfer in AWS tests and the data conversion in AWS tests, are missing their targets; could you please add those?
  2. Is there an example of how to use these images on AWS?
  3. I was looking at the NeuroConv docs and found a couple of pages on Docker, but wanted to confirm that these are the documentation pages I should point to in the report and use to learn how this works:
  4. Could you clarify how the release process for the images works? Looking at the neuroconv repo, if I understand correctly, the Docker images are stored as packages in the CatalystNeuro GitHub organization at https://github.com/orgs/catalystneuro/packages?repo_name=neuroconv and published to the GitHub Container Registry so that they can be installed via something like docker pull ghcr.io/catalystneuro/neuroconv:latest. Is that how this works, or am I missing something important?
CodyCBakerPhD commented 1 day ago

Sorry, I should have indicated this was still WIP - I was going to ping you once it's ready

  1. should be done by Monday
  2. The helper functions are the best examples
  3. The docs for the helper functions should be done by Monday as well
  4. Yep, exactly. That is where you can find them (they are also tagged by version, and TBH I recommend using those version tags most of the time for easier reproducibility)
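For example, pinning to a specific release looks like docker pull ghcr.io/catalystneuro/neuroconv:&lt;version&gt;, with &lt;version&gt; standing in for whichever release tag you pick from the packages page.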
oruebel commented 1 day ago

Sorry, I should have indicated this was still WIP

Got it. Sorry for being eager with questions. This is very cool stuff

CodyCBakerPhD commented 23 hours ago

@oruebel OK, the rest has been filled in

Some PRs are still under review, though, so you will want to update the links for those items after they get merged. The sections not yet merged are:

oruebel commented 20 hours ago

OK, the rest has been filled in

Thanks for the helpful summary!

sends status updates to a central DynamoDB table

Is this table public, and if so, could you add the URL? If it is internal, is it accessible to the CN team?

some PRs are still under review, you will want to update the links for things after those get merged.

Thanks for the heads up. Will do.

CodyCBakerPhD commented 20 hours ago

Is this table public, and if yes, could you add the URL?

Nope, since all access to/from it is metered and charged

If it is internal, is this accessible to the CN team?

Yes, but there is nothing particularly special about this table aside from the fact that it is the one used by the testing suite

The general idea is that the process can use DynamoDB to send status updates to any such table you want to specify. So if you used the tools yourself (including the demo) you would get your own table for your own use, or you could make a public one for your team and everyone could then use it, etc.

Though also, there is nothing terribly special about DynamoDB in that respect (we could send status updates to any external target, like how we handle progress updates in NWB GUIDE); it is just adjacent to all the other AWS entities and so feels like a natural go-to for this kind of thing
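To make that concrete, a minimal sketch of pushing a status update to a table you own (the attribute names here are assumptions, not the exact schema NeuroConv writes):

```python
# Minimal sketch: write a status update to a user-specified DynamoDB table.
# The table name and attribute names are assumptions, not NeuroConv's schema.
import boto3
from datetime import datetime, timezone

table = boto3.resource("dynamodb").Table("my-team-status-table")  # any table you control
table.put_item(
    Item={
        "job_id": "demo-conversion-001",  # assumed partition key
        "status": "Job submitted...",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
)
```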

oruebel commented 8 hours ago

So if you used the tools yourself (including the demo) you would get your own table for your own use,

Thanks for the clarification. My impression was that the linkage to the table might be hard-coded, but having it configurable by the user makes sense.