Listing out the remaining ideas for improving the recently added cloud-related features that I did not have time to complete.
Most of these have # TODO comments in the code.
There are a bunch of untagged hashes in the organization packages, likely left behind by the daily dev branch builds pushing the most recent 'dev' tag off each time. I'm not sure how much utility there actually is in keeping long-term hashes for each daily build, so someone should set up an automatic cleanup after a certain retention period.
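One possible shape for that cleanup, as a scheduled script against the GitHub Packages REST API (the organization/package names and the 30-day retention below are placeholders, and the actions/delete-package-versions action could likely do the same job):

```python
# Hypothetical cleanup sketch: delete untagged container versions older than N days.
# Org name, package name, and retention period are placeholders, not decisions.
import os
from datetime import datetime, timedelta, timezone

import requests

ORG = "catalystneuro"           # placeholder
PACKAGE = "neuroconv"           # placeholder
RETENTION = timedelta(days=30)  # placeholder retention period
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

versions = requests.get(
    f"https://api.github.com/orgs/{ORG}/packages/container/{PACKAGE}/versions",
    headers=HEADERS,
    params={"per_page": 100},
).json()

cutoff = datetime.now(timezone.utc) - RETENTION
for version in versions:
    tags = version["metadata"]["container"]["tags"]
    created = datetime.fromisoformat(version["created_at"].replace("Z", "+00:00"))
    if not tags and created < cutoff:  # untagged and past the retention period
        requests.delete(
            f"https://api.github.com/orgs/{ORG}/packages/container/{PACKAGE}/versions/{version['id']}",
            headers=HEADERS,
        )
```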
The DynamoDB 'status tracking' feature is used minimally in the tests and the core feature to indicate job submission, completion, and test passing/cleanup. But it could be used for so much more, including real-time progress updates from inside the running jobs (such as a TQDMPublisher callback that sends updates from the buffer iterators).
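For example, a hedged sketch of what a progress callback writing into the tracking table could look like (the table name, key schema, and the exact TQDMPublisher hookup are assumptions, not the current implementation):

```python
# Minimal sketch of pushing real-time iterator progress into the status-tracking table.
# Table name and key schema are placeholders.
import boto3

table = boto3.resource("dynamodb").Table("status_tracker")  # placeholder table name

def publish_progress(job_id: str, format_dict: dict) -> None:
    """Write the latest progress, assuming a tqdm-style dict with 'n' and 'total'."""
    table.update_item(
        Key={"id": job_id},
        UpdateExpression="SET progress_n = :n, progress_total = :total",
        ExpressionAttributeValues={
            ":n": format_dict.get("n", 0),
            ":total": format_dict.get("total", 0),
        },
    )

# A TQDMPublisher wrapped around the buffer iterator would call publish_progress
# on each update (exact subscription call omitted here).
```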
I suspect the IAM role code blocks originally used in job submission are unnecessary, since everything is enacted by a credentialed client and so the user's IAM permissions override the role (or is it the other way around?). Either way, there seems to be no issue with removing the special IAM role for running Batch operations; it would be better to assign that role to the specific AWS user account or remove it entirely.
Figure out EFS One Zone: it is much cheaper overall than regional EFS but was tricky to get working correctly. Actually setting up the EFS for it is simple via the commented-out keyword argument (and don't forget the corresponding subnet); the tricky part is guaranteeing that instances spawn in only one Availability Zone (maybe that's a configuration in the job queue or compute environment?).
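For whoever picks this up, my understanding is that restricting the compute environment's subnets to the same Availability Zone as the One Zone file system is the missing piece; a hedged boto3 sketch (all IDs and names are placeholders):

```python
# Sketch: pin the file system to one AZ, then restrict the compute environment's
# subnets to that same AZ so instances cannot spawn elsewhere. IDs are placeholders.
import boto3

AVAILABILITY_ZONE = "us-east-2a"                 # placeholder
SUBNET_IN_THAT_AZ = "subnet-0123456789abcdef0"   # placeholder

efs = boto3.client("efs")
efs.create_file_system(
    AvailabilityZoneName=AVAILABILITY_ZONE,  # this is what makes it One Zone
    PerformanceMode="generalPurpose",
)

batch = boto3.client("batch")
batch.create_compute_environment(
    computeEnvironmentName="neuroconv_one_zone",  # placeholder
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "subnets": [SUBNET_IN_THAT_AZ],                # only subnets in the EFS AZ
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "instanceRole": "ecsInstanceRole",             # placeholder
        "instanceTypes": ["optimal"],
    },
)
```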
Spot pricing: I did get this working in prototypes, and overall it would be a great way for a user to specify a scalar value between 0 and 1 indicating the percentage of the full on-demand price they are willing to pay. However, it comes with big caveats. The main NeuroConv deployment should only use spot pricing for VERY SHORT conversions, such as processed data; bulk raw data conversions have a high likelihood of non-resumable interruption, which leads to increased costs with no results to show for it. The Rclone transfer step is expected to go much faster than the NWB conversion step and might actually be able to get away with it, but again, mostly for cases with a huge number of very small sessions.
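AWS Batch expresses this as bidPercentage (an integer percentage of on-demand) on a SPOT compute environment; a hedged sketch of the 0-1 mapping (everything other than type and bidPercentage is a placeholder):

```python
# Sketch of translating a user-facing 0-1 scalar into Batch's spot bid percentage.
import boto3

def make_spot_compute_resources(max_fraction_of_on_demand: float) -> dict:
    """Translate a 0-1 scalar into the integer percentage Batch expects."""
    return {
        "type": "SPOT",
        "bidPercentage": int(round(max_fraction_of_on_demand * 100)),
        "minvCpus": 0,
        "maxvCpus": 256,
        "subnets": ["subnet-0123456789abcdef0"],       # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "instanceRole": "ecsInstanceRole",             # placeholder
        "spotIamFleetRole": "AmazonEC2SpotFleetRole",  # placeholder
        "instanceTypes": ["optimal"],
    }

boto3.client("batch").create_compute_environment(
    computeEnvironmentName="neuroconv_spot",  # placeholder
    type="MANAGED",
    computeResources=make_spot_compute_resources(max_fraction_of_on_demand=0.6),
)
```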
I always had the nagging feeling there was something obvious I was missing about how to harness the public base Docker images for Rclone directly, by writing the config (via an environment variable, just like we currently do) as part of a CMD override instead of needing to build an entirely new image for that. But I could never get it to work.
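For reference, the closest thing I can sketch (untested; all names are placeholders) leans on rclone's own RCLONE_CONFIG_&lt;REMOTE&gt;_&lt;KEY&gt; environment variable convention, which avoids writing a config file at all, so the command override only has to contain rclone arguments (the public image's entrypoint is already rclone):

```python
# Unverified sketch: run the public rclone image directly from a Batch job definition,
# with the remote defined purely through environment variables. Names/paths/tokens
# are placeholders.
import boto3

boto3.client("batch").register_job_definition(
    jobDefinitionName="rclone_transfer_public_image",  # placeholder
    type="container",
    containerProperties={
        "image": "rclone/rclone:latest",  # public base image, ENTRYPOINT is `rclone`
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "4096"},
        ],
        "environment": [
            # one remote defined entirely from environment variables
            {"name": "RCLONE_CONFIG_MYDRIVE_TYPE", "value": "drive"},
            {"name": "RCLONE_CONFIG_MYDRIVE_TOKEN", "value": "<oauth token json>"},
        ],
        # becomes arguments to the image entrypoint, i.e. `rclone copy ...`
        "command": ["copy", "mydrive:my_folder", "/mnt/efs/source", "--progress"],
    },
)
```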
Should job definitions be cleaned up? I don't think they should be in practical usage, since they provide very good provenance. The way I think of it is like GitHub logs: you can let them accumulate over time, and the dashboard has nice filtering mechanisms for finding specific tags or otherwise sifting through them. They don't seem to cost anything to keep around. For tests I'm less sure; regular cleanups could be good there.
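If regular cleanup for tests is wanted, a hedged sketch (the "test_" naming convention is an assumption):

```python
# Sketch: deregister only the job definitions created by the test suite,
# assuming they follow a "test_" naming convention.
import boto3

batch = boto3.client("batch")
paginator = batch.get_paginator("describe_job_definitions")
for page in paginator.paginate(status="ACTIVE"):
    for job_definition in page["jobDefinitions"]:
        if job_definition["jobDefinitionName"].startswith("test_"):
            batch.deregister_job_definition(jobDefinition=job_definition["jobDefinitionArn"])
```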
Directly expose instance types for each job: currently it just grabs the 'best fit' instance type, but this could probably be further optimized to specific subtypes to reduce costs. The Rclone step especially does not need as much RAM as the NWB step does.
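One hedged sketch of a per-job knob that already exists: resourceRequirements overrides at submit time (names and sizes below are placeholders); narrowing the compute environment's instanceTypes from "optimal" to specific families would be the other half.

```python
# Sketch: request less memory for the Rclone job than the NWB conversion job.
# Job/queue/definition names and the sizes are placeholders.
import boto3

boto3.client("batch").submit_job(
    jobName="rclone_transfer",              # placeholder
    jobQueue="neuroconv_batch_queue",       # placeholder
    jobDefinition="rclone_job_definition",  # placeholder
    containerOverrides={
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},
            {"type": "MEMORY", "value": "2048"},  # Rclone needs far less RAM than the NWB step
        ],
    },
)
```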
EBS mounts: currently only EFS is used since that was easiest, but EBS is probably cheaper, though that would have to be proven as well. The size would also have to be calculated fairly precisely (with minimal assumptions regarding compression).
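A hedged sketch of what the EBS route might look like: size a gp3 volume from the estimated source size (the overhead factor is a pure guess) in an EC2 launch template that the compute environment would then reference.

```python
# Sketch: size an EBS volume from the estimated (assumed uncompressed) data size
# and bake it into a launch template. Names and the 1.2 overhead factor are guesses.
import math

import boto3

def estimated_volume_size_gib(total_source_bytes: int, overhead: float = 1.2) -> int:
    """Round the source size up to whole GiB with a safety margin."""
    return max(1, math.ceil(total_source_bytes * overhead / 1024**3))

boto3.client("ec2").create_launch_template(
    LaunchTemplateName="neuroconv_batch_ebs",  # placeholder
    LaunchTemplateData={
        "BlockDeviceMappings": [
            {
                "DeviceName": "/dev/xvda",
                "Ebs": {
                    "VolumeSize": estimated_volume_size_gib(500 * 1024**3),
                    "VolumeType": "gp3",
                },
            }
        ],
    },
)
# The compute environment would then point computeResources["launchTemplate"] at this.
```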
I wanted to use an /output subfolder on the EFS mount, but it technically doesn't exist and won't create itself in the current Docker images. I added the relevant code to the YAML functions in #1086, but it will take some release cycles to make its way into the images.
Similarly, once #1089 is merged and released in the current Docker images, the current test could be modified to include a DANDI upload operation at the end to ensure the data leaves AWS and lands on the archive.
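A hedged sketch of the verification half of that test, using the DANDI Python client against the staging server (the dandiset ID and asset path are placeholders):

```python
# Sketch: after the (future) DANDI upload step runs, confirm the asset actually
# landed on the archive. Dandiset ID and expected path are placeholders; this
# assumes the test uploads to the staging server.
from dandi.dandiapi import DandiAPIClient

DANDISET_ID = "000000"                         # placeholder
EXPECTED_PATH = "sub-001/sub-001_ses-001.nwb"  # placeholder

with DandiAPIClient.for_dandi_instance("dandi-staging") as client:
    dandiset = client.get_dandiset(DANDISET_ID, "draft")
    asset_paths = [asset.path for asset in dandiset.get_assets()]

assert EXPECTED_PATH in asset_paths, "Upload did not reach the archive!"
```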
The rclone_command of the main deployment function could 'maybe' be simplified down to minimal required user inputs? However, there are many special cases where additional Rclone flags (such as the common --drive-shared-with-me for Google Drive) need to be passed, which is why I expose the entire command there.
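Purely as a sketch of what the simplified signature could look like (not the current API), keeping an escape hatch for those special-case flags:

```python
# Hypothetical helper: build the rclone command from minimal inputs plus optional
# extra flags, instead of exposing the raw command string.
from typing import Optional


def build_rclone_command(
    source: str,
    destination: str,
    extra_flags: Optional[list] = None,
) -> str:
    """E.g., build_rclone_command("mydrive:data", "/mnt/efs/source",
    extra_flags=["--drive-shared-with-me"])."""
    flags = ["--progress", "--verbose", *(extra_flags or [])]
    return " ".join(["rclone", "copy", source, destination, *flags])
```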
A universal synchronization mechanism for the server-side asynchronous steps: this is currently handled via generic time.sleep calls, with or without specific condition checks and maximum iterations (e.g., https://github.com/catalystneuro/neuroconv/blob/main/src/neuroconv/tools/aws/_submit_aws_batch_job.py#L484, https://github.com/catalystneuro/neuroconv/pull/1086/files#diff-fbec866f95a8edcf190ce00113773906f58d29fdf491a2c27dbf57dedecf9187R114-R136). While the underlying boto3 API no doubt uses async to properly await client-sent web requests (the things that return HTTP responses), the actual operations in EC2 are a step beyond that and often take a while to reach a point where we want to send a subsequent conditional web request. Some kind of clever context manager would no doubt work.