Closed · psschwei closed this 1 day ago
Still todo: add tests for the new argument reading functionality
:tada:
Thank you so much @psschwei ! Really awesome work 🙏
I will take a closer look tomorrow. @IceKhan13, @akihikokuroda, this feature is important and introduces some good changes; I would appreciate it if you could invest some time reviewing it too ❤️
> we are limiting the execution to one Job per user because we are using the same file name for each Job.
It's the same file name for each job, but each job is stored in a different location. When we submit a job, we create a temporary directory for the artifact: https://github.com/Qiskit/qiskit-serverless/blob/15fcb4eb8df2505ff0a292d3d370a4cfae0627ed/gateway/api/ray.py#L87-L91 and we save the `arguments.serverless` file in that temporary directory. So each job should have its own distinct arguments file.
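A minimal sketch of the flow described above, assuming the gateway stages each job's artifact in its own temporary directory (the helper name here is hypothetical; only the `arguments.serverless` file name comes from the discussion):

```python
import json
import os
import tempfile

ARGUMENTS_FILE = "arguments.serverless"  # same name for every job

def stage_job_arguments(arguments: dict) -> str:
    """Write job arguments into a fresh temporary directory.

    Because each job gets its own directory, the shared file name
    never collides across jobs.
    """
    workdir = tempfile.mkdtemp(prefix="job-artifact-")
    path = os.path.join(workdir, ARGUMENTS_FILE)
    with open(path, "w") as f:
        json.dump(arguments, f)
    return path

# Two jobs use the same file name but land in different directories.
path_a = stage_job_arguments({"circuit": "qc1"})
path_b = stage_job_arguments({"circuit": "qc2"})
assert path_a != path_b
```

The key point is that uniqueness comes from the directory, not the file name, so concurrent jobs from the same user do not overwrite each other's arguments.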
> we continue storing the arguments in the database, and the final idea is to store those arguments in the COS.
I don't know... the more I think about it, the less I like the idea of putting the arguments in COS. Partially because it adds a layer of complexity that I don't think we need (splitting the job data between the DB and COS). Partially because it ended up being pretty easy to add the arguments as a file rather than an envvar. And partially because, if it's a question of needing to save space, I think it might be better to be a bit more aggressive about somehow archiving DB records rather than trying to split them. But we can discuss it more on our next sync.
> It's the same file name for each job, but each job is stored in a different location. When we submit a job, we create a temporary directory for the artifact
I have to admit I was not aware of this part of the submit logic. Are we removing these entries at some point? Do you know?
> we can discuss it more on our next sync.
I see your points. We will definitely need to analyze it.

Another thing I noticed while analyzing the code: this is a breaking change, isn't it? Since we are changing `get_arguments`, clients of functions using the old `get_arguments` will need a migration from the current functions.
> Are we removing these entries at some point? Do you know?
I don't think so... we should probably look into setting an expiration policy
> this is a breaking change, isn't it?
Hmm... yeah, it is. This will require folks to upgrade their client. But I don't think there's any way around that -- seems like the max size of an envvar in Kubernetes is ~1MB.
Well, in theory we could do something like set the envvar if the args are <1MB. So anybody impacted by this issue would have to upgrade, but those who weren't impacted could keep going as before.... depends how much impact a breaking change would have...
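The fallback idea floated above (envvar when the arguments fit, file otherwise) could look roughly like this. This was only a suggestion in the thread, not merged behavior; the function and the injected `env`/`write_file` parameters are hypothetical:

```python
import json

# Kubernetes limits an envvar's value to roughly 1 MB, per the thread above.
MAX_ENV_ARG_BYTES = 1_000_000

def pass_arguments(arguments: dict, env: dict, write_file) -> str:
    """Hypothetical sketch: use the envvar for small payloads so old
    clients keep working, and fall back to a file for large ones."""
    payload = json.dumps(arguments)
    if len(payload.encode("utf-8")) < MAX_ENV_ARG_BYTES:
        env["ENV_JOB_ARGUMENTS"] = payload
        return "env"
    write_file(payload)  # e.g. write arguments.serverless in the working dir
    return "file"
```

Under this scheme only users whose arguments exceed the limit would need to upgrade their client, which is exactly the trade-off being weighed in the comment above.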
> I don't think so... we should probably look into setting an expiration policy
Yeah, that makes sense. We can discuss this in our sync next week too. I was thinking that maybe once a Job finishes we could clean up its resources, or something like that.
> depends how much impact a breaking change would have...
Let me open the conversation with Paco. I think it's a good opportunity to test a process around this. Thanks, Paul 🙏
@psschwei @IceKhan13 @akihikokuroda I'm going to put this on hold, taking into account that it introduces a breaking change, so we cannot merge it right now. Let's try to find a way to avoid the breaking change. I'm going to try to think of a proposal; feel free to propose something too 👍
From a Slack discussion, here's the new plan:

After the update, the server sets both:
* the `ENV_JOB_ARGUMENTS` envvar
* the `serverless.arguments` file

Old client:
* reads from `ENV_JOB_ARGUMENTS`

New client:
* reads from `serverless.arguments`
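The compatibility scheme above could be sketched on the client side like this (the file name and envvar come from the plan; the reader function itself is hypothetical, and a new client could even keep the envvar as a fallback):

```python
import json
import os

ARGUMENTS_FILE = "serverless.arguments"  # file name per the plan above

def read_arguments() -> dict:
    """New-client behavior: prefer the file written by the server,
    falling back to the envvar the old flow used."""
    if os.path.exists(ARGUMENTS_FILE):
        with open(ARGUMENTS_FILE) as f:
            return json.load(f)
    # old-client path: arguments passed via the envvar
    return json.loads(os.environ.get("ENV_JOB_ARGUMENTS", "{}"))
```

Because the server sets both the envvar and the file, old clients keep reading the envvar unchanged while new clients pick up the file, so neither breaks during the transition.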
Note that the log messages indicating whether arguments were passed through the envvar are in the gateway pod, not the scheduler.
To test the backwards compatibility, deploy the latest gateway code but set `v0.16.3` for the Ray node in the `rayclustertemplate`.
Summary
Provide support for passing large arguments (>1 MB) to programs / functions, for example utility-scale circuits.
Details and comments
We used to use the Ray environment variable `ENV_JOB_ARGUMENTS` to pass arguments to programs at runtime. While this is fine for small arguments, Ray wasn't able to handle larger ones (like circuits at utility scale using 100+ qubits), which resulted in the job being stuck in the pending / initializing phase.

So instead of trying to pass arguments by environment variable, this PR updates the submission flow to write arguments to a file (`arguments.serverless`, overwriting any existing file) in the working directory, so it will be included when the job is sent to the Ray cluster. Since this happens in the gateway, it won't run into the `MAX_ARTIFACT_FILE_SIZE_MB` limitation, which is checked in the client before arguments would be added.