Closed miazamrai closed 2 years ago
Root cause: The list engines response is returning an engine with size XS/2 which is not an enum value in the EngineSize enumeration.
We shouldn't need to update the SDKs just b/c there is a new engine size and we certainly shouldn't break existing SDKs.
Why should the sizes be an enum? We moving to distributed engines where what we used to call Engine will become a cluster of engines.
Think of XS/2 being engine cluster with 2 node each being XS. There will be too many variant, XL/4, S/3 etc..
Creating an enum for all this is not scalable.
@Segflow would it be more natural to do this instead?
{
"type": "Cluster",
"engineSize": "XL",
"clusterSize": 5
}
and
{
"type": "Engine",
"engineSize": "XL"
}
XL/4
feels confusing to denote "4 engines in a cluster"
Moreover, is this cluster thing user-facing? I.e. do users need to do anything at all working with it vs a single engine?
If so, it's a breaking change requiring a client update, so it's certainly not a problem to update the SDK?
Changing the payload would break the api, something we cannot do right now. We understand that xs/4 is not ideal, but it allpw us to experiment with the feature without breaking other people workflow.
Reusing the same field to encode 2 different semantics allows us to roll out the feature faster.
I heard that in the future this will change and user will just say how many CPU they need.
Changing the payload would break the api
Is that actually true? Isn't it a non-breaking change to add a field? (And a breaking change to change the meaning of a field? -- Though in this case i still think the change made was pragmatic and reasonable 👍).
I would think that we should be able to add a field to the get_engine()
response without worrying about breaking changes. I think that @whatsthecraic would also like us to report the number of workers anyway, since it would be useful for the benchmarking framework to know how many workers an engine had.
Note: This issue has been migrated to https://relationalai.atlassian.net/browse/RAI-6831.
This link is only accessible to employees of RelationalAI.
ListEngines example is throwing the following exception
Environment to repro: Latest