SurajAralihalli opened this issue 11 months ago
By "support", do we mean the ability to run the spark-rapids-user-tools python package on Dataproc Serverless, or the ability to analyze the logs generated by apps running on the serverless platform?
P0 scope:
- Add `dataproc-serverless` as a supported platform and validate that running the qualification tool with this platform uses the available speedup factors for Dataproc Serverless.

P1 scope:
- Add `dataproc-serverless` cost estimation for the qualification tool.
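To make the P0 item concrete, here is a minimal sketch of how a per-platform speedup-factor lookup could gain a `dataproc-serverless` entry for the qualification estimate; the names and numbers below are illustrative assumptions, not the tool's actual internals or real speedup data.

```python
from dataclasses import dataclass

# Hypothetical mapping of platform -> default operator speedup factor.
# The "dataproc-serverless" entry and its value are placeholders.
PLATFORM_SPEEDUP_FACTORS = {
    "dataproc": 4.0,
    "dataproc-serverless": 3.5,  # assumed placeholder value
    "emr": 3.0,
}

@dataclass
class AppEstimate:
    app_id: str
    cpu_duration_ms: int

def estimate_gpu_duration(app: AppEstimate, platform: str) -> float:
    """Estimate GPU duration by dividing CPU duration by the platform's speedup factor."""
    factor = PLATFORM_SPEEDUP_FACTORS.get(platform, 1.0)
    return app.cpu_duration_ms / factor

# Example: a qualification estimate for an app submitted to Dataproc Serverless.
print(estimate_gpu_duration(AppEstimate("app-20240101-0001", 600_000), "dataproc-serverless"))
```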
Hello @amahussein, I meant the latter: the ability to analyze the logs generated by Dataproc Serverless applications (qualification, profiling). On a similar note, #663 is a feature request to add the Dataproc Serverless job creation command to the qualification tool output for Dataproc (for users migrating from classic Dataproc to Dataproc Serverless).
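As a rough illustration of what that #663 output could include, the sketch below formats a Dataproc Serverless batch submission command from a few recommended Spark properties; the gcloud flags used here and the `spark.dataproc.*` property name are assumptions to verify against the Dataproc Serverless documentation.

```python
# Hypothetical helper: build a suggested `gcloud dataproc batches submit spark`
# command string for a migrated job. Flags and property names are assumptions.
def build_batch_submit_cmd(region: str, main_class: str, jar: str,
                           spark_properties: dict) -> str:
    props = ",".join(f"{k}={v}" for k, v in spark_properties.items())
    return (
        "gcloud dataproc batches submit spark "
        f"--region={region} "
        f"--class={main_class} "
        f"--jars={jar} "
        f"--properties={props}"
    )

# Example with assumed recommended settings for a RAPIDS-accelerated batch.
print(build_batch_submit_cmd(
    region="us-central1",
    main_class="com.example.MySparkApp",       # placeholder
    jar="gs://my-bucket/my-spark-app.jar",     # placeholder
    spark_properties={
        "spark.executor.cores": "8",
        "spark.executor.memory": "16g",
        "spark.dataproc.executor.compute.tier": "premium",  # assumed property name
    },
))
```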
After assessing this feature, we discussed that we need to be able to detect the cluster shape from the eventlogs. The reason is that users are unlikely to keep the batchId of the Spark submission around for long, which means we cannot use the batchId or cluster configs as inputs to the tools. Instead, we have to rely on extracting that information from the eventlogs/driver logs (a rough parsing sketch follows the resources below).
So, this issue should depend on #581
Resources:
https://cloud.google.com/dataproc-serverless/docs/concepts/properties
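As a rough sketch of the eventlog route, the snippet below pulls shape-related properties out of a Spark eventlog's `SparkListenerEnvironmentUpdate` event; the specific `spark.dataproc.*` keys are assumptions to confirm against the properties page linked above.

```python
import json

# Keys that describe the cluster/executor shape. The spark.dataproc.* entries
# are assumptions based on the Dataproc Serverless properties documentation.
SHAPE_KEYS = (
    "spark.executor.cores",
    "spark.executor.memory",
    "spark.executor.instances",
    "spark.dynamicAllocation.maxExecutors",
    "spark.dataproc.executor.compute.tier",
    "spark.dataproc.executor.disk.size",
)

def extract_cluster_shape(eventlog_path: str) -> dict:
    """Scan an uncompressed Spark eventlog and return shape-related Spark properties.

    The eventlog is JSON-lines; the SparkListenerEnvironmentUpdate event carries
    the resolved Spark properties under the "Spark Properties" key.
    """
    shape = {}
    with open(eventlog_path, encoding="utf-8") as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            if event.get("Event") == "SparkListenerEnvironmentUpdate":
                props = event.get("Spark Properties", {})
                shape = {k: props[k] for k in SHAPE_KEYS if k in props}
                break
    return shape

# Example usage with a placeholder path:
# print(extract_cluster_shape("/path/to/eventlog"))
```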
Adding support for the Dataproc Serverless platform in spark-rapids-user-tools would be useful to Dataproc Serverless users.