Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

Remote debugging on kubernetes #1481

Open DonIvanCorleone opened 1 year ago

DonIvanCorleone commented 1 year ago

Hi there

I have an idea and some lines of code, which I would like to propose/discuss with you @saikonen

Identified shortcoming

At the moment, Metaflow cannot easily be used for remote debugging. By remote debugging I mean debugging on the pod/container within the Kubernetes cluster. Instead, we have to debug locally, which has the obvious disadvantage that hardware constraints can kick in. If we resume a failed step and load the input data for the step to be debugged from e.g. S3, the client PC needs enough memory to hold that data. Similarly, if the step to be debugged requires GPU hardware (or any other scarce resource), those resources must be available on the client PC, which is not necessarily the case.

At the moment I do not see a good way to overcome this problem with the built-in functionality.

Idea

If we could leverage debugpy and the remote-debug capabilities of VS Code, we could attach our local debug session to a remote debug server, as long as we make the appropriate port-forwarding etc. possible.

PoC

For testing purposes I used @conda_base(libraries={'debugpy':'1.6.7'})

In kubernetes_job.py line:150 I added in the V1Container specification ports=[client.V1ContainerPort(container_port=5678, host_ip="0.0.0.0", host_port=5678)],

This maps the default debugpy server port from the container to the host. Of course, the port could either use a default value, be set through a CLI option, or be configured in any other suitable way.

In kubernetes_cli.py (line 180), which creates the step_cli variable, I changed the entrypoint to entrypoint="%s -m debugpy --listen 0.0.0.0:5678 --wait-for-client %s" % (executable, os.path.basename(sys.argv[0])),

This is specifically used to start the debugpy session and let it wait until we attach to the process – ideally with already loaded breakpoints :)
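The entrypoint change can be sketched in isolation as follows. This is only an illustration under assumptions: the helper name debugpy_entrypoint and its parameters are hypothetical and not part of Metaflow; only the debugpy command-line flags --listen and --wait-for-client are taken from the PoC above.

```python
import shlex


def debugpy_entrypoint(executable, script, host="0.0.0.0", port=5678, wait=True):
    """Build a command prefix that runs `script` under a debugpy server.

    Hypothetical helper for illustration; Metaflow does not ship this function.
    """
    parts = [executable, "-m", "debugpy", "--listen", f"{host}:{port}"]
    if wait:
        # Block the step until a debugger client (e.g. VS Code) attaches.
        parts.append("--wait-for-client")
    parts.append(script)
    # Quote each part so paths containing spaces survive the shell.
    return " ".join(shlex.quote(p) for p in parts)


print(debugpy_entrypoint("python", "flow.py"))
# python -m debugpy --listen 0.0.0.0:5678 --wait-for-client flow.py
```

Making the host, port and wait behaviour parameters would allow a later debug option to override the defaults without touching the string template.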

Finally, I used a custom debug launch.json in VS Code:

"configurations": [
  {
    "name": "Python: Remote Attach",
    "type": "python",
    "request": "attach",
    "connect": {
      "host": "<node ip>",
      "port": 5678
    },
    "pathMappings": [
      {
        "localRoot": "<local path of your flow>",
        "remoteRoot": "/metaflow"
      }
    ],
    "justMyCode": true
  }
]
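Since the node IP changes from run to run, the attach configuration could also be generated rather than edited by hand. The following is a minimal sketch, assuming a hypothetical helper remote_attach_config; the IP address 10.0.0.12 is a placeholder, and the "version" key and the ${workspaceFolder} default are additions that do not appear in the snippet above.

```python
import json


def remote_attach_config(host, port=5678,
                         local_root="${workspaceFolder}",
                         remote_root="/metaflow"):
    """Return a VS Code launch.json structure for attaching to debugpy.

    Hypothetical helper; field names mirror the launch.json shown above.
    """
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Python: Remote Attach",
                "type": "python",
                "request": "attach",
                "connect": {"host": host, "port": port},
                "pathMappings": [
                    {"localRoot": local_root, "remoteRoot": remote_root}
                ],
                "justMyCode": True,
            }
        ],
    }


# Placeholder node IP; write the result to .vscode/launch.json if desired.
print(json.dumps(remote_attach_config("10.0.0.12"), indent=2))
```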

Potential future

It would be cool to have the option to start a metaflow resume with debug and kubernetes

resume debug --with kubernetes or similar

This debug option could automatically install debugpy in the target container and start the step under debugpy. Port mapping might be a bit tricky, particularly when debugging a foreach step. I am currently not sure what would be optimal here. Any ideas?
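One conceivable scheme for the foreach case, purely as a discussion starter: derive the host port from the split index so that tasks landing on the same node do not compete for the same host port. The function debug_port_for_task, the max_tasks limit and the idea of offsetting from 5678 are all assumptions, not existing Metaflow behaviour.

```python
def debug_port_for_task(split_index, base_port=5678, max_tasks=100):
    """Map a foreach split index to a distinct host port.

    Hypothetical scheme: base_port + split_index, bounded by max_tasks.
    """
    if not 0 <= split_index < max_tasks:
        raise ValueError(
            f"split_index must be in [0, {max_tasks}), got {split_index}"
        )
    return base_port + split_index


# Three foreach tasks get three different host ports.
ports = {i: debug_port_for_task(i) for i in range(3)}
print(ports)  # {0: 5678, 1: 5679, 2: 5680}
```

Inside the container the debugpy server could keep listening on 5678 in every task, since each pod has its own network namespace; only the host-side port would need to differ.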

Drawbacks of the solution outlined above

1) It's only tested for Kubernetes, and I have no idea whether this would also be an option for AWS Step Functions, Batch, GCP and all the others.
2) debugpy is for VS Code and I only tested it with this specific setup. Other IDEs like PyCharm should work similarly, but I do not know and have not tested them at all.

It's a rough sketch, but I hope I could describe the idea properly. Would you mind considering it for your upcoming work? From my POV it would help many Metaflow users a lot.

Happy to discuss.

DonIvanCorleone commented 1 year ago

Just realized that this issue is tangentially related to #739

DonIvanCorleone commented 1 year ago

Hi @saikonen,

by any chance do you have an opinion on the idea?

DonIvanCorleone commented 1 year ago

Hi @saikonen

I am just swinging by to check whether you have had the chance to look at the topic described above. Looking forward to your response :)

All the best

saikonen commented 1 year ago

Sorry for the long delay, finally back from my holidays :) Some immediate questions that came to mind

Had a quick test with your instructions, but I ran into a wall regarding direct node-access being blocked for our Kube cluster. I see the usage with the vscode plugin (or the CLI) being quite tedious though, requiring determining a node-IP, fiddling with the configuration, and attaching to the debugger. To my understanding debugging multiple steps of a flow would require going through the whole process for each step, as there is no guarantee that they run on the same node.
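Where direct node access is blocked, one option that might be worth evaluating is kubectl port-forward, which tunnels through the Kubernetes API server instead of connecting to the node. The sketch below only builds the command; the helper name port_forward_command and the pod name are placeholders, and it assumes the user is allowed to port-forward to pods in the namespace.

```python
import shlex


def port_forward_command(pod_name, namespace="default",
                         local_port=5678, remote_port=5678):
    """Build a `kubectl port-forward` command for a single task pod.

    Hypothetical helper; the pod name has to be looked up per task.
    """
    return [
        "kubectl", "port-forward",
        "--namespace", namespace,
        f"pod/{pod_name}",
        f"{local_port}:{remote_port}",
    ]


cmd = port_forward_command("my-metaflow-task-pod")
print(shlex.join(cmd))
# kubectl port-forward --namespace default pod/my-metaflow-task-pod 5678:5678
```

With the tunnel running, the debugger would attach to localhost and the local port rather than to a node IP. The per-step repetition mentioned above remains, because every task pod needs its own tunnel.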

If you have some specific use cases in mind already for a debugger then these could be a good starting point for fleshing out what a debugging feature would look like feature-wise. I would like to try and outline the problems we're trying to tackle with a debugger before starting any implementations, but if you want to move forward with a PoC that is fine as well.

saikonen commented 1 year ago

The recently announced Metaflow Office Hours meeting could also be a good opportunity to demo/have a discussion about the feature if you're interested and can attend. Details at https://outerbounds-community.slack.com/archives/C01TTBG855K/p1691533065990379

DonIvanCorleone commented 1 year ago

Sorry for my delayed response. I have some quite busy days behind me, and the next couple of weeks do not look any quieter, but I at least wanted to answer a couple of your questions.

First of all: Thanks for getting back to this ticket, I hope you had a nice and relaxing vacation.

Second: I have not tested any other way so far, but I fully understand your concerns here. I am pretty sure it should be possible with a slightly different approach that is much more security friendly. Unfortunately, I currently do not have the time to look into different possibilities, but I will certainly do so once I have more time on my hands.

Third: Due to my business use case I can only test on an on-prem Kubernetes cluster. All other platforms are unknown to me, and to be honest I only have very limited knowledge about them.

Fourth: debugpy works seamlessly with VS Code, but e.g. PyCharm favors a different debugger, and I am certain there are many more. I tested with debugpy because I use VS Code and it has remote debugging capabilities (but I believe I read that the PyCharm debugger has this feature as well).

I fully support your approach of first understanding the problem and designing a solution rather than jumping to conclusions. I am uncertain how I could support you with something like that. Is there anything you need from my side where I can put in my 2 cents?

DonIvanCorleone commented 1 year ago

Hi @saikonen

Just one minor update with respect to your question "Have you considered alternative ways of accessing the debugger running in the pod if direct access is restricted?"

I have tested the following

  1. Added entrypoint="%s -m debugpy --listen 0.0.0.0:5678 --wait-for-client %s" % (executable, os.path.basename(sys.argv[0])) in kubernetes_cli.py
  2. Added ports=[client.V1ContainerPort(container_port=5678)], in kubernetes_job.py
  3. Added a service on Kubernetes mapping the internal port to an external one
  4. Connected VS Code to the node IP and the external port

service.yaml looks like this

kind: Service
apiVersion: v1
metadata:
  name: debug-hostname-service
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: metaflow-task
  ports:
    - port: 5678        # required field; assumed here to match targetPort
      targetPort: 5678
      nodePort: 30163

Using dedicated "debug-labels" (instead of the current selector), one can ensure that traffic on the stable service nodePort is routed to the right pods. Would this be more suitable from your POV?
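A sketch of what such a service with a dedicated debug label could look like when generated from Python. The label key metaflow/debug-session and the helper debug_service_manifest are hypothetical; nothing here is existing Metaflow functionality, and the pods would need to carry the same label for the selector to match.

```python
import json


def debug_service_manifest(name, debug_label_value,
                           node_port=30163, target_port=5678):
    """Build a NodePort Service manifest selecting pods by a debug label.

    Hypothetical helper; the label key below is made up for illustration.
    """
    # 30000-32767 is the default NodePort range of a Kubernetes cluster.
    if not 30000 <= node_port <= 32767:
        raise ValueError(f"node_port {node_port} outside default NodePort range")
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "NodePort",
            "selector": {"metaflow/debug-session": debug_label_value},
            "ports": [
                {
                    "port": target_port,
                    "targetPort": target_port,
                    "nodePort": node_port,
                }
            ],
        },
    }


# kubectl accepts JSON manifests, so the output can be piped to `kubectl apply -f -`.
print(json.dumps(debug_service_manifest("debug-hostname-service", "run-123"), indent=2))
```

Using one label value per debug session would keep the service from selecting unrelated task pods that happen to run at the same time.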

DonIvanCorleone commented 1 year ago

Addendum: The approach shown above works best if Kubernetes labels can be applied. Has this feature been reverted? At the moment I cannot find it in the code base anymore.