banditelol / public-notes

public notes as issue thread, inspired by simonwilson/public-notes

Conference Notes #7

Open banditelol opened 1 year ago

banditelol commented 1 year ago

Conference Notes

This Issue will contain all the notes I took, plus additional interesting ideas I found, while watching conferences either in person or online on YouTube

banditelol commented 1 year ago

Makefiles: One Great Trick for Making Your Conda Environments More Manageable | PyData Global 2021

PyData is one of the better sources of videos out there; while I still need to curate the content from time to time, this is one of the best and most practical talks I've found. It also pushed me to add Kjell Wooding to my list of followed people.

In this talk he basically guides us step by step through the points that make make a good tool for a data science workflow, and how to implement them. In total there are 9 points:

  1. Use git to version your project
  2. Use a virtual environment when using Python
  3. Use your git repo name for your env name
  4. Check your virtual environment spec into the repo. In this case you need an environment.yaml for your conda stuff
  5. Use make for your environment management, because let's face it, you don't remember the snippet to create a new environment from environment.yaml, do you? (see the sketch after this list)
  6. Never install packages manually, e.g. edit environment.yaml + make update_env, or in poetry's case use poetry add to do something similar. The neat thing is both can be abstracted behind a Makefile
  7. Use auto-documentation, either in the Makefile by adding a comment before the command, or in-script by using docstrings. Keep the docs near your code automatically, and only manually write docs that have a long shelf life
  8. Separate what you want from what you need. This is the basis of lockfiles: you still write what you want in a human-readable interface (environment.yaml, requirements.txt, etc.) but you need an additional file for the actual installed dependencies (env.lock.yaml, req-v1-2.lock.yaml, etc.)
  9. If you implement all of those, don't be afraid to nuke it all from orbit when you've messed things up.
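
A minimal sketch of what such a Makefile might look like (my reconstruction of points 3-7, not the talk's actual code; the create_env/update_env target names are assumptions):

```makefile
# Sketch of points 3-7: env named after the repo dir, managed via make,
# self-documenting via ## comments. Recipe lines must be indented with tabs.
.PHONY: create_env update_env help

create_env:  ## Create the conda env from environment.yaml, named after the repo dir
	conda env create -f environment.yaml -n $(notdir $(CURDIR))

update_env:  ## Re-sync the env after editing environment.yaml
	conda env update -f environment.yaml -n $(notdir $(CURDIR)) --prune

help:  ## List targets, generated from the ## comments (point 7)
	@grep -E '^[a-zA-Z_-]+:.*## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*## "}; {printf "%-12s %s\n", $$1, $$2}'
```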

This talk is also linked with Love Your (Data Scientist) Neighbour - Amy Wooding | PyData Global 2021, which frames it as 6 stages of reproducibility issues:

  1. Where do I start: README and LICENSE
  2. Where do I go next: organization + workflow via make
  3. What to install: create_env and env.yaml (see the sketch after this list)
  4. Where do I find the data: dataset recipe?
  5. Does it work as expected: make test
  6. What's the purpose: use a src module via editable install. The accompanying repo can be found here
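
For stage 3, a hypothetical minimal environment.yaml (the name and pins here are placeholders I made up, not from either talk) could look like:

```yaml
# Hypothetical environment.yaml: the "what you want" file (point 8 above);
# the lockfile holds the actual resolved dependencies.
name: my-repo-name        # match your git repo name (point 3)
channels:
  - conda-forge
dependencies:
  - python=3.10
  - scikit-learn
  - pip
  - pip:
      - flask             # pip-only packages go under the pip key
```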
banditelol commented 1 year ago

Kei Nemoto - Gentle introduction to scaling up ML service with Kubernetes + MLflow | PyData NYC 2022

I've been wanting to implement MLflow for managing my ML services, but I haven't had the cognitive bandwidth yet to do it for real. So for now I'll go into consumption mode and enjoy the talk. Anyway, I'll try to follow along with the hands-on if possible.

Kei Nemoto (github.com/box-key) is a DS at the Montefiore Einstein Center for Health, and the code for this talk is in box-key/pydata-kubernetes-mlflow

  1. e.g. we create an ML service with a stack of sklearn-Flask-Gunicorn-Docker. In this case we have a single-container deployment "pattern".
  2. What's the problem?
    1. Host downtime
    2. Update downtime
    3. Resource limitations (the service is limited by the host's resources)
  3. Kubernetes supposedly solves these 3 problems. Also, watch the Kubernetes documentary by Honeypot.

    What additional problems are introduced by this solution?

  4. Kube: Control Plane + Workers. Kubelet is ...

    What are workers vs nodes? And what is a kubelet?

  5. Kubernetes solves these problems by using Deployments.
    1. Host downtime doesn't matter as long as at least one worker is active
    2. Rolling updates are built into Kubernetes
    3. Scalability is "easy" as we can just add replicas to match the load
  6. Creating a Deployment is done using kubectl apply, and a Deployment will generate several pods (see the manifest sketch after this list). So this assumes the Nodes and Control Plane are already set up?
  7. ClusterIP (the Service; Service is a resource kind on the same level as Deployment, Job, etc.) is a load balancer which decides which pod will get the request.

    But how much of a bottleneck is this service? And how do we scale it? Also, is it only available inside Kubernetes?

  8. There's also the NodePort Service. This works by exposing a specific port, and as a request hits the NodePort it's redirected to the ClusterIP.
  9. The LoadBalancer Service is an external LB for the nodes (as opposed to ClusterIP's role as LB for the pods).
  10. Should we run 4 pods? Aren't we wasting resources? Knative! A serverless service. Wait, so is it a replacement for the LB?
  11. Now that we've covered the service itself, how do we handle the model binary? We could build an image for each model binary, but it's hard to track the lineage of each model. So: MLflow Model Registry!
  12. Data Scientist -> Model Registry -> ML Eng -> embed the connection link to the model (URI) into the image as an env variable -> build the image (see the Python sketch after this list)


  13. Model versions are several versions of the same model.

    How should we define what "the same model" is?

  14. K3s (lite) on DigitalOcean native VMs using 3 nodes (2 workers (1GB-1CPU) and 1 control plane (2GB-1CPU)), running Knative. Image repo on DockerHub, MLflow (1GB-1CPU) for registry + server.
  15. QnA
    1. Challenges in a realistic scenario: K3s was hard to set up and they faced a lot of network issues. Also, it's possible you have multiple VMs in one rack and all hell breaks loose
    2. Knative practicality for latency: warm start vs cold start, but what about the downtime? For a big model it can take a long time. You can also keep 1 pod on standby.
    3. Production distribution: they have to use an on-prem Kubernetes cluster for the health domain.
    4. Load balancer IPs/subnets: ClusterIP can have a CIDR. Kubernetes doesn't set up the subnet, it uses the existing subnet.
    5. MLflow setup multitenancy: not yet.
    6. Kubernetes vs GCloud: not apples to apples, but compared to Mesos, Yarn, and Swarm it's known to be more stable.
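
As a sketch of points 5-7: a hypothetical Deployment plus ClusterIP Service manifest. The names, image, and ports are placeholders I made up, not the talk's actual manifests; you'd create it with kubectl apply -f deployment.yaml.

```yaml
# Hypothetical manifest: a Deployment keeping 3 replicas of the ML service
# alive, plus a ClusterIP Service load-balancing requests across the pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-service
spec:
  replicas: 3                  # survives one host going down; rolling updates built in
  selector:
    matchLabels:
      app: ml-service
  template:
    metadata:
      labels:
        app: ml-service
    spec:
      containers:
        - name: ml-service
          image: someuser/ml-service:latest   # placeholder image name
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service                  # type defaults to ClusterIP: the LB for the pods
metadata:
  name: ml-service
spec:
  selector:
    app: ml-service
  ports:
    - port: 80
      targetPort: 8000
```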
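And for point 12, a minimal sketch of the serving side of that flow, where the image reads the model URI from an env variable and pulls the model from the registry at startup. MODEL_URI and the /predict route are assumed names for illustration, not box-key's actual code:

```python
# Sketch of point 12: the container gets the registry URI via an env var
# (e.g. MODEL_URI="models:/my-classifier/Production") and loads the model
# at startup, so one image can serve any registered model version.
import os

import mlflow.pyfunc
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# MLFLOW_TRACKING_URI should point at the MLflow server; MODEL_URI is baked
# into the image (or set at deploy time) as in point 12.
model = mlflow.pyfunc.load_model(os.environ["MODEL_URI"])

@app.route("/predict", methods=["POST"])
def predict():
    # pyfunc models accept a pandas DataFrame built from the JSON payload
    features = pd.DataFrame(request.get_json())
    return jsonify(prediction=model.predict(features).tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```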

Welp, it's a lot more Kubernetes-related than I expected. I thought it'd be more of a practical MLflow-Kubernetes hands-on. Anyway, the question on CIDRs and how Kubernetes networking works led me to this AWS page about VPCs. It's worth a read to better understand how the IP addressing components interact.