canonical / bundle-kubeflow

Charmed Kubeflow
Apache License 2.0

Charmed Kubeflow bootstrap and runaway disk access on Ubuntu (20.04-22.04) #619

Open millerhooks opened 1 year ago

millerhooks commented 1 year ago

Background

First off, I'd like to say that I am not a particularly smart man. Most of what I'm talking about here is pretty far outside my normal day to day. It is one of the more interesting computer problems I've run across, one that has plagued me and that is clearly not me just misreading some documentation (oh sweet lord, would I love it if I was just missing a step). I am claiming to be an expert at stupidly setting up Kubeflow on computers that may not be capable of running it, and I am claiming to be an expert in distributed GPU-driven super/hyper computing... but my knowledge of the Linux kernel is spotty at best.

This is a problem I've been trying to get my head around for almost a year. I poorly documented a solution in the wrong issue for ml-pipelines. I hope that got moved or moderated. I did notice that the issue got added to the Charmed Kubeflow documentation, at first as a hidden tip, and a week or so ago it was bumped up to be not hidden.

I gather that not many people are messing with Charmed Kubeflow. I think most people using it have internal virtualization setups like vSphere or are deploying to the cloud with something like Arrikto. I have been training people to run distributed ML/AI jobs, with a specific focus on GPUs and doing it with open source tools from first principles, for close to 8 years now. I worked at NVIDIA on RAPIDS AI. I've done a lot of consulting. I've developed internal training programs and devops tools.

What I like about Kubeflow is that it has everything plus the kitchen sink. A lot of people consider it overkill, but those people haven't had to deal with multi-cloud to the edge and data provenance. It's become kind of my litmus test: if you can't run Kubeflow, we don't have much to talk about. I have been getting people who have no business running Kubeflow to bootstrap it on consumer hardware for two years. I have run into this problem on everything from high-end servers that cost as much as a new commuter car to, over and over again, a melting-pot mishmash of consumer hardware. THIS PROBLEM EXISTS IN EVERY DEPLOYMENT.

The Problem

What I'm talking about is this.

sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360

Linked in the documentation here: https://charmed-kubeflow.io/docs/get-started-with-charmed-kubeflow#heading--crash-loop-backoff

This is such a strange problem and I need some help figuring out how to solve it. For one, I think the current solution (which I vultured from the internet) is incomplete... but also this solution is 100% cargo cult BS. It doesn't matter what you set the numbers to: at startup, as it's creating the hostpath-storage connections, it will saturate whatever the number is. Also, once things are set up, that saturated disk connection goes back down to a very normal baseline. I have no idea how to monitor these things. I don't know what thing is causing this issue. I think maybe this can be avoided by not using hostpath-storage and using an NFS provisioner, but then that breaks having a self-contained "Charmed Kubeflow" bootstrap.
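On the monitoring question: the two sysctls above cap inotify instances and watches per user, and the kernel exposes current usage under /proc, so it can at least be measured. What follows is a minimal sketch, not a finished tool. It assumes bash and a Linux /proc layout, the function name inotify_usage is made up for illustration, and it has to run as root to see processes owned by other users.

```shell
#!/usr/bin/env bash
# Sketch: list which processes hold inotify instances and watches.
# inotify_usage is a hypothetical helper name, not part of any package.

# Print "<instances> <watches> <pid> <command>" for every process under
# the given proc root (default /proc) that holds an inotify instance.
inotify_usage() {
    local proc_root="${1:-/proc}"
    local pid_dir fd pid instances watches n comm
    for pid_dir in "$proc_root"/[0-9]*; do
        [ -d "$pid_dir/fd" ] || continue
        instances=0
        watches=0
        for fd in "$pid_dir"/fd/*; do
            # An inotify instance is a descriptor linked to anon_inode:inotify.
            [ "$(readlink "$fd" 2>/dev/null)" = "anon_inode:inotify" ] || continue
            instances=$((instances + 1))
            # Each watch is one line starting with "inotify" in fdinfo.
            n=$(grep -c '^inotify' "$pid_dir/fdinfo/${fd##*/}" 2>/dev/null)
            watches=$((watches + ${n:-0}))
        done
        [ "$instances" -gt 0 ] || continue
        pid="${pid_dir##*/}"
        comm=$(cat "$pid_dir/comm" 2>/dev/null)
        echo "$instances $watches $pid ${comm:-?}"
    done
}

# Example (as root), biggest consumers first:
#   inotify_usage | sort -rn | head -n 20
```

Running this while the bootstrap is failing should show whether a handful of processes are eating the instances or whether it is many small consumers, which would point at different fixes.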

Also, with this solution, when (not if) you restart the computer, the values are not kept. You have to save the changes to /etc/sysctl.conf. There are also more useful variables. This problem is seemingly addressed, for slightly different reasons, in some IBM documentation. At this exact moment I can't find my link to it, but I will edit and respond later with that information. It was a big Ah-Ha moment for me. https://www.ibm.com/docs/en/tncm-p/1.3.1?topic=environment-running-prerequisite-script (I found it by typing IBM into the search bar. It was the only result. That's how often I end up on IBM's docs for a solution. Big kudos to IBM)
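One way to make the values stick across reboots is a drop-in file under /etc/sysctl.d, which Ubuntu reads at boot alongside /etc/sysctl.conf. This is a sketch under assumptions: the helper name write_inotify_conf and the file name 99-inotify.conf are invented here, and the defaults are just the numbers quoted earlier in this issue, not recommended values.

```shell
#!/usr/bin/env bash
# Sketch: write a sysctl drop-in so the inotify limits survive a reboot.
# write_inotify_conf and the 99-inotify.conf file name are illustrative.

write_inotify_conf() {
    local target="${1:-/etc/sysctl.d/99-inotify.conf}"
    local instances="${2:-1280}"
    local watches="${3:-655360}"
    # One "key = value" line per setting, the format sysctl expects.
    printf 'fs.inotify.max_user_instances = %s\nfs.inotify.max_user_watches = %s\n' \
        "$instances" "$watches" > "$target"
}

# Example (as root): write the file, then reload all sysctl config files.
#   write_inotify_conf && sysctl --system
```

Keeping the settings in their own file makes them easy to remove again once the real cause is found.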

I cannot find any information about what happens when you raise those numbers too high. I've raised them by orders of magnitude just to see what breaks, and even at astronomical numbers I only have barely correlative evidence of an important effect. I suspect that is because these settings aren't really geared for this new style of computing and are kind of a relic of an older time, but also the Linux kernel needs to support so many systems and this is such an outlier. Anyway, I think handling this correctly will be SOP for distributed HPC systems going forward. I just want it to go away so I can get back to work.

What I Think Is Going On

Really, really understand that I'm pulling all of this out of my own ass. I would prefer someone with more domain knowledge solve this problem, but it seems like I'm on my way.... So any assists y'all have, I'd much appreciate.

Disks. This problem changes depending on the disk. This problem looks a lot like the issues you get when etcd can't communicate properly in a distributed cluster or multi-cluster environment. This problem can be invisible. This problem can become background noise as you solve it by accident, and later bite you in the butt when you don't understand why, after a restart, your precious pet no longer feels like playing.

The most difficult Kubeflow setup I ever had this problem with was a system I never saw in person. It was a USB-C SATA drive. I don't know the provenance of the enclosure. It was a single-GPU gaming PC with 64 GB of RAM and an i7 of some sort. It dual booted Windows. I could never get that thing to run. I think I could now, though, but I don't have access to that PC and don't care.

NVMe drives seem to work the best. Especially the Ironwolf ones I have in my best gear. Generally speaking I need to bump the numbers once after the bootstrap starts failing and it completes and then even with a restart it's never a problem.

Standard SATA SSDs, though, hooked straight through SATA, need the numbers bumped multiple times at bootstrap.

Proposed Solution

I have multiple machines to profile and can for sure recreate this at a whim. Not that I need to make a special errand for it: my girlfriend is currently working through some courseware I'm putting together for Charmed Kubeflow on a dual processor, dual GPU, 140g RAM System76 Bonobo, and we are running into the problem AGAIN. And it's on an NVMe. I'll solve it with some fiddling, but this problem, once fully identified, can be an automated part of the bootstrap and increase the adoption of Kubeflow considerably.

It only really clicked for me last week that this was a storage bottleneck issue that was actually solvable. So forgive me that I'm not coming at this with some easy stats and a fully baked solution. What would that solution look like? I've got a little half-baked script to benchmark the disk. Then, as the connections increase when Kubeflow's juju install kicks off, it can maybe monitor a socket or something that will report disk connections.
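A rough starting point for the monitoring half of such a script could be a sampler that logs inotify instance counts over time while the juju install runs. This is only a sketch: both function names and the CSV layout are assumptions, it relies on GNU find's -lname test, and it should run as root. Note that it counts instances system-wide, while max_user_instances is enforced per user, so the logged count is an upper bound on what any single user holds.

```shell
#!/usr/bin/env bash
# Sketch: log inotify instance counts over time during a bootstrap.
# Function names and the CSV layout are illustrative assumptions.

# Count open inotify instances under a proc root (default /proc).
count_inotify_instances() {
    local proc_root="${1:-/proc}"
    find "$proc_root"/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
        | wc -l | tr -d ' '
}

# Print "epoch_seconds,instances,max_user_instances" once per sample.
# Arguments: interval in seconds (default 5), number of samples (default 12).
sample_inotify() {
    local interval="${1:-5}" samples="${2:-12}" i limit
    for ((i = 0; i < samples; i++)); do
        limit=$(cat /proc/sys/fs/inotify/max_user_instances 2>/dev/null)
        echo "$(date +%s),$(count_inotify_instances),${limit:-NA}"
        # Sleep between samples, but not after the last one.
        if [ $((i + 1)) -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Example (as root): one sample per second for ten minutes.
#   sample_inotify 1 600 > inotify-bootstrap.csv
```

Plotting the resulting CSV against the moments the bootstrap stalls would show whether the count really climbs to whatever limit is set, or whether something else is going on.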

Parting Note

I currently have a lab with 5 machines running Kubeflow. I've got a couple dozen extra nodes spread all around the country and focus mostly on edge compute, ML/AI, data provenance and secure publishing. I can and am testing this problem over and over. And I will solve it, because I can't train and onboard people until this bump is smoothed out. To anyone who has any knowledge to help me solve this problem: many, many thanks in advance.

I will be following this issue closely for any feedback, seriously. Three years ago it took me a year to get two untrained people to run ML/AI jobs in Kubeflow. At the start of this year I jumped into a new cast of characters. It took me 3 months. I think I've got it nailed down, because my outing for the month of June took four 1.5-hour sessions before I not only had the student running Kubeflow, but asking questions I didn't have answers to. 3 of those 4 sessions were all because of this issue.

I am collecting all the parts of my investigation so I can share it and will be adding to this issue thread over the course of this week. Realistically, it'll take me a few days to get all the stuff compiled here in a way that isn't totally obnoxious.


This is probably a faux pas to do in an issue, but whatever. It's not like anyone will read this. I'm doing some workshops over the summer. If you'd like to get in on them, fill out this thing. https://forms.gle/hCwZF8suoWbzCMXaA I have a thing launching this summer. DTMF.com and 0xf.ai... maybe 0110.wtf. Who knows! Ocelots.net? KingZero? When you have autoDNS, auto Let's Encrypt, mTLS, Kubeflow, a private container registry, a private git registry, and whatever, you get a solid mesh network, and the ways the internet can get are super weird. I wanna show y'all.

DnPlas commented 1 year ago

Thanks for reporting this @millerhooks, and thanks for the ideas and for providing all this insight. Ideally we wouldn't have to provide such a workaround in our documentation, and we'll definitely allocate some time to investigate further. The tricky part, as you mentioned, is being able to benchmark and spot the actual issue and where it comes from. We shall provide some updates soon.