Closed hivanco closed 3 years ago
I believe I'm on track to have this functional by Wednesday night. What I've currently checked in is already the bare minimum I think, where "training" the model is actually just taking the average of the last few data points to make a prediction, so if all else fails we have that. @t-lin how much time would you need to run and document your experiments? Or, I guess really what I'm wondering is if I continue improving this till Thursday or Friday would you still have enough time at that point?
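For reference, the fallback "training" described above — predicting the next value as the mean of the last few data points — can be sketched in a few lines (class and method names here are illustrative, not the actual checked-in code):

```python
from collections import deque

class MovingAveragePredictor:
    """Fallback predictor: the next CPU-usage value is simply the
    mean of the most recent `window` observations."""

    def __init__(self, window=5):
        self.history = deque(maxlen=window)

    def observe(self, value):
        # "Training" is just remembering the last few data points.
        self.history.append(value)

    def predict(self):
        if not self.history:
            return 0.0
        return sum(self.history) / len(self.history)
```

So feeding it 10, 20, 30, 40 with `window=3` keeps only the last three points and predicts 30.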
That said, I did just have an idea where I could easily re-use everything I've done so far but make things much more interesting: instead of streaming CPU usage, we could pretend our edge devices are cameras and stream images. There are tons of ML datasets with thousands or millions of images; I'd just replace streaming CPU metrics with streaming randomly selected images. Images should take up more bandwidth on the network. The "data aggregator/processing service" I mentioned above could do some actually useful data cleaning and processing tasks (e.g. images from different cameras are different sizes, so it can crop images before sending them to the ML training service). ML training on images will be a heavier computation and, I'd say, sounds more interesting. I think I could swap CPU with images for Wednesday night. The data cleaning stuff, well, that depends.
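As a rough illustration of the kind of cleaning step the aggregator could do — normalizing differently sized camera images by center-cropping them to one common size — here's a pure-Python sketch on nested lists (a real version would use PIL or OpenCV; this is just the idea):

```python
def center_crop(image, out_h, out_w):
    """Center-crop a 2D image (list of rows) to out_h x out_w.

    Images from different cameras arrive in different sizes; cropping
    them to one common size before forwarding keeps the downstream
    ML training service's input shape fixed.
    """
    h, w = len(image), len(image[0])
    if out_h > h or out_w > w:
        raise ValueError("crop size larger than image")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return [row[left:left + out_w] for row in image[top:top + out_h]]
```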
@hivanco It doesn't have to be overly complicated, so don't worry about improving models too much. I think by Thursday/early Friday should be okay... I just finished a preliminary copy of my magazine paper and I'll switch track to this project tomorrow (today). I'll probably spend Wednesday cutting down the existing paper and editing it. If the images thing won't take too long, and you know how to set it up, then go for it.
Do you have a setup to simulate the streaming data and aggregation? Or is that still a todo item?
Yeah, data streaming and aggregation are done.
Also have a modified version of that Fibonacci bubble sorting program for stressing the CPU which I'm using to make the measured CPU utilization look more interesting. I'm not sure how realistic it is, but if you look at cpu-usage/predictor/raw_cpu_usage.txt you can see an example of the CPU data it results in.
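The stressor itself isn't shown here, but the idea — burn CPU in randomly sized bursts of pointless work (Fibonacci, bubble sort), then idle briefly — can be sketched like this. This is a guess at the shape of that program, not the actual code:

```python
import random
import time

def fib(n):
    """Deliberately slow recursive Fibonacci, purely to burn CPU."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def bubble_sort(values):
    """O(n^2) bubble sort, also purely for generating load."""
    values = list(values)
    for i in range(len(values)):
        for j in range(len(values) - i - 1):
            if values[j] > values[j + 1]:
                values[j], values[j + 1] = values[j + 1], values[j]
    return values

def stress_loop():
    """Alternate random-sized work bursts with short idle gaps, which
    produces a bursty, roughly cyclic CPU-utilization trace."""
    while True:
        fib(random.randint(20, 26))                    # CPU burst
        bubble_sort(random.sample(range(5000), 2000))  # another burst
        time.sleep(random.uniform(1, 5))               # idle gap
```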
It seems to produce generally cyclic load... so I'd say it can be a good stand-in for some type of periodic computation task, though the periodicity does seem to vary between 5-10s. Here's the first 5 mins plotted:
More or less done, but having problems testing the predictor because the image is so large (thanks PyTorch). It's failing somewhere in the registry-cli build flow; it looks like the call that saves a docker image and gets its hash is running out of memory. Might have to hack it for now to bypass the hashing and use a hardcoded ID. Will test properly tomorrow. The individual components work when tested with curl.
I assume you're referring to the predictor? If so, then yeah a fake ID should work for now, and it can probably be manually run.
Right now my planned deployment setup (I have to finish this) is relatively static, except for the aggregator, which gets deleted/created (i.e. the "bus" that moves out of range, then comes back into range).
@hivanco I'm working on a setup right now... Just to verify, the proxy for the sensors should be run in client mode right? Not in service mode like in the hello-world example?
It'll work in either mode. Of course, there's no reason for the sensor to expect an incoming request. However, I did make it start its own HTTP server and respond to all requests with "OK", so you can query whether it's alive, or if you want to use the system to automatically place sensors.
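A minimal version of that always-"OK" liveness endpoint, using only the standard library (the actual sensor code may differ in the details):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class OkHandler(BaseHTTPRequestHandler):
    """Reply "OK" to every request so callers can probe liveness."""

    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    do_POST = do_GET  # any method gets the same answer

    def log_message(self, fmt, *args):
        pass  # keep the sensor's stdout quiet

def make_server(port=0):
    """Bind on the given port (0 = pick a free one) without serving yet."""
    return HTTPServer(("0.0.0.0", port), OkHandler)

# To run standalone: make_server(8080).serve_forever()
```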
I did add a --content-id
Kind of hard to test because everything dies within 60 seconds
Michael should be working on a fix for that right about now. You can check out an earlier version before the 60s thing and use that proxy & allocator (that's what I did).
I saw the README, thanks for that. I'm almost ready to test... I'm setting up a deployment with 100 sensors (could be more). Will let you know if I run into any issues or have questions.
Jesus the predictor is a huge image lol, I thought it was frozen... nope, just downloading.
What's the cause for the huge size?
I blame PyTorch, the Python ML library. You can find the images I've built here: https://hub.docker.com/u/hivanco. Says the compressed size is 1GB lol.
I think I've updated the README with all the info you need to run everything. I haven't actually tested with the old allocator yet; so far I've only seen that after starting a sensor, the aggregator and predictor come up automatically, and then die. The new proxies should still work with the old allocator, so you shouldn't need to supply the old proxy when building the services. @michaelweiyuzhao what's the newest allocator that would work with the latest proxy, but not have the bug?
Any version before 7f9cd9ee8bc8fb9881b9fcaa0ddbc195b5686c0c doesn't have the change and should work; it's just that without the change your services won't be culled.
@hivanco Do you have time for a quick zoom chat?
Yeah, just eating dinner, maybe 7:45?
That's fine, no rush. E-mail me when you're done.
Some updates:
Closing this issue. The app pipeline is good, and it allowed us to do some interesting experiments to emulate mobile edge environment.
Addressing reviewer critiques such as, "This paper lacks performance evaluations for more realistic applications that requires service discovery and allocation on dynamic edge networks, such as i) application that requires (distributed) big data processing, ii) real-time steaming service, and iii) service for heavy computation with specific processing power."
Planning to deploy edge "sensors", which stream CPU utilization data to a data aggregator/processing service, which in turn sends that data to an app doing machine learning training on it to predict future CPU utilization. Time permitting, we could do things like create one app for ML training and another just for inference, or process other data like memory utilization. We're currently not too concerned with how accurate the machine learning model is or what exactly it predicts; the point is that we have an app requiring heavy processing power. But we could invest time into making it accurate and useful.
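The aggregation step in this plan can be as simple as windowed averaging — batch the raw per-sensor samples and forward one mean per window to the training app. A sketch (the window size and function name are made up here, not taken from the repo):

```python
def aggregate(samples, window=10):
    """Collapse a stream of raw CPU-utilization samples into one
    mean value per fixed-size window, dropping any partial tail.

    This is the kind of lightweight processing the aggregator service
    can do before shipping data on to the ML training service.
    """
    means = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        means.append(sum(chunk) / window)
    return means
```

For example, twenty samples with `window=10` become two forwarded values, cutting bandwidth to the trainer by 10x.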