PhysarumSM / demos

Example services that interact with the PhysarumSM system
Apache License 2.0

More "realistic" application #2

Closed: hivanco closed this issue 3 years ago

hivanco commented 3 years ago

Addressing reviewer critiques such as, "This paper lacks performance evaluations for more realistic applications that require service discovery and allocation on dynamic edge networks, such as i) applications that require (distributed) big data processing, ii) real-time streaming services, and iii) services for heavy computation with specific processing power."

Planning to deploy edge "sensors" that stream CPU utilization data to a data aggregation/processing service, which forwards that data to an app doing machine learning training to predict future CPU utilization. Time permitting, we could split this into one app for ML training and another just for inference, or process other data such as memory utilization. For now I'm not too concerned with how accurate the machine learning model is or what exactly it predicts; the point is that we have an app requiring heavy processing power. We could invest time later into making it accurate and useful.
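
As a rough sketch of the planned sensor loop (AGGREGATOR_URL and readCPUUtil below are hypothetical stand-ins; the real demo would discover the aggregator through the PhysarumSM proxy):

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

// readCPUUtil is a stand-in for however the sensor samples CPU usage
// (e.g. parsing /proc/stat); it is NOT the demo's actual implementation.
func readCPUUtil() float64 {
	return 0.0 // placeholder sample
}

func main() {
	// AGGREGATOR_URL is a hypothetical stand-in for service discovery.
	aggregator := os.Getenv("AGGREGATOR_URL")
	for {
		// Stream one CPU utilization sample to the aggregator.
		body := bytes.NewBufferString(fmt.Sprintf("%f", readCPUUtil()))
		resp, err := http.Post(aggregator, "text/plain", body)
		if err != nil {
			log.Printf("post failed: %v", err)
		} else {
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```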

hivanco commented 3 years ago

I believe I'm on track to have this functional by Wednesday night. What I've currently checked in is already the bare minimum, I think: "training" the model is actually just taking the average of the last few data points to make a prediction, so if all else fails we have that. @t-lin how much time would you need to run and document your experiments? Or, really, what I'm wondering is: if I continue improving this until Thursday or Friday, would you still have enough time at that point?
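
For reference, the fallback "training" described above amounts to a moving average. A minimal sketch, with illustrative names (windowSize, predictNext) rather than the demo's actual API:

```go
package main

import "fmt"

const windowSize = 5 // illustrative window length

// predictNext returns the average of up to windowSize trailing samples,
// which serves as the prediction for the next data point.
func predictNext(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	start := len(samples) - windowSize
	if start < 0 {
		start = 0
	}
	sum := 0.0
	for _, s := range samples[start:] {
		sum += s
	}
	return sum / float64(len(samples)-start)
}

func main() {
	history := []float64{12.1, 15.3, 40.2, 38.7, 22.4, 19.9}
	fmt.Printf("predicted next CPU util: %.2f\n", predictNext(history))
}
```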

That said, I did just have an idea where I could easily re-use everything I've done so far but make things much more interesting: instead of streaming CPU usage, we could pretend our edge devices are cameras and stream images. There are tons of ML datasets with thousands or millions of images; I'd just replace streaming CPU metrics with streaming randomly selected images. Images should take up more bandwidth on the network. The "data aggregator/processing service" I mentioned above could do some actually useful data cleaning and processing tasks (e.g. images from different cameras are different sizes, so it could crop images before sending them to the ML training service). ML training for images would be a heavier computation, and I'd say it sounds more interesting. I think I could swap CPU metrics for images by Wednesday night. The data cleaning stuff, well, that depends.
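
If the image-cleaning idea goes ahead, the cropping step could be sketched like this, assuming JPEG input and an illustrative cropTo helper (neither is part of the demo yet):

```go
package main

import (
	"image"
	"image/jpeg"
	"os"
)

// cropTo returns the top-left w x h region of img, clamped to its bounds.
func cropTo(img image.Image, w, h int) image.Image {
	b := img.Bounds()
	r := image.Rect(b.Min.X, b.Min.Y, b.Min.X+w, b.Min.Y+h).Intersect(b)
	// Most decoded image types (e.g. *image.YCbCr for JPEG) implement SubImage.
	sub, ok := img.(interface {
		SubImage(image.Rectangle) image.Image
	})
	if !ok {
		return img // type without SubImage; pass through unchanged
	}
	return sub.SubImage(r)
}

func main() {
	f, err := os.Open("camera.jpg") // hypothetical input file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	img, err := jpeg.Decode(f)
	if err != nil {
		panic(err)
	}
	out, err := os.Create("cropped.jpg")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	jpeg.Encode(out, cropTo(img, 224, 224), nil) // normalize to a fixed size
}
```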

t-lin commented 3 years ago

@hivanco It doesn't have to be overly complicated, so don't worry about improving the models too much. I think Thursday/early Friday should be okay... I just finished a preliminary copy of my magazine paper and I'll switch tracks to this project tomorrow (today). I'll probably spend Wednesday cutting down the existing paper and editing it. If the image thing won't take too long, and you know how to set it up, then go for it.

Do you have a setup to simulate the streaming data and aggregation? Or is that still a todo item?

hivanco commented 3 years ago

Yeah, data streaming and aggregation are done.

I also have a modified version of that Fibonacci bubble sorting program for stressing the CPU, which I'm using to make the measured CPU utilization look more interesting. I'm not sure how realistic it is, but if you look at cpu-usage/predictor/raw_cpu_usage.txt you can see an example of the CPU data it produces.
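
For reference, a stressor along these lines can be sketched in a few lines of Go; the fib(30) workload and 5s burn/idle durations below are illustrative, not the demo's actual parameters:

```go
package main

import "time"

// fib is a deliberately naive recursive Fibonacci used as busy work.
func fib(n int) int {
	if n < 2 {
		return n
	}
	return fib(n-1) + fib(n-2)
}

func main() {
	for {
		// Burn CPU for a few seconds...
		deadline := time.Now().Add(5 * time.Second)
		for time.Now().Before(deadline) {
			_ = fib(30)
		}
		// ...then idle, producing a roughly cyclic utilization curve.
		time.Sleep(5 * time.Second)
	}
}
```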

t-lin commented 3 years ago

It seems to be a generally cyclic task... so I'd say it can be a good stand-in for some type of periodic computation task, though the periodicity does seem to vary between 5-10s. Here's the first 5 mins plotted:

[plot of the first 5 minutes of measured CPU utilization]

hivanco commented 3 years ago

More or less done, but I'm having problems testing the predictor because the image is so large (thanks, PyTorch). It's failing somewhere in the registry-cli build flow; it looks like it runs out of memory when it calls the function to save a docker image and get its hash. I might have to hack it for now to bypass the hashing and use a hardcoded ID. Will test properly tomorrow. The individual components work when tested by curling them.
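
For what it's worth, one way a hashing step like this could avoid holding the whole image in memory is to stream the save output through a hasher. A sketch against the Docker Go client, not the actual registry-cli code (the image ref hivanco/predictor is assumed for illustration):

```go
package main

import (
	"context"
	"crypto/sha256"
	"fmt"
	"io"

	"github.com/docker/docker/client"
)

// hashImage streams the equivalent of `docker save` output through
// SHA-256 instead of buffering the whole (multi-GB) image tarball.
func hashImage(ctx context.Context, cli *client.Client, ref string) (string, error) {
	rdr, err := cli.ImageSave(ctx, []string{ref})
	if err != nil {
		return "", err
	}
	defer rdr.Close()
	h := sha256.New()
	if _, err := io.Copy(h, rdr); err != nil { // constant memory, streaming copy
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}
	sum, err := hashImage(context.Background(), cli, "hivanco/predictor")
	if err != nil {
		panic(err)
	}
	fmt.Println("image hash:", sum)
}
```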

t-lin commented 3 years ago

I assume you're referring to the predictor? If so, then yeah, a fake ID should work for now, and it can probably be run manually.

Right now my planned deployment setup (I have to finish this) is relatively static, except for the aggregator, which gets deleted/created (i.e. the "bus" that moves out of range and comes back into range).

t-lin commented 3 years ago

@hivanco I'm working on a setup right now... Just to verify, the proxy for the sensors should be run in client mode right? Not in service mode like in the hello-world example?

hivanco commented 3 years ago

It'll work in either mode. Of course, there's no reason for the sensor to expect an incoming request. However, I did make it start its own HTTP server and respond to all requests with "OK", so you can query whether it's alive, or use the system to automatically place sensors.
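
For reference, the always-"OK" liveness server amounts to something like this minimal sketch (the port is arbitrary):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Answer every request, on any path, with "OK" so the sensor can be
	// probed for liveness.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "OK")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```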

hivanco commented 3 years ago

I did add a --content-id option to registry-cli add, successfully built and pushed the predictor, and saw it allocate automatically. Kind of hard to test because everything dies within 60 seconds, but I am testing it right now and updating the README with instructions on how everything works. If it's working properly, once you start the sensors, everything else should happen automatically.

t-lin commented 3 years ago

Kind of hard to test because everything dies within 60 seconds

Michael should be working on a fix for that right about now. You can check out an earlier version before the 60s thing and use that proxy & allocator (that's what I did).

I saw the README, thanks for that. I'm almost ready to test... I'm setting up a deployment with 100 sensors (could be more). Will let you know if I run into any issues or have questions.

t-lin commented 3 years ago

Jesus the predictor is a huge image lol, I thought it was frozen... nope, just downloading.

What's the cause for the huge size?

hivanco commented 3 years ago

I blame PyTorch, the Python ML library. You can find the images I've built here: https://hub.docker.com/u/hivanco. It says the compressed size is 1GB lol.

hivanco commented 3 years ago

I think I've updated the README with all the info you need to run everything. I haven't actually tested with the old allocator yet, so I've only seen that after starting a sensor, an aggregator and a predictor come up automatically, and then die. The new proxies should still work with the old allocator, so we shouldn't need to supply the old proxy when building the services. @michaelweiyuzhao what's the newest allocator that would work with the latest proxy but not have the bug?

mwyzhao commented 3 years ago

Any version before 7f9cd9ee8bc8fb9881b9fcaa0ddbc195b5686c0c doesn't have the change and should work; it's just that without the change your services won't be culled.

t-lin commented 3 years ago

@hivanco Do you have time for a quick zoom chat?

hivanco commented 3 years ago

Yeah, just eating dinner, maybe 7:45?

t-lin commented 3 years ago

Yeah, just eating dinner, maybe 7:45?

That's fine, no rush. E-mail me when you're done.

t-lin commented 3 years ago

Some updates:

  1. Paper got another two-week extension (seen on the submission portal). Not that I'm going to wait until then (I've got a thesis to start writing), but it means I can debug and run this experiment a bit more carefully.
  2. I've verified that when an allocator comes "back into range" (or online), the sensors flood it and tons of aggregators get created in a short amount of time. One of my nodes almost completely froze from lack of memory. I've modified the code a bit so the POST to the aggregator is part of the main loop, not in a goroutine (see the sketch after this list).
  3. I'm also facing another odd issue where the DHT can't find aggregators even after they boot up, so in the meantime the sensors (even if I use a single sensor) keep creating new aggregators. Not sure how to solve this one other than restarting the bootstrap and allocators every once in a while... it seems like an issue in the libp2p DHT implementation, or perhaps there are some optimization configurations we're missing.
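
Since point 2 describes a concrete code change, here is a minimal sketch of the idea, assuming hypothetical names (postSample, aggregatorURL); it is not the demo's actual sensor code:

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"time"
)

// postSample sends one sample to the aggregator and blocks until done.
func postSample(url string, sample []byte) error {
	resp, err := http.Post(url, "text/plain", bytes.NewReader(sample))
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	aggregatorURL := "http://aggregator:8080/data" // hypothetical
	for {
		sample := []byte("42.0")
		// Before: `go postSample(...)` could pile up unbounded goroutines
		// while the aggregator was unreachable, flooding it on reconnect.
		// Now the loop blocks on each POST, naturally rate-limiting requests.
		if err := postSample(aggregatorURL, sample); err != nil {
			log.Printf("post failed: %v", err)
		}
		time.Sleep(time.Second)
	}
}
```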

t-lin commented 3 years ago

Closing this issue. The app pipeline is good, and it allowed us to do some interesting experiments to emulate a mobile edge environment.