MusicConnectionMachine / Relationships


Azure Setup #18

Closed Henni closed 7 years ago

Henni commented 7 years ago

as currently discussed at https://github.com/MusicConnectionMachine/RelationshipsG4/issues/34

simonzachau commented 7 years ago

Currently, running it on virtual machines (classic) is fine for debugging/testing. By Monday we should be able to push big data through our chain, so scalability is necessary -> Web App on Linux (preview) for scaling out.

kordianbruck commented 7 years ago

So basically we have two options here:

  1. Set up an automated way to create static VMs and initialize them with the Dockerfile using cloud-init
  2. Use the auto-scaling feature of Azure

From what I gather, the relationships group will scale by having distinct subsets of workloads, meaning VM 1 gets WET files 1-100, VM 2 gets files 101-200, etc. Is my assumption correct? How does this distribution work for auto scaling? ENV vars?
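
For illustration, a minimal sketch of what ENV-var-based distribution could look like on the Node side (`WET_RANGE_START`/`WET_RANGE_END` are made-up names, not the project's actual configuration):

```js
// Hypothetical sketch -- each VM/instance gets its WET file subset via ENV
// vars, e.g. WET_RANGE_START=101 WET_RANGE_END=200 node worker.js
const start = parseInt(process.env.WET_RANGE_START || '1', 10);
const end = parseInt(process.env.WET_RANGE_END || '100', 10);

for (let i = start; i <= end; i++) {
  // a real worker would fetch and process WET file #i here
  console.log('this instance handles WET file #' + i);
}
```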

Second point of interest: some algorithms run faster than others; can both methods adapt to that?

Third point: is there a limit for auto scaling? (Yes, I think it's 20 - please confirm.)

Fourth point: we need an intermediate NodeJS service doing authentication before starting the algorithm.

ansjin commented 7 years ago

The problem I am facing here is that our algorithms use Java underneath, and I was not able to find a way to run them with Web App on Linux (preview). I tried using the Docker container image with it, but it does not seem to work. At https://docs.microsoft.com/en-us/azure/app-service-web/app-service-linux-intro they also haven't written anything about running a Java app.

But the same image works fine on a Linux VM. I have mailed them about this; let's see what they reply.

About the other points:

Currently our plan is to have the different algorithms running on different VMs and one VM for the main application (get the WET files, parse them, pass them to the algorithms, and store the results back to the DB).

I think the bottleneck of our application will not be the main application but the VMs on which the algorithms run. So we have to add an auto-scaling policy for those VMs based on CPU usage, e.g. if a VM's CPU usage rises above 80%, add one more instance. On top of these VMs we then need a load balancer, so that the main application has a single address to send data to, and the load balancer underneath takes care of distributing it to the associated VMs.
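
For illustration, a minimal sketch of the single-address idea from the main application's side (`ALGORITHMS_LB_HOST` and the `/process` path are assumptions, not the project's real names):

```js
// Hypothetical sketch -- the main app only knows the load balancer's address;
// the load balancer fans requests out to however many algorithm VMs exist.
const http = require('http');

function sendToAlgorithms(payload, callback) {
  const body = JSON.stringify(payload);
  const req = http.request({
    host: process.env.ALGORITHMS_LB_HOST, // e.g. the load balancer's DNS name
    port: 80,
    path: '/process',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body),
    },
  }, (res) => callback(null, res.statusCode));
  req.on('error', callback);
  req.end(body);
}
```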

Second point: currently the main application sends the data to all the algorithms and waits for their replies. It doesn't process the next query until it gets replies from all the algorithms within a particular time. @Sandr00 can you confirm this?

Third point: I am not sure which service we will be using. Currently it looks like a mix of both services, so we have to check the scaling features of both.

Fourth point: I am not sure about that. Do you mean something like a handshake mechanism between the main application and the algorithm applications?

Sandr0x00 commented 7 years ago

Yes, but we can add multiple calls easily. We use a queue and at the moment we do one call per algorithm at a time.

But we can "scale" that up. It's just config.
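
For illustration, a minimal sketch of the described fan-out-and-wait behaviour (all names and the timeout value are made up; the real chain uses a queue as described above):

```js
// Hypothetical sketch -- call every algorithm, wait for all replies, but give
// up on any single one after the "particular time" mentioned above.
const TIMEOUT_MS = 30000; // placeholder value

function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('algorithm timed out')), ms)),
  ]);
}

// algorithmCalls: one function per algorithm, each returning a Promise
function callAllAlgorithms(chunk, algorithmCalls) {
  return Promise.all(algorithmCalls.map((call) =>
    withTimeout(call(chunk), TIMEOUT_MS)
      .catch((err) => ({ error: err.message })) // one slow algorithm doesn't sink the batch
  ));
}
```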

simonzachau commented 7 years ago

The limit for scaling out Web App on Linux (preview) machines is 10 instances; it doesn't matter whether you turn on automatic scaling or set it manually. Regarding your question yesterday whether automatic scaling works: I tried to test it, but it's hard to see how many instances are actually running (there's only a number for the average over a set period of time). I also started an Azure performance test, which ran better with more instances. In case automatic scaling does not work, we can just manually scale out to the maximum (the mentioned 10 instances).

For scaling up I went with the S3 plan, so when scaling out to the maximum we have 10x S3. Since each algorithm can have its own app service (given that @ansjin is able to get former group3's algorithms to run on app services), we can give each app service the power its algorithm needs (e.g. the coreference algorithm might need 10 instances, while a cheaper one might be fine with fewer).

As far as I understood the Microsoft presentation, the instances are load balanced, so requests that we send to our algorithm services are automatically distributed. This is one of the advantages an app service has over manually configured VMs. Our main app is currently not scaled out.

Regarding former group4's Open IE algorithm: it works on an app service, but is currently public. I'm thinking of a way to implement a static API key via environment variables. Does that answer your question about ENV vars?
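
For illustration, a minimal sketch of such a static API key check as Express middleware (`API_KEY`, the header name, and the route are assumptions, not the actual implementation):

```js
// Hypothetical sketch -- the key lives in an environment variable on the app
// service, and every incoming request must present it.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  if (req.get('X-Api-Key') !== process.env.API_KEY) {
    return res.status(401).send('invalid api key');
  }
  next();
});

app.post('/openie', (req, res) => {
  res.send('ok'); // placeholder: the real handler would run the algorithm
});

app.listen(process.env.PORT || 3000);
```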

vviro commented 7 years ago

@simonzachau Do you have an estimate of the time needed to process one GB of WET files using the app service setup with 10 instances? Is processing the output of the unstructured data group feasible in a reasonable amount of time?

simonzachau commented 7 years ago

@vviro unfortunately, we don't have statistics yet. @Sandr00 worked on the WET files, but as far as I know he hasn't called our app service yet?

vviro commented 7 years ago

Having estimates is absolutely critical; otherwise we will most probably run aground and only be able to process maybe 0.1% of the data set - who knows? I'm worried that 10 containers are not nearly enough for what we are doing here. If the app service doesn't allow for more, we must go another route (the self-managed VMs).
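
For illustration, a back-of-envelope calculation of why such an estimate matters (every number below is a placeholder, to be replaced with measured values once the chain actually calls the app service):

```js
// Hypothetical sketch -- all values are assumptions, not measurements.
const secondsPerWetFile = 600; // ASSUMPTION: measure this first
const totalWetFiles = 80000;   // ASSUMPTION: size of the targeted crawl subset
const instances = 10;          // the app service limit discussed above

const etaDays = (secondsPerWetFile * totalWetFiles) / instances / 86400;
console.log('rough ETA: ' + etaDays.toFixed(1) + ' days');
// With these placeholder numbers: ~55.6 days -- which is exactly why
// real measurements are needed before committing to 10 instances.
```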

simonzachau commented 7 years ago

@Sandr00 is your chain able to send requests to former group4's Open IE app service? If not, please tell me what's missing so I can help you get this going. We need this in order to provide statistics.

simonzachau commented 7 years ago

Today: @Sandr00 and I tried to integrate former group4's Open IE into the callChain in order to get some statistics for the app service solution. Somehow, the service actually isn't called (at least no console.log in the algorithm is printed), although it works if we change algorithms/openie-stanford/app.js to start it up with a static text variable. Our effort can be found on the adapt-format-of-stanford-open-ie branch. @ansjin could you please have a look at how it differs from former group3's algorithms / if you have an idea what's missing?

ansjin commented 7 years ago

@simonzachau I will check and fix it up!

kordianbruck commented 7 years ago

@ansjin any progress?

Sandr0x00 commented 7 years ago

We had all algorithms running on Azure before it went down, and we had already sent data to them and got responses. As of now, we depend on G1 and G2 to fill the DB on Azure (when it's up again); then we will take their data, call our 5 machines with it, and push everything to the DB. I hope that doesn't cost too much money. To save money, we will not call our algorithms again before then. They are all tested and working.

ansjin commented 7 years ago

@kordianbruck

About the algorithms: all the algorithms were deployed on Azure (before the money ran out), with each algorithm on a single VM. This setup was tested locally with the data team 2 gave to @Sandr00. @Sandr00 was able to push the relationships and the date-events data to the DB locally.

The next step was to scale up those algorithms; @simonzachau and I were trying to do so, but the money ran out before we could really test the scaling.

For the time being we will not scale up the algorithms; we will just use a single VM for each algorithm. Hopefully that doesn't use up all the money.

ansjin commented 7 years ago

Currently running all the algorithms on Google Compute Engine.

Also, there is a limitation that we can have no more than 8 cores (or virtual CPUs) running at the same time, so we will not be able to test the scalability. We will shift back to Azure later, when it is up again.

ansjin commented 7 years ago

Small-Scale-Up testing: https://github.com/MusicConnectionMachine/Relationships/issues/59#issuecomment-293935295

kordianbruck commented 7 years ago

Looks good - how does Google's interface compare to Azure's? :stuck_out_tongue:

simonzachau commented 7 years ago

@kordianbruck it does what it should and doesn't need extra clicks to show you what's running / consuming money xD

ansjin commented 7 years ago

@kordianbruck The required things are easily available without as much hassle as in Azure, and there's the $300 of free money :D

kordianbruck commented 7 years ago

Yeeeeeeeea - I know - what a bummer. Come join the seminar again next semester; we will hopefully have the G-Sponsorship by then :grimacing:

Sandr0x00 commented 7 years ago

Can we close that? Or is there something left to do? @kordianbruck @ansjin @simonzachau

kordianbruck commented 7 years ago

No, I think you guys are done with this. Autoscaling is working, right?

simonzachau commented 7 years ago

With Kubernetes, yes. Done from my point of view. The only thing still in progress is the queue setup, I think.